[sword-devel] Unicode questions
DM Smith
dmsmith555 at yahoo.com
Wed May 7 03:59:48 MST 2008
On May 7, 2008, at 1:42 AM, Ben Morgan wrote:
> Hi,
>
> Just a few questions about unicode things.
> Are VerseKeys and TKs UTF-8?
> There seem to be a few problems with some of the modules (I may be
> wrong, but they don't appear correct to me with my limited knowledge
> of unicode)
> In LewisElem, a big proportion of keys don't seem to be valid utf-8
> after ABANTIADES, for e.g.
> 'ABCI\xc2\x80\x90DO'
The actual requirement for LD (lexicon/dictionary) modules is that the
keys are strictly ordered by their bytes. For Unicode, this results in
a collation by code point. For that collation to be consistently
meaningful, the UTF-8 needs to be normalized.
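
As a rough illustration of why normalization matters for a byte-wise
ordering, here is a minimal Python sketch. The choice of NFC is an
assumption on my part (the form is not stated above), and sort_key is
a hypothetical helper, not part of SWORD.

```python
import unicodedata


def sort_key(key: str) -> bytes:
    """Normalize a key (NFC assumed here), then encode it so that
    comparison is done on the raw UTF-8 bytes."""
    return unicodedata.normalize("NFC", key).encode("utf-8")


# Two spellings of the same visible key.
composed = "\u00c9COLE"     # E-acute as a single code point
decomposed = "E\u0301COLE"  # E followed by COMBINING ACUTE ACCENT

# Without normalization the byte strings differ, so they would sort
# (and look up) as two different keys.
raw_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
normalized_equal = sort_key(composed) == sort_key(decomposed)

# Ordering by UTF-8 bytes gives the same result as ordering by code
# point, which is what Python's default str comparison does.
keys = ["ZEBRA", "\u00c9COLE", "ABACUS", "\u1f08\u0393\u03a1\u0399\u039f\u03a3"]
by_bytes = sorted(keys, key=lambda k: k.encode("utf-8"))
by_code_points = sorted(keys)
```

Here raw_equal is False and normalized_equal is True, which is the
case normalization is meant to cover.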
Earlier this week I discovered that the SWORD engine will ensure that
the keys are appropriately ordered. The new tei2mod will normalize
the keys and data. Since it is new, there may be problems with it.
Please let us know.
If the module's conf states that the encoding is UTF-8, it is an error
for the keys and data to be something other than UTF-8. The new
tei2mod will detect whether an entry is UTF-8 or not. If it is not, it
will convert it to UTF-8.
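
The detect-then-convert step can be sketched in Python as below. This
is not tei2mod's code: ensure_utf8 is a hypothetical helper, and the
Latin-1 fallback is purely an assumption, since the encoding that
non-UTF-8 input is taken to be in is not stated here.

```python
def ensure_utf8(raw: bytes, fallback: str = "latin-1"):
    """Return (utf8_bytes, was_valid).

    If raw is already valid UTF-8 it is returned unchanged. Otherwise
    it is reinterpreted in the assumed fallback encoding and
    re-encoded as UTF-8.
    """
    try:
        raw.decode("utf-8")  # strict decoding raises on invalid input
        return raw, True
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8"), False


# The key quoted from LewisElem: 0xC2 0x80 is a valid two-byte
# sequence, but the 0x90 after it is a stray continuation byte.
bad_key = b"ABCI\xc2\x80\x90DO"
fixed_key, bad_was_valid = ensure_utf8(bad_key)

good_key, good_was_valid = ensure_utf8(b"ABANTIADES")
```

Note that a conversion like this only makes the bytes well-formed. If
the fallback guess is wrong, the resulting key is valid UTF-8 but
still not the text that was intended.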
>
>
> In esv.conf, it uses copyright symbol, but it isn't encoded in utf-8
This is an error. A conf should use the same encoding as its module.
In sections such as About that allow RTF, escape codes can also be
used for Unicode.
There are many such problems in the confs that we have.
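
For the escape-code route, a small Python sketch follows. It assumes
the general RTF convention of \uN (a signed 16-bit decimal value)
followed by a fallback character; rtf_escape is a hypothetical helper,
and how a particular front end renders the result has not been checked
here.

```python
def rtf_escape(text: str, fallback: str = "?") -> str:
    """Replace every non-ASCII character with an RTF-style \\uN escape.

    ASCII characters pass through untouched (backslashes and braces
    are not escaped by this sketch).
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        # \uN carries one UTF-16 code unit, written as a signed
        # 16-bit decimal number, so split the character into units.
        data = ch.encode("utf-16-le")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "little")
            if unit > 32767:
                unit -= 65536
            out.append("\\u%d%s" % (unit, fallback))
    return "".join(out)


about_line = rtf_escape("Copyright \u00a9 2001")
```

With this, about_line is pure ASCII, so it reads the same whatever
encoding the conf is declared to be in.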
>
>
> In Autenrieth, ΙΕΡΌΣ has a definition starting with
> ι<*&γτ;ερός, ἷρός:
> Is this really meant to be like this?
>
> In Autenrieth, the following entry appears twice: ἌΓΡΙΟΣ
> There also seem to be many other duplicates.
> I also sometimes get the error message:
> ERROR: no buffer to decompress!
Many dictionaries have duplicate keys with different data. The SWORD
engine can't handle this, so these need to be merged into a single
entry, or the SWORD engine needs to be modified to handle it.
I am surprised that you see these. I would have thought that the later
ones would have replaced the earlier ones in the idx file as the
module was being written.
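
The merge option could look something like the following Python
sketch, run over the entries before a module is built. The
(key, data) pair layout, the newline separator and the name
merge_duplicates are all assumptions for illustration, not anything
the import tools do today.

```python
def merge_duplicates(entries, separator="\n"):
    """Combine the data of entries that share a key.

    entries is an iterable of (key, data) pairs. The first occurrence
    of a key fixes its position; later data is appended to it.
    """
    merged = {}
    for key, data in entries:
        if key in merged:
            merged[key] = merged[key] + separator + data
        else:
            merged[key] = data
    return merged


entries = [
    ("\u1f0c\u0393\u03a1\u0399\u039f\u03a3", "first definition"),
    ("ABACUS", "a counting frame"),
    ("\u1f0c\u0393\u03a1\u0399\u039f\u03a3", "second definition"),
]

merged = merge_duplicates(entries)

# For comparison, "later replaces earlier" keeps only the last entry.
last_wins = dict(entries)
```

Merging keeps both definitions under the one key, whereas the
last-one-wins behaviour silently drops the first.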