[sword-devel] imp2ld and alphabetization
DM Smith
dmsmith555 at yahoo.com
Mon Oct 29 05:39:19 MST 2007
On Oct 29, 2007, at 12:49 AM, Chris Little wrote:
> DM Smith wrote:
>> I'm not sure if I am reading the Sword code correctly, but it appears
>> that it is sorting at a byte level and not a character level. That
>> isn't by code points.
>
> I'm pretty sure you're right about what Sword is actually doing, but I
> believe it's also codepoint order, just by the nature of UTF-8 itself. I
> could be wrong.
The comparison runs from left to right and stops at the first
difference, returning the difference of the two byte values.
If the first difference pits a character with a code point < 128
(one byte) against one with a code point >= 128 (two or more bytes),
it compares the entire code point of the first with just the lead
byte of the second. Since the lead byte of a multi-byte sequence is
always at least 0xC0, the single-byte character still sorts first.
When comparing a 2-byte with a 3-byte UTF-8 sequence, the result is
the same: the first bytes already differ, and they differ in code
point order. The first one begins with the bit pattern 110xxxxx and
the other with 1110xxxx.
The same goes for the other comparisons. For any two code points, the
UTF-8 representations carry the most significant ordering information
in the leading bytes.
So while it is not comparing code point values, the result is in code
point order.
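
To check the argument above, here is a minimal sketch in Python (not the Sword code itself). The function names and the sample characters are mine, chosen for illustration. It compares strings byte by byte over their UTF-8 encodings and confirms that the sign of the result matches a code point comparison. The one assumption that matters is that the bytes are compared as unsigned values; with a signed char type, bytes of 0x80 and above would be negative and sort ahead of ASCII.

```python
def utf8_byte_compare(a: str, b: str) -> int:
    """Compare byte by byte, returning the difference at the first mismatch."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    for p, q in zip(x, y):
        if p != q:
            return p - q      # bytes are unsigned ints (0..255) here
    return len(x) - len(y)    # equal so far: the shorter string sorts first


def codepoint_compare(a: str, b: str) -> int:
    """Compare by code point values, for reference. Returns -1, 0 or 1."""
    ca, cb = [ord(c) for c in a], [ord(c) for c in b]
    return (ca > cb) - (ca < cb)


def sign(n: int) -> int:
    return (n > 0) - (n < 0)


# One sample per UTF-8 length: 1, 2, 2, 3 and 4 bytes.
SAMPLES = ["z", "\u00e9", "\u05d0", "\u1f00", "\U00010900"]


def orders_agree(words) -> bool:
    """True if both comparisons give the same ordering for every pair."""
    return all(
        sign(utf8_byte_compare(a, b)) == codepoint_compare(a, b)
        for a in words
        for b in words
    )
```

Calling orders_agree(SAMPLES) exercises every pairing of 1-, 2-, 3- and 4-byte sequences, including the one-byte versus multi-byte case.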
>
>> One simple way for any application to provide this is to create a
>> Lucene index similar to what we do for a Bible for the dictionary
>
> I don't think mandating Lucene in order to access the contents of a
> module is a simple solution. We can't require Lucene without cutting off
> a number of supported platforms. For example, it is unreasonable to
> require Lucene on handheld platforms like PocketPC, and MacSword would be
> obligated to use Lucene just to read LD modules.
I agree. I was suggesting an application-level solution, just like
the one we provide for Bibles. It doesn't have to be compiled into
the application; if it isn't, the original behavior is the only one
available.
I'm not sure that a solution can be provided for all front-ends. My
PDA and my phone cannot show the characters anyway, as they don't have
appropriate fonts.
>
> We might be able to do a lexicon with the GenBook driver and just keep
> every entry at the same level. I don't know how badly this would hurt
> key lookup.
Troy did something like this with Heyschius. (not sure I spelled it
right) So we can test it.
>
>> There are some related problems to this:
>> A user may expect to be able to find a Hebrew word in a Hebrew
>> dictionary independent of the pointing of the word in the dictionary.
>> (i.e. a user may wish to search without specifying accents)
>
> It's possible to have multiple keys share a single entry. So pointed and
> unpointed keys can point to the same entry. We've done this
> experimentally with dictionaries in the past to permit lookup by a
> Strong's number or the lemma it represents.
That works but then all current front-ends would show two entries.
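
One application-level alternative that avoids the duplicate entries is to keep the module keys as they are and have the front-end build a separate lookup table from a stripped form of each key to the real key. The sketch below is only an illustration of that idea. The names strip_marks and build_search_index are hypothetical, and it assumes that dropping Unicode nonspacing marks (category Mn) after NFD normalization is an acceptable definition of "unpointed".

```python
import unicodedata
from collections import defaultdict


def strip_marks(word: str) -> str:
    """Drop nonspacing marks (Hebrew points, Greek accents) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def build_search_index(keys):
    """Map each stripped key to the list of real module keys it stands for."""
    index = defaultdict(list)
    for key in keys:
        index[strip_marks(key)].append(key)
    return index


# Example keys: a pointed Hebrew word and an accented Greek word.
POINTED = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
ACCENTED = "\u03c6\u1ff6\u03c2"

INDEX = build_search_index([POINTED, ACCENTED])
```

Stripping the user's input the same way before consulting INDEX lets a pointed or unpointed query reach the same entry, while the module itself still lists each entry once.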
>
>> A user may expect to find a word by stem not just by prefix.
>
> I'm not sure whether this is a sort order issue or lookup/search issue.
> Presumably a user would know the word they want and type it in with its
> prefix, even if it is sorted to group with other words sharing the same
> stem.
Maybe I am not using the right terminology. Let's say that "run" is
in the dictionary but "ran" is not because this dictionary only has
the base words and no grammatical variations. Now the user right
clicks on "ran" and chooses lookup and is brought to the nearest word
to "ran", perhaps "rabid". This is a simple case. It has been quite a
while since I studied other languages, but I seem to remember that
German changes the prefix of words when going to the past tense. And
in Greek, I seem to remember diacritic changes and suffix changes.
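
The "ran" case can be sketched as follows. This is an illustration only: it assumes a nearest-key lookup that lands on the first key not less than the query, and the key list and the STEMS table are made up for the example (a real table would have to come from the module or from a stemmer).

```python
import bisect

# Hypothetical dictionary keys, sorted as a key-ordered module would store them.
KEYS = sorted(["rabid", "rack", "rapid", "run"])

# Hypothetical table mapping grammatical variants to their base word.
STEMS = {"ran": "run", "running": "run"}


def nearest_key(keys, query):
    """Return the first key >= query, or the last key if there is none."""
    i = bisect.bisect_left(keys, query)
    return keys[min(i, len(keys) - 1)]


def lookup(keys, stems, query):
    """Reduce the query to its base word if known, then find the nearest key."""
    return nearest_key(keys, stems.get(query, query))
```

With these keys, nearest_key(KEYS, "ran") lands on an unrelated neighbor ("rapid"), while lookup(KEYS, STEMS, "ran") reaches "run".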
>
>> A user may expect to be able to type "photos" (a transliteration) and
>> find the real Greek word in a Greek dictionary.
>
> I'm willing to write these users off. We could transliterate back to
> Greek, but I don't think it's worth the effort or processor cycles. I
> don't believe that people who don't know how to read Greek use Greek
> lexicons other than as a novelty.
I was thinking of a different user altogether. For example, I use
Windows, Linux and Macs almost daily, and I do not want to learn each
OS's input system; I just want to find words by typing (as with Beta
Greek). It is not a matter of reading but of entry.
Another example: many of our dictionaries have transliterations for
their terms. If an application were to look up a word across all
dictionaries but didn't look it up by transliteration, the word would
not be found.
As for the processor cycles, the cost will be small. "Between the
keystrokes" as one of my friends used to say.
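
To give a sense of how little work is involved, here is a rough sketch of Beta Code style entry, letters only. The function name beta_to_greek is hypothetical, and the sketch deliberately ignores the Beta Code symbols for accents, breathings and capitals, so the result would still have to be matched against the dictionary keys with their diacritics disregarded.

```python
# Beta Code letter assignments (lowercase entry, letters only).
BETA_TO_GREEK = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "h": "η", "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "c": "ξ", "o": "ο", "p": "π", "r": "ρ", "t": "τ",
    "u": "υ", "f": "φ", "x": "χ", "y": "ψ", "w": "ω",
}


def beta_to_greek(text: str) -> str:
    """Convert typed Latin letters to Greek, picking final sigma as needed."""
    text = text.lower()
    out = []
    for i, ch in enumerate(text):
        if ch == "s":
            following = text[i + 1:i + 2]
            # Sigma is final when no letter follows it.
            out.append("σ" if following.isalpha() else "ς")
        else:
            # Characters without a mapping (spaces, digits) pass through.
            out.append(BETA_TO_GREEK.get(ch, ch))
    return "".join(out)
```

The conversion is a single pass over the typed characters with a table lookup per character, so it fits comfortably between the keystrokes.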