[sword-devel] imp2ld and alphabetization

Mon Oct 29 05:39:19 MST 2007

On Oct 29, 2007, at 12:49 AM, Chris Little wrote:

> DM Smith wrote:
>> I'm not sure if I am reading the Sword code correctly, but it appears
>> that it is sorting at a byte level and not a character level. That
>> isn't by code points.
>
> I'm pretty sure you're right about what Sword is actually doing, but I
> believe it's also codepoint order, just by the nature of UTF-8 itself.
> I could be wrong.

The comparison runs from left to right and stops at the first
difference, returning the difference of the byte values.

If the first difference pits a character with a code point < 128
(one byte) against one >= 128 (two or more bytes), it compares the
single byte of the first with just the lead byte of the second. That
lead byte always has its high bit set, so it is >= 128 and the
one-byte character still sorts first.

When comparing a 2-byte to a 3-byte UTF-8 sequence, the result is the
same: the lead bytes already differ, and they differ in code point
order. The first begins with the pattern 110xxxxx and the other with
1110xxxx. The same goes for other comparisons. For any two code points,
the UTF-8 representations carry the significant ordering information
in the high-order bytes.

So while not comparing code point values, it is in code point order.
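
To convince myself, here is a minimal sketch in Python (not Sword code).
byte_compare is a hypothetical stand-in for a memcmp-style comparison
that returns the difference of the first differing bytes, and the sample
keys are made up. It checks that byte order and code point order agree
for every pair.

```python
def byte_compare(a: bytes, b: bytes) -> int:
    """Compare left to right; return the difference of the first differing bytes."""
    for x, y in zip(a, b):
        if x != y:
            return x - y
    # One string is a prefix of the other: the shorter one sorts first.
    return len(a) - len(b)


def sign(n: int) -> int:
    return (n > 0) - (n < 0)


# Made-up sample keys mixing 1-, 2-, 3- and 4-byte UTF-8 sequences.
keys = ["z", "\u00e9", "\u05d0", "\u03c6\u03c9\u03c2", "\u20ac", "\U00010330", "a", "ab"]

for s in keys:
    for t in keys:
        by_bytes = sign(byte_compare(s.encode("utf-8"), t.encode("utf-8")))
        # Python compares str values code point by code point.
        by_code_points = (s > t) - (s < t)
        assert by_bytes == by_code_points

sorted_by_bytes = sorted(keys, key=lambda s: s.encode("utf-8"))
sorted_by_code_points = sorted(keys)
```

Both sorted lists come out identical, which matches the argument above.
Note that this says nothing about alphabetical (collation) order, only
about code point order.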

>
>> One simple way for any application to provide this is to create a
>> Lucene index similar to what we do for a Bible for the dictionary
>
> I don't think mandating Lucene in order to access the contents of a
> module is a simple solution. We can't require Lucene without cutting
> off a number of supported platforms. For example, it is unreasonable
> to require Lucene on handheld platforms like PocketPC, and MacSword
> would be obligated to use Lucene just to read LD modules.

I agree. I was suggesting an application-level solution, just like the
one we provide for Bibles. It doesn't have to be compiled into the
application; if it isn't, the original behavior is the only one
available.

I'm not sure that a solution can be provided for all front-ends. My  
PDA and my phone cannot show the characters anyway as they don't have  
appropriate fonts.

>
> We might be able to do a lexicon with the GenBook driver and just keep
> every entry at the same level. I don't know how badly this would hurt
> key lookup.

Troy did something like this with Heyschius. (not sure I spelled it  
right) So we can test it.

>
>> There are some related problems to this:
>> A user may expect to be able to find a Hebrew word in a Hebrew
>> dictionary independent of the pointing of the word in the dictionary.
>> (i.e. a user may wish to search without specifying accents)
>
> It's possible to have multiple keys share a single entry. So pointed
> and unpointed keys can point to the same entry. We've done this
> experimentally with dictionaries in the past to permit lookup by a
> Strong's number or the lemma it represents.

That works but then all current front-ends would show two entries.
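
An alternative that leaves the module keys alone would be for the
front-end to build a side index from unpointed forms to the real keys.
The following is only a sketch in Python under that assumption, not
anything Sword does today. strip_marks and build_unpointed_index are
hypothetical names, and the sample key is made up.

```python
import unicodedata


def strip_marks(key: str) -> str:
    """Drop combining marks (vowel points, accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", key)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_unpointed_index(keys):
    """Map each unpointed form to the list of real module keys that share it."""
    index = {}
    for key in keys:
        index.setdefault(strip_marks(key), []).append(key)
    return index


# Made-up module key: "shalom" written with points.
pointed = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
# What a user might type: the same consonants without points.
unpointed = "\u05e9\u05dc\u05d5\u05dd"

index = build_unpointed_index([pointed])
matches = index.get(strip_marks(unpointed), [])
```

Here matches holds the single pointed key, so the module still has one
entry per word and the front-end shows it only once.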

>
>> A user may expect to find a word by stem not just by prefix.
>
> I'm not sure whether this is a sort order issue or lookup/search issue.
> Presumably a user would know the word they want and type it in with its
> prefix, even if it is sorted to group with other words sharing the same
> stem.

Maybe I am not using the right terminology. Let's say that "run" is
in the dictionary but "ran" is not, because this dictionary only has
the base words and no grammatical variations. Now the user
right-clicks on "ran", chooses lookup, and is brought to the nearest
word to "ran", perhaps "rabid". This is a simple case. It has been
quite a while since I studied other languages, but I seem to remember
that German changes the prefix of words when going to the past tense.
And in Greek, I seem to remember diacritic changes and suffix changes.
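
To make the "ran" case concrete, here is a small Python sketch. The key
list and the lemmas table are invented for illustration, and nearest_key
is only my assumption of what a nearest-match lookup does (first key
sorting at or after the word), not a description of the Sword code.

```python
import bisect

# Made-up sorted dictionary keys (base words only).
keys = ["rabid", "rank", "run"]

# Hypothetical table from inflected forms to base words. A real one would
# come from a morphology resource, not be typed in by hand.
lemmas = {"ran": "run", "running": "run"}


def nearest_key(word: str) -> str:
    """Return the first key that sorts at or after word (clamped to the last key)."""
    i = bisect.bisect_left(keys, word)
    return keys[min(i, len(keys) - 1)]


def lookup(word: str) -> str:
    """Try the word itself, then its base form, then fall back to the nearest key."""
    if word in keys:
        return word
    base = lemmas.get(word)
    if base in keys:
        return base
    return nearest_key(word)
```

With these made-up keys, nearest_key("ran") lands on "rank", while
lookup("ran") finds "run" through the lemma table. The hard part is of
course the table, not the lookup.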

>
>> A user may expect to be able to type "photos" (a transliteration) and
>> find the real Greek word in a Greek dictionary.
>
> I'm willing to write these users off. We could transliterate back to
> Greek, but I don't think it's worth the effort or processor cycles. I
> don't believe that people who don't know how to read Greek use Greek
> lexicons other than as a novelty.

I was thinking of a different user altogether. For example, I use
Windows, Linux and Macs almost daily, and I do not want to learn each
OS's input system. I just want to find words by typing (like Beta
Greek). It is not a matter of reading but of entry.

Another example: many of our dictionaries have transliterations for
their terms. If an application were to look up a word across all
dictionaries but didn't look it up by transliteration, the word would
not be found.

As to the processor cycles, the cost will be small. "Between the
keystrokes," as one of my friends used to say.
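
As a rough illustration of how cheap this can be, here is a Python
sketch that goes the easier direction: transliterate each Greek key to
Latin once, and index the result. The GREEK_TO_LATIN table is a
hypothetical, deliberately simplistic scheme (it is not Beta Code and
ignores breathings), and the sample keys are made up.

```python
import unicodedata

# Hypothetical, simplistic Greek-to-Latin table for illustration only.
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "e", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}


def transliterate(key: str) -> str:
    """Lowercase, strip accents, then map each Greek letter to Latin."""
    decomposed = unicodedata.normalize("NFD", key.lower())
    bare = (ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in bare)


def build_index(keys):
    """Map each transliteration to the list of Greek keys that produce it."""
    index = {}
    for key in keys:
        index.setdefault(transliterate(key), []).append(key)
    return index


# Made-up accented Greek keys, written as escapes to pin the code points.
greek_keys = ["\u03c6\u03c9\u03c4\u03cc\u03c2", "\u03bb\u03cc\u03b3\u03bf\u03c2"]
index = build_index(greek_keys)
```

Typing "photos" then becomes a single dictionary lookup in index, and
the index only has to be built once per dictionary. Because several
Greek letters map to the same Latin letter, one transliteration may
return more than one key, which is why the values are lists.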