[sword-devel] Greek dictionary - input needed
DM Smith
dmsmith555 at yahoo.com
Tue Jan 20 12:11:57 MST 2009
Chris Little wrote:
> Daniel Owens wrote:
>> The other MAJOR problem is that the dictionary keys are always
>> capitalized, which makes it really awkward to use for Greek. Can I
>> lobby again for a change in that? How many Greek students are used to
>> looking up words in capitals? I was taught using lower case letters,
>> and many of the capitals are really fuzzy. When reading I can work
>> them out based on context sometimes because I have the rest of the
>> word to clue me in. Capitals also make accent marks look strange.
>> Then there is the issue of sort order again...
>
> I quite agree. No language with casing uses capital forms as its primary form.
> Capital letters are less recognizable and slow reading speed.
>
> That said, I don't quite know how we ought to solve the issue. We
> can't simply lowercase the existing keys, since many would actually
> need to incorporate capitals (e.g. personal & place names). And we'll
> need to do some kind of case folding when we do key lookups.
>
> Making keys be cased and doing case folding at runtime handles part of
> the issue. However, key sorting becomes more difficult and we have to
> guard against the possibility of keys that are identical except for
> casing (e.g. "a" and "A").
It is a hard problem, but not intractable. I think it might require a
new module type.
Some thoughts: Lookup and collation are two different problems that have
a single solution today. Lookup is the process of taking input and
finding one or more entries. Collation is the ordering of entries for
the purpose of display. These don't have to have a single solution.
Today, our modules use a strict byte ordering of the upper case
representation of each entry's term. For latin-1/cp1252, this gives a
well-defined, though sometimes inappropriate ordering. For UTF-8, the
situation is more complex. For a given glyph, there can be more than one
representation in UTF-8. For example, an accented letter may be a single
code point or a sequence of code points, with the base letter followed by
its accents in any order. Without normalization of the entry's term (we've
settled on NFC), the ordering is not well-defined. With it, it is. But
again, it may produce inappropriate ordering.
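The two encodings of an accented letter can be seen with Python's stdlib unicodedata (Python here is purely illustrative; the engine itself is C++):

```python
# Sketch of the NFC issue described above, using Python's stdlib
# unicodedata module.
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings render as the same glyph but compare unequal byte-wise.
print(precomposed == decomposed)                                 # False
# After NFC normalization both collapse to the single code point,
# so byte ordering (and lookup) becomes well-defined.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```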
Given this well-defined order, lookup of a word begins by converting it
to upper case (and it should also include converting it to NFC when
looking up in a UTF-8 dictionary) and then a binary search can be
performed. This will result in the nearest first match in the collated list.
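The normalize-then-binary-search lookup described above can be sketched as follows (an in-memory list stands in for the on-disk module index):

```python
import bisect
import unicodedata

def norm(term):
    # Build-time and lookup-time normalization: NFC, then upper case,
    # mirroring the strict byte ordering the post describes.
    return unicodedata.normalize("NFC", term).upper()

# A tiny stand-in "module": entries kept in the order of their
# normalized terms (an assumption -- real modules are on-disk indexes).
entries = sorted(["beta", "Alpha", "gamma", "delta"], key=norm)
keys = [norm(e) for e in entries]

def lookup(word):
    # Binary search lands on the nearest first match in the collated list.
    i = bisect.bisect_left(keys, norm(word))
    return entries[i] if i < len(entries) else None

print(lookup("ALPHA"))  # Alpha
print(lookup("b"))      # beta (nearest entry at or after "B")
```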
When each dictionary module is built, the input file does not need to be
ordered. As each entry is added it is stored against the normalized key.
If a subsequent entry normalizes to the same as a prior one, the key
will no longer point to the first but to the subsequent one. (On
a side note, the dat file will still contain the first entry, and if
nothing points to it, it will be orphaned.)
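A dict keyed on the normalized term shows the collision and the resulting orphan (again a Python sketch, not the actual build code):

```python
import unicodedata

def norm(term):
    # Normalized key used at module-build time (sketch assumption).
    return unicodedata.normalize("NFC", term).upper()

index = {}  # normalized key -> entry body
for term, body in [("März", "first entry"), ("MÄRZ", "second entry")]:
    # A later entry that normalizes to an existing key silently
    # replaces the earlier pointer; the first body is orphaned.
    index[norm(term)] = body

print(index[norm("märz")])  # second entry
```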
One of the impacts of this mechanism is that there cannot be two entries
with the same "key". Many dictionaries have multiple entries with the
same key. I think we should have a solution that provides for this.
I think there needs to be a notion of an internal sort key and an
external display key on a per-entry basis. Lookup would need to be
against the internal key. So a routine would be needed to convert/normalize
input into the form of the internal key and use that for lookup.
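A minimal sketch of that internal/external split, using Python's casefold() as a stand-in for full case folding (the entry terms are illustrative):

```python
import unicodedata

# Hypothetical per-entry split: a cased display key shown to the user,
# plus a folded internal key used only for lookup.
def internal_key(term):
    return unicodedata.normalize("NFC", term).casefold()

display_keys = ["Ἀβραάμ", "λόγος"]          # preserved exactly as cased
index = {internal_key(d): d for d in display_keys}

def lookup(word):
    # Input is normalized into the internal-key form before lookup.
    return index.get(internal_key(word))

print(lookup("ΛΌΓΟΣ"))  # λόγος -- found despite differing case
```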
ICU has the notion of a collation key, which can be used for such a
purpose. (I think we've gotten to the point where ICU is a requirement
for UTF-8 modules.) In ICU, the collation key is locale dependent. (For
example, Germans sort accented letters differently than the French do. In
Spanish dictionaries, at least older ones, ch was treated as a single
letter and sorted after all other c combinations.) I really don't
see any way around having a static collation for a module. If so, the
collation would need to be fixed with respect to either a fixed locale
or a locale based upon the language of the module.
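A toy, hand-rolled collation key for the traditional Spanish rule makes the locale dependence concrete; in practice this is exactly what ICU's locale-aware collation keys provide, and the mapping below is only an illustration:

```python
# Traditional Spanish collation treated "ch" as a single letter
# sorted after plain "c". A collation key encodes that rule.
def spanish_trad_key(word):
    key = []
    i = 0
    w = word.lower()
    while i < len(w):
        if w[i:i + 2] == "ch":
            key.append(("c", 1))  # digraph ranks after every plain "c"
            i += 2
        else:
            key.append((w[i], 0))
            i += 1
    return key

words = ["cuna", "chico", "dado", "cola"]
print(sorted(words, key=spanish_trad_key))
# ['cola', 'cuna', 'chico', 'dado'] under the traditional rules
```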
The other aspect of lookup is that we will be producing accented
dictionaries. But we want the dictionaries to work for unaccented texts.
For example, we have unaccented Greek texts and it is possible to show
Hebrew without vowel points or cantillation. The next round of Greek and
Hebrew dictionaries will have accents and vowel points. Lookup should
still find the one or more accented words that match an unaccented input.
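One way to sketch unaccented lookup is to strip combining marks via NFD decomposition, so accented entries and unaccented input share a key (the Greek entries here are illustrative):

```python
import unicodedata

def strip_marks(term):
    # Decompose, drop combining marks (accents, breathings, vowel
    # points), and recompose -- accented and unaccented forms then
    # normalize to the same lookup key.
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)

# Accented dictionary entries indexed by their unaccented form.
entries = ["λόγος", "λογίζομαι"]
index = {}
for e in entries:
    index.setdefault(strip_marks(e), []).append(e)

# Unaccented input (as in an unaccented Greek text) still matches.
print(index["λογος"])  # ['λόγος']
```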
We may also want to tackle lookup by transliteration.
For us to have multiple lookup mechanisms but a single collation, I
think this argues for separating lookup from collation. I don't think we
want to show all the different ways an entry is indexed.
So, lookup depends on normalized input that matches normalized
index(es). The result of a lookup is an entry which has a position in a
collation.
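The separation argued for above can be sketched as multiple normalized lookup keys per entry, with a single, separate collation applied to the results (the "logos" transliteration key is a hand-written assumption for illustration):

```python
import unicodedata

def nfc_fold(term):
    # Lookup normalization: NFC plus Python's casefold().
    return unicodedata.normalize("NFC", term).casefold()

def strip_marks(term):
    # Drop combining marks so unaccented input matches accented entries.
    d = unicodedata.normalize("NFD", term)
    return unicodedata.normalize(
        "NFC", "".join(c for c in d if not unicodedata.combining(c)))

# One entry, several lookup keys: folded, unaccented, transliterated.
entries = {"λόγος": "word, speech"}
lookup_index = {}
for term in entries:
    for key in (nfc_fold(term), strip_marks(nfc_fold(term)), "logos"):
        lookup_index.setdefault(key, set()).add(term)

def find(word):
    hits = set()
    for key in (nfc_fold(word), strip_marks(nfc_fold(word)), word):
        hits |= lookup_index.get(key, set())
    # A single, separate collation orders whatever the lookups found.
    return sorted(hits)

print(find("ΛΟΓΟΣ"))  # ['λόγος'] via the unaccented index
print(find("logos"))  # ['λόγος'] via the transliteration key
```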
As to solving the unique key problem, tei2mod could be changed to check
to see if there is already an entry with that normalized key. If there
is, then append a non-printing character to the end and try again. Or
simply change the engine to allow duplicates.
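The append-and-retry disambiguation could look like this hypothetical tei2mod-style helper, using a zero-width space as the non-printing character:

```python
def unique_key(index, key, pad="\u200b"):
    # Hypothetical tei2mod-style disambiguation: append a zero-width
    # space (a non-printing character) until the key is unused.
    while key in index:
        key += pad
    return key

index = {}
for entry in ["entry one", "entry two"]:
    # Both entries share the same visible key...
    index[unique_key(index, "ΛΌΓΟΣ")] = entry

# ...but survive under distinct internal keys.
print(len(index))  # 2
```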
I implemented this many years ago in Perl to run on a computer with
128 MB of RAM. To see it, go to: http://nexis.com/sources
Some info:
Search and sorting are independent.
Each entry is indexed on several keys. Lookup can be against one or more
of them.
There can be more than one entry with the same key.
The search result is ordered according to the end-user's locale as
provided by their browser, if that locale is supported, otherwise it
goes to a default ordering.
You will notice that the ordering takes noise words into account and
properly orders numbers. You might notice other complexities too. All of
it is handled by normalization and then generating a collation key for
the appropriate locale(s).
In Him,
DM