[sword-devel] Greek dictionary - input needed
DM Smith
dmsmith555 at yahoo.com
Tue Jan 20 12:11:57 MST 2009
Chris Little wrote:
> Daniel Owens wrote:
>> The other MAJOR problem is that the dictionary keys are always
>> capitalized, which makes it really awkward to use for Greek. Can I
>> lobby again for a change in that? How many Greek students are used to
>> looking up words in capitals? I was taught using lower case letters,
>> and many of the capitals are really fuzzy. When reading I can work
>> them out based on context sometimes because I have the rest of the
>> word to clue me in. Capitals also make accent marks look strange.
>> Then there is the issue of sort order again...
>
> I quite agree. No language with casing uses capital forms as its primary form.
> Capital letters are less recognizable and slow reading speed.
>
> That said, I don't quite know how we ought to solve the issue. We
> can't simply lowercase the existing keys, since many would actually
> need to incorporate capitals (e.g. personal & place names). And we'll
> need to do some kind of case folding when we do key lookups.
>
> Making keys be cased and doing case folding at runtime handles part of
> the issue. However, key sorting becomes more difficult and we have to
> guard against the possibility of keys that are identical except for
> casing (e.g. "a" and "A").
It is a hard problem, but not intractable. I think it might require a
new module type.
Some thoughts: Lookup and collation are two different problems that have
a single solution today. Lookup is the process of taking input and
finding one or more entries. Collation is the ordering of entries for
the purpose of display. These don't have to have a single solution.
Today, our modules use a strict byte ordering of the upper case
representation of each entry's term. For latin-1/cp1252, this gives a
well-defined, though sometimes inappropriate ordering. For UTF-8, the
situation is more complex. For a given glyph, there can be more than one
representation in UTF-8. For example, an accented letter may be a single
code point or a sequence of code points, with the base letter followed by
its accents in any order. Without normalization of the entry's term (we've
settled on NFC), the ordering is not well-defined. With it, it is. But
again, it may produce inappropriate ordering.
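The two encodings of an accented letter can be seen with Python's stdlib unicodedata (Python here is purely illustrative; the engine itself is C++):

```python
# Sketch of the NFC issue described above, using Python's stdlib
# unicodedata module.
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings render as the same glyph but compare unequal byte-wise.
print(precomposed == decomposed)                                 # False
# After NFC normalization both collapse to the single code point,
# so byte ordering (and lookup) becomes well-defined.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```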
Given this well-defined order, lookup of a word begins by converting it
to upper case (and it should also include converting it to NFC when
looking up in a UTF-8 dictionary) and then a binary search can be
performed. This will result in the nearest first match in the collated list.
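The normalize-then-binary-search lookup described above can be sketched as follows (an in-memory list stands in for the on-disk module index):

```python
import bisect
import unicodedata

def norm(term):
    # Build-time and lookup-time normalization: NFC, then upper case,
    # mirroring the strict byte ordering the post describes.
    return unicodedata.normalize("NFC", term).upper()

# A tiny stand-in "module": entries kept in the order of their
# normalized terms (an assumption -- real modules are on-disk indexes).
entries = sorted(["beta", "Alpha", "gamma", "delta"], key=norm)
keys = [norm(e) for e in entries]

def lookup(word):
    # Binary search lands on the nearest first match in the collated list.
    i = bisect.bisect_left(keys, norm(word))
    return entries[i] if i < len(entries) else None

print(lookup("ALPHA"))  # Alpha
print(lookup("b"))      # beta (nearest entry at or after "B")
```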
When each dictionary module is built, the input file does not need to be
ordered. As each entry is added it is stored against the normalized key.
If a subsequent entry normalizes to the same as a prior one, the key
will no longer point to the first but to the subsequent one. (On
a side note, the dat file will still contain the first entry, and if
nothing points to it, it will be orphaned.)
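A dict keyed on the normalized term shows the collision and the resulting orphan (again a Python sketch, not the actual build code):

```python
import unicodedata

def norm(term):
    # Normalized key used at module-build time (sketch assumption).
    return unicodedata.normalize("NFC", term).upper()

index = {}  # normalized key -> entry body
for term, body in [("März", "first entry"), ("MÄRZ", "second entry")]:
    # A later entry that normalizes to an existing key silently
    # replaces the earlier pointer; the first body is orphaned.
    index[norm(term)] = body

print(index[norm("märz")])  # second entry
```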
One of the impacts of this mechanism is that there cannot be two entries
with the same "key". Many dictionaries have multiple entries with the
same key. I think we should have a solution that provides for this.
I think there needs to be a notion of an internal sort key and an
external display key on a per-entry basis. Lookup would need to be
against the internal key. So a routine would be needed to convert/normalize
input into the form of the internal key and use that for lookup.
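A minimal sketch of that internal/external split, using Python's casefold() as a stand-in for full case folding (the entry terms are illustrative):

```python
import unicodedata

# Hypothetical per-entry split: a cased display key shown to the user,
# plus a folded internal key used only for lookup.
def internal_key(term):
    return unicodedata.normalize("NFC", term).casefold()

display_keys = ["Ἀβραάμ", "λόγος"]          # preserved exactly as cased
index = {internal_key(d): d for d in display_keys}

def lookup(word):
    # Input is normalized into the internal-key form before lookup.
    return index.get(internal_key(word))

print(lookup("ΛΌΓΟΣ"))  # λόγος -- found despite differing case
```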
ICU has the notion of a collation key, which can be used for such a
purpose. (I think we've gotten to the point where ICU is a requirement
for UTF-8 modules.) In ICU, the collation key is locale dependent. (For
example, Germans sort accented letters differently than the French do. In
Spanish dictionaries, at least older ones, ch was treated as a single
letter and sorted after all other c combinations.) I really don't
see any way around having a static collation for a module. If so, the
collation would need to be fixed with respect to either a fixed locale
or a locale based upon the language of the module.
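A toy, hand-rolled collation key for the traditional Spanish rule makes the locale dependence concrete; in practice this is exactly what ICU's locale-aware collation keys provide, and the mapping below is only an illustration:

```python
# Traditional Spanish collation treated "ch" as a single letter
# sorted after plain "c". A collation key encodes that rule.
def spanish_trad_key(word):
    key = []
    i = 0
    w = word.lower()
    while i < len(w):
        if w[i:i + 2] == "ch":
            key.append(("c", 1))  # digraph ranks after every plain "c"
            i += 2
        else:
            key.append((w[i], 0))
            i += 1
    return key

words = ["cuna", "chico", "dado", "cola"]
print(sorted(words, key=spanish_trad_key))
# ['cola', 'cuna', 'chico', 'dado'] under the traditional rules
```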
The other aspect of lookup is that we will be producing accented
dictionaries. But we want the dictionaries to work for unaccented texts.
For example, we have unaccented Greek texts and it is possible to show
Hebrew without vowel points or cantillation. The next round of Greek and
Hebrew dictionaries will have accents and vowel points. Lookup should
still find the one or more accented words that match an unaccented input.
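One way to sketch unaccented lookup is to strip combining marks via NFD decomposition, so accented entries and unaccented input share a key (the Greek entries here are illustrative):

```python
import unicodedata

def strip_marks(term):
    # Decompose, drop combining marks (accents, breathings, vowel
    # points), and recompose -- accented and unaccented forms then
    # normalize to the same lookup key.
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped)

# Accented dictionary entries indexed by their unaccented form.
entries = ["λόγος", "λογίζομαι"]
index = {}
for e in entries:
    index.setdefault(strip_marks(e), []).append(e)

# Unaccented input (as in an unaccented Greek text) still matches.
print(index["λογος"])  # ['λόγος']
```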
We may also want to tackle lookup by transliteration.
For us to have multiple lookup mechanisms but a single collation, I
think this argues for separating lookup from collation. I don't think we
want to show all the different ways an entry is indexed.
So, lookup depends on normalized input that matches normalized
index(es). The result of a lookup is an entry which has a position in a
collation.
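The separation argued for above can be sketched as multiple normalized lookup keys per entry, with a single, separate collation applied to the results (the "logos" transliteration key is a hand-written assumption for illustration):

```python
import unicodedata

def nfc_fold(term):
    # Lookup normalization: NFC plus Python's casefold().
    return unicodedata.normalize("NFC", term).casefold()

def strip_marks(term):
    # Drop combining marks so unaccented input matches accented entries.
    d = unicodedata.normalize("NFD", term)
    return unicodedata.normalize(
        "NFC", "".join(c for c in d if not unicodedata.combining(c)))

# One entry, several lookup keys: folded, unaccented, transliterated.
entries = {"λόγος": "word, speech"}
lookup_index = {}
for term in entries:
    for key in (nfc_fold(term), strip_marks(nfc_fold(term)), "logos"):
        lookup_index.setdefault(key, set()).add(term)

def find(word):
    hits = set()
    for key in (nfc_fold(word), strip_marks(nfc_fold(word)), word):
        hits |= lookup_index.get(key, set())
    # A single, separate collation orders whatever the lookups found.
    return sorted(hits)

print(find("ΛΟΓΟΣ"))  # ['λόγος'] via the unaccented index
print(find("logos"))  # ['λόγος'] via the transliteration key
```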
As to solving the unique key problem, tei2mod could be changed to check
to see if there is already an entry with that normalized key. If there
is, then append a non-printing character to the end and try again. Or
simply change the engine to allow duplicates.
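The append-and-retry disambiguation could look like this hypothetical tei2mod-style helper, using a zero-width space as the non-printing character:

```python
def unique_key(index, key, pad="\u200b"):
    # Hypothetical tei2mod-style disambiguation: append a zero-width
    # space (a non-printing character) until the key is unused.
    while key in index:
        key += pad
    return key

index = {}
for entry in ["entry one", "entry two"]:
    # Both entries share the same visible key...
    index[unique_key(index, "ΛΌΓΟΣ")] = entry

# ...but survive under distinct internal keys.
print(len(index))  # 2
```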
I implemented this many years ago in Perl to run on a computer with
128 MB of RAM. To see it, go to: http://nexis.com/sources
Some info:
Search and sorting are independent.
Each entry is indexed on several keys. Lookup can be against one or more
of them.
There can be more than one entry with the same key.
The search result is ordered according to the end-user's locale as
provided by their browser, if that locale is supported, otherwise it
goes to a default ordering.
You will notice that the ordering takes noise words into account and
properly orders numbers. You might notice other complexities too. All of
it is handled by normalization and then generating a collation key for
the appropriate locale(s).
In Him,
DM