[sword-devel] Dictionary ordering
DM Smith
dmsmith555 at yahoo.com
Sun Sep 14 09:47:37 MST 2008
I found an interesting thread on ordering dictionary entries. Currently
we use byte ordering for the key, which results in UTF-8 or Latin-1
(cp1252) code point ordering.
The thread discusses Farsi, but the issue may pertain to other languages
as well. The problem is that the order of code points may not match the
order of the letters in an alphabet. If I understand correctly, the Farsi
alphabet starts from the Arabic one but inserts some extra letters into
the sequence. These letters have code points that are not sequenced where
the alphabet would place them.
Here is the thread: http://www.nabble.com/lucene-farsi-problem-td16977096.html
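
To make the mismatch concrete, here is a small Java sketch. It assumes
the standard Persian alphabet order, in which pe (U+067E) comes right
after be (U+0628) and before te (U+062A). The class and method names are
made up for this illustration and are not taken from SWORD or JSword.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not SWORD or JSword code.
class CodePointOrderSketch {
    static final String BE = "\u0628"; // Arabic letter be
    static final String PE = "\u067E"; // Persian letter pe (follows be in the alphabet)
    static final String TE = "\u062A"; // Arabic letter te (follows pe in the alphabet)

    // Compare two strings by their UTF-8 bytes, treating each byte as unsigned.
    // For valid UTF-8 this gives the same result as comparing code points.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        // In the alphabet pe precedes te, but in byte order pe sorts after te.
        System.out.println("be vs pe: " + Integer.signum(compareUtf8Bytes(BE, PE)));
        System.out.println("pe vs te: " + Integer.signum(compareUtf8Bytes(PE, TE)));
    }
}
```

With byte ordering, an entry starting with pe ends up after every entry
starting with a basic Arabic letter, rather than between be and te.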
I think that for TEI modules, we should use key/sortKey (whichever is
in P5) as an internal specifier of sorting, expecting the input
document to be sorted on that key, or we should assume that the
ordering of the dictionary entries is correct and generate the sortKey.
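
As a rough sketch of the second option, one way to generate a sort key
is to number the entries in the order they appear and zero-pad the
number, so that plain byte ordering of the generated keys reproduces the
document order. The key format and the names below are assumptions made
for this example; nothing here is an existing SWORD convention.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; the zero-padded key format is an assumption.
class GeneratedSortKeySketch {
    // Returns one generated sort key per entry, in the same order as the input.
    // The input order is trusted to be the correct dictionary order.
    static List<String> generateSortKeys(List<String> entriesInTrustedOrder) {
        int count = entriesInTrustedOrder.size();
        // Pad to the width of the largest ordinal so byte order equals numeric order.
        int width = Math.max(1, String.valueOf(count).length());
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // Locale.ROOT keeps the digits ASCII regardless of the default locale.
            keys.add(String.format(Locale.ROOT, "%0" + width + "d", i + 1));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        entries.add("\u0628"); // be
        entries.add("\u067E"); // pe
        entries.add("\u062A"); // te
        System.out.println(generateSortKeys(entries));
    }
}
```

The drawback of such positional keys is that they cannot be used to
look up a word that is not already in the module, which a collation key
could.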
In Java the traditional methodology is to use RuleBasedCollator, which
for a given locale will generate a key for each string that can then
be used for sorting and searching. I think this is part of ICU.
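
A minimal sketch of that approach follows. It uses the JDK's
java.text.RuleBasedCollator (ICU4J ships a class of the same name in
com.ibm.icu.text). Rather than rely on whatever locale data is
installed, it spells out a tiny rule string covering only five letters,
which is an assumption for the example and nowhere near a complete
Farsi collation. The class and method names are likewise made up.

```java
import java.text.CollationKey;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; the rules cover five letters, not a full alphabet.
class CollationKeySketch {
    // alef < be < pe < te < se, i.e. pe is placed between be and te.
    static final String RULES = "< \u0627 < \u0628 < \u067E < \u062A < \u062B";

    static RuleBasedCollator collator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    // Sort words by their collation keys instead of by code point.
    static List<String> sortByCollationKey(List<String> words) {
        RuleBasedCollator c = collator();
        List<CollationKey> keys = new ArrayList<>();
        for (String w : words) {
            keys.add(c.getCollationKey(w));
        }
        Collections.sort(keys); // CollationKey is Comparable
        List<String> sorted = new ArrayList<>();
        for (CollationKey k : keys) {
            sorted.add(k.getSourceString());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("\u062A"); // te
        words.add("\u067E"); // pe
        words.add("\u0628"); // be
        System.out.println(sortByCollationKey(words));
    }
}
```

CollationKey.toByteArray() returns the key as bytes that are documented
to compare the same way as the keys themselves, so such a key might be
a candidate for what gets stored as a sortKey under byte ordering.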
In Him,
DM