[sword-devel] Dictionary ordering

DM Smith dmsmith555 at yahoo.com
Sun Sep 14 09:47:37 MST 2008


Found an interesting thread on ordering dictionary entries. Currently
we use byte ordering for the key, which results in UTF-8 or Latin-1
(cp1252) code point ordering.

The thread discusses Farsi, but the problem may pertain to other
languages as well. The order of code points may not match the order of
the letters in an alphabet. If I understand correctly, Farsi starts
with the Arabic alphabet but inserts some extra letters into the
sequence. These have code points that do not fall where the alphabet
expects them.
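
A small sketch of the mismatch, using three letters as an example: in
the Farsi alphabet PEH (U+067E) comes between BEH (U+0628) and TEH
(U+062A), but its code point is higher than both. The class and method
names here are made up for illustration. String.compareTo compares
UTF-16 code units, which for these characters gives the same order as
comparing UTF-8 bytes.

```java
import java.util.Arrays;

class CodePointOrderDemo {
    // Sorts keys with String's natural ordering (UTF-16 code units).
    // For characters in the Basic Multilingual Plane this matches
    // code point order, and therefore UTF-8 byte order.
    static String[] sortByCodePoint(String... keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        String beh = "\u0628"; // ARABIC LETTER BEH
        String peh = "\u067E"; // ARABIC LETTER PEH (extra Farsi letter)
        String teh = "\u062A"; // ARABIC LETTER TEH

        // Alphabetical order would be BEH, PEH, TEH.
        for (String s : sortByCodePoint(beh, peh, teh)) {
            System.out.printf("U+%04X%n", (int) s.charAt(0));
        }
        // prints U+0628, U+062A, U+067E: PEH lands after TEH
    }
}
```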

Here is the thread: http://www.nabble.com/lucene-farsi-problem-td16977096.html

I think that for TEI modules we should use key/sortKey (whichever is
in P5) as an internal specifier of sorting. Either we expect the input
document to be sorted on that key, or we assume that the ordering of
the dictionary entries is correct and generate the sortKey.
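
One way the second option might look, as a rough sketch only: if the
entries are trusted to be in the right order already, a zero-padded
sequence number per entry gives a sortKey whose byte ordering
reproduces the document order. The class and method names are
hypothetical, and this is not how any existing importer works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SortKeyGenerator {
    // Returns one sort key per entry, in document order. Keys are
    // zero-padded to a common width so that plain byte comparison
    // of the keys preserves the original entry order.
    static List<String> generateSortKeys(int entryCount) {
        int width = String.valueOf(Math.max(entryCount - 1, 0)).length();
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < entryCount; i++) {
            // Locale.ROOT keeps the digits ASCII on every platform.
            keys.add(String.format(Locale.ROOT, "%0" + width + "d", i));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Twelve entries get the keys 00, 01, ..., 11.
        System.out.println(generateSortKeys(12));
    }
}
```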

In Java the traditional approach is to use a RuleBasedCollator, which
for a given locale generates a key for each string that can then be
used for sorting and searching. I think this is part of ICU.
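
A minimal sketch of the collation key idea, using the
java.text.RuleBasedCollator that ships with the JDK (ICU4J has a class
of the same name in com.ibm.icu.text with a similar API). The
three-letter rule string is an assumption made up for this example and
is nowhere near a complete Farsi tailoring; a real module would take
its rules from the locale.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

class CollationKeyDemo {
    // Illustrative rules only: BEH < PEH < TEH.
    static final String SAMPLE_RULES = "< \u0628 < \u067E < \u062A";

    static Collator sampleCollator() {
        try {
            return new RuleBasedCollator(SAMPLE_RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    // Builds one CollationKey per string, sorts the keys, and returns
    // the source strings in collated order.
    static String[] sortByCollationKey(Collator collator, String... keys) {
        CollationKey[] collationKeys = new CollationKey[keys.length];
        for (int i = 0; i < keys.length; i++) {
            collationKeys[i] = collator.getCollationKey(keys[i]);
        }
        Arrays.sort(collationKeys); // CollationKey is Comparable
        String[] sorted = new String[collationKeys.length];
        for (int i = 0; i < collationKeys.length; i++) {
            sorted[i] = collationKeys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] sorted =
            sortByCollationKey(sampleCollator(), "\u062A", "\u067E", "\u0628");
        for (String s : sorted) {
            System.out.printf("U+%04X%n", (int) s.charAt(0));
        }
        // prints U+0628, U+067E, U+062A: PEH now sits between BEH and TEH
    }
}
```

The bytes from CollationKey.toByteArray() can be compared bytewise, so
they could in principle be stored as the sort key and keep the existing
byte-ordered lookup.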

In Him,
	DM



