[jsword-devel] Searching and sorting accented text

DM Smith dmsmith555 at yahoo.com
Thu Aug 5 17:14:42 MST 2004


As we add support for non-English languages, we will need to handle 
searches against non-ASCII text that may or may not be accented.

I have noticed that some Hebrew modules and some Greek modules have 
accents and others don't. I have loaded up many of the non-ancient, 
non-English texts and they have all kinds of accent marks.

According to the Unicode standard, an accented character can be 
represented either as a single precomposed character or as a base 
character followed by its combining diacriticals. Typing characters with 
these diacriticals is difficult, so many search engines strip the accent 
marks out of their indexes and normalize search requests the same way. 
Sometimes the results are humorous, as accents can change the meaning of 
a word.
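
As a rough illustration of the strip-and-normalize approach, here is a 
minimal sketch. It assumes a JDK that has java.text.Normalizer (added in 
Java 6, so later than the JDKs mentioned in this message), and the class 
and method names are made up for the example. It decomposes the text and 
drops the combining marks, so both representations of an accented 
character end up as the same key.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of JSword or ICU4J.
final class AccentStripper {
    private AccentStripper() {}

    /** Decomposes to NFD, then removes every combining mark (category M). */
    static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String precomposed = "caf\u00e9";  // e-acute as a single code point
        String combining = "cafe\u0301";   // plain e followed by combining acute
        // Both forms reduce to the same unaccented key.
        System.out.println(strip(precomposed).equals(strip(combining)));
    }
}
```

The same function would have to be applied both to the text being 
indexed and to each search request for the two to match.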

Sorting based upon the numerical ordering of the code points will not 
produce the ordering expected by people speaking a particular language. 
For example, in French an "e" and all of its accented variants are 
expected to be adjacent to each other when sorted. The same is true of 
German, but in one language the accented forms come before the plain "e" 
and in the other they follow it. I'm not sure about this one, but I have 
heard that in traditional Spanish ordering "ch" sorts as a separate 
letter after the other "c" entries. In German phone-book ordering, 
u-umlaut sorts as "ue". I don't think we use sorting that much in JSword.
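
To make the difference concrete, here is a small sketch that sorts the 
same words by code point and with a locale-aware comparator. It uses the 
JDK's own java.text.Collator rather than ICU4J, and the class name and 
word list are invented for the example.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of JSword.
final class LocaleSortDemo {
    private LocaleSortDemo() {}

    /** Sorts with String.compareTo, i.e. by UTF-16 code unit value. */
    static List<String> byCodePoint(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    /** Sorts with the collation rules the JDK provides for the locale. */
    static List<String> byLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // "\u00e9t\u00e9" is the French word for summer, with two e-acutes.
        List<String> words = Arrays.asList("zone", "\u00e9t\u00e9", "eau");
        System.out.println(byCodePoint(words));           // accented word lands after "zone"
        System.out.println(byLocale(words, Locale.FRENCH)); // accented word sits with the other "e" words
    }
}
```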

Sorting is an L10N/I18N issue, but stripping accents is a search bug fix.

Anyway, JDK 1.4 (and 1.5 as far as I know) does not have a public API 
for decomposing accented characters. I have used IBM's ICU4J to handle 
these issues in a project at work. With it you can build a localized 
sort key, and you can decompose and recompose characters, which lets you 
get at the base character and its marks to build an index and do other 
things. (I don't know whether other Java packages are available.)
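
For a feel of what a sort key and decompose/recompose look like in code, 
here is a sketch written against the JDK's java.text classes 
(CollationKey, and Normalizer from Java 6 on) as a stand-in. It does not 
show ICU4J's own API, and the class and method names are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class; uses JDK classes, not ICU4J.
final class SortKeyDemo {
    private SortKeyDemo() {}

    /** Builds a key whose comparison matches the collator's ordering. */
    static CollationKey sortKey(Collator collator, String text) {
        return collator.getCollationKey(text);
    }

    /** Splits precomposed characters into base character plus marks (NFD). */
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    /** Recombines base characters and marks where a precomposed form exists (NFC). */
    static String recompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        Collator french = Collator.getInstance(Locale.FRENCH);
        CollationKey a = sortKey(french, "\u00e9t\u00e9");
        CollationKey b = sortKey(french, "zone");
        System.out.println(a.compareTo(b) < 0);            // key order follows the collator
        System.out.println(decompose("\u00e9").length());  // base letter plus one combining mark
        System.out.println(recompose("e\u0301").length()); // back to a single character
    }
}
```

Keys are only comparable when they come from the same collator, so an 
index built this way would need to record which locale produced it.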

Joe, would you add ICU4J (unless there is something better or the 
licensing is onerous) to the jars for common or jsword? I think jsword 
makes the most sense, as it will be used by the sword books/indexes.

We will need this at some point. Where should we add this to bugs.txt?
Post 1.0?

Another thing we can do once we have ICU4J is have an option to turn 
on/off accents.
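
One possible reading of such an option, sketched with the JDK's 
java.text.Collator rather than ICU4J, is a flag that chooses the 
comparison strength. The class, method, and parameter names here are 
invented for the example.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper, not part of JSword.
final class AccentOption {
    private AccentOption() {}

    /**
     * Compares two words, honoring or ignoring accents.
     * Note that PRIMARY strength also ignores case differences.
     */
    static boolean sameWord(String a, String b, Locale locale, boolean accentSensitive) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(accentSensitive ? Collator.TERTIARY : Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "cafe";
        String accented = "caf\u00e9"; // with e-acute
        System.out.println(sameWord(plain, accented, Locale.FRENCH, false)); // accents ignored
        System.out.println(sameWord(plain, accented, Locale.FRENCH, true));  // accents respected
    }
}
```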


