[jsword-devel] Searching and sorting accented text
DM Smith
dmsmith555 at yahoo.com
Thu Aug 5 17:14:42 MST 2004
As we add support for non-English languages, we will need to be able
to handle searches against non-ASCII text that may or may not be accented.
I have noticed that some Hebrew modules and some Greek modules have
accents and others don't. I have loaded up many of the non-ancient,
non-English texts and they have all kinds of accent marks.
According to the Unicode standard (UCS; the UTF forms are its encodings),
an accented character can be represented either as a single precomposed
character or as a base character followed by its combining diacriticals.
For example, "é" can be U+00E9, or "e" (U+0065) followed by U+0301.
Typing characters with these diacriticals is difficult, so many search
engines strip the accent marks out of their indexes and normalize search
requests the same way. Sometimes the results are humorous, as accents can
change the meaning of a word.
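A minimal sketch of the stripping step, assuming a JDK that ships
java.text.Normalizer (it was added in Java 6, so it is not available on
the JDK 1.4/1.5 discussed in this message; ICU4J has a comparable
Normalizer class). The class and method names are made up for
illustration:

```java
import java.text.Normalizer;

// Hypothetical helper, not part of JSword.
final class AccentStripper {
    // Decompose to NFD so each accent becomes a separate combining mark,
    // then delete every character in the Unicode "Mark" category.
    static String strip(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Precomposed and decomposed input both reduce to the same text.
        System.out.println(strip("caf\u00e9"));   // prints "cafe"
        System.out.println(strip("cafe\u0301"));  // prints "cafe"
    }
}
```

Running the same function over both the indexed text and the search
request is what makes accented and unaccented spellings match. Note that
this also removes Hebrew points and Greek accents, since they are
combining marks as well.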
Sorting based upon the numerical ordering of the code points will not
produce the ordering expected by people speaking a particular language.
For example, in French an "e" and all of its accented variants are
expected to be adjacent to each other when sorted. The same holds for
German, but in one language the accented forms come before the plain "e"
and in the other they follow it. I'm not sure about this one, but I have
heard that in traditional Spanish ordering "ch" sorts as its own letter,
after the rest of the "c" words. In German phone-book ordering, u-umlaut
sorts with "ue". I don't think we use sorting that much in JSword.
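To illustrate the difference between code point order and a
locale-aware order, here is a small sketch using java.text.Collator from
the standard JDK. The class and method names are invented for the
example, and the exact treatment of accents depends on the collation
rules the runtime ships for each locale:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of JSword.
final class LocaleSortDemo {
    // String's natural order compares UTF-16 code units, which matches
    // code point order for the characters used here.
    static List<String> byCodePoint(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // A Collator is a Comparator that applies the locale's collation rules.
    static List<String> byLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }
}
```

With the input "f", "é", "e", the code point sort puts "é" after "f"
because U+00E9 is numerically larger than U+0066, while the French
collator keeps "é" next to "e".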
Sorting is an L10N/I18N issue, but stripping accents is a search bug fix.
Anyway, JDK 1.4 (and 1.5 as far as I know) does not have a public API
for decomposing characters or stripping accents; java.text.Collator only
covers locale-aware sorting. I have used IBM's ICU4J to handle these
issues in a project at work. With it you can build a localized sort key
and you can decompose/recompose characters. For characters, you can get
at the parts to build an index and do other things. (I don't know if
there are other Java packages available.)
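As a rough sketch of those two operations, the following uses the
java.text classes rather than ICU4J so that it runs without extra jars.
This is a substitution on my part: java.text.Normalizer needs Java 6 or
later, and ICU4J offers similarly named Collator and Normalizer classes
in com.ibm.icu.text. The class name SortKeyDemo is hypothetical:

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class, not part of JSword or ICU4J.
final class SortKeyDemo {
    private final Collator collator;

    SortKeyDemo(Locale locale) {
        this.collator = Collator.getInstance(locale);
    }

    // A CollationKey is a precomputed sort key. Keys produced by the same
    // Collator can be compared cheaply, which helps when sorting many words.
    CollationKey sortKey(String word) {
        return collator.getCollationKey(word);
    }

    // NFD splits a precomposed character into base + combining marks.
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    // NFC puts the parts back together into precomposed form where one exists.
    static String recompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

After decompose, the individual chars of the result are the "parts": for
"é" they are "e" followed by the combining acute accent U+0301, so an
indexer can keep or drop the marks as it sees fit.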
Joe, would you add ICU4J to the jars for common or jsword, unless there
is something better or the licensing is onerous? (I think jsword makes
the most sense, as it will be used by the sword books/indexes.)
We will need this at some point. Where should we add this to bugs.txt?
Post 1.0?
Another thing we can do once we have ICU4J is offer an option to turn
accent-sensitive searching on or off.
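One way such an option could look, as a sketch only. It again assumes
java.text.Normalizer (Java 6 or later) instead of ICU4J, and
SearchNormalizer with its keepAccents flag is a hypothetical name, not
an existing JSword class:

```java
import java.text.Normalizer;

// Hypothetical helper: applies the same normalization to indexed text
// and to search requests, with accents kept or dropped by a flag.
final class SearchNormalizer {
    private final boolean keepAccents;

    SearchNormalizer(boolean keepAccents) {
        this.keepAccents = keepAccents;
    }

    String normalize(String text) {
        if (keepAccents) {
            // Accents on: only unify precomposed and decomposed spellings.
            return Normalizer.normalize(text, Normalizer.Form.NFC);
        }
        // Accents off: decompose, then remove all combining marks.
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    boolean matches(String indexedWord, String query) {
        return normalize(indexedWord).equals(normalize(query));
    }
}
```

With the flag off, a request for "resume" finds "résumé"; with it on,
the two stay distinct, but "résumé" typed in decomposed form still
matches the precomposed spelling. An index built with one setting would
need to be rebuilt, or kept alongside a second index, to serve the other.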