[sword-devel] Dotted and dotless I in Turkish & Azerbaijani, etc.
DM Smith
dmsmith at crosswire.org
Sat Feb 20 08:08:22 MST 2010
On 02/20/2010 04:58 AM, David Haslam wrote:
> Please first read this article.
>
> http://en.wikipedia.org/wiki/Dotted_and_dotless_I
>
> Languages such as Turkish and Northern Azeri have both dotted and dotless
> letter I in their Latin-based alphabets.
>
> This has implications for letter case.
> Such alphabets break the familiar case relationship between uppercase I and
> lowercase i.
> Instead they have the following upper- and lowercase pairs:
>
> I and ı
> İ and i
>
This and related problems have been discussed recently for Java Lucene.
See the following search for the discussions of Java Lucene and Turkish
(the Gossamer Threads search link is stable, so it will also return any
new/future conversations):
http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=turkish&search_type=AND
The Jira issue for Java Lucene that corrected this is:
http://issues.apache.org/jira/browse/LUCENE-2102
Another interesting Jira issue for Java Lucene that discusses using ICU
for normalization:
http://issues.apache.org/jira/browse/LUCENE-1488
The upshot is that Java Lucene previously handled Turkish incorrectly.
The basic problem was that the lowercase filters were not locale
sensitive: 'I' was always lowercased to 'i'.
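The locale problem is easy to reproduce outside Lucene. Here is a
minimal JDK sketch (plain java.lang.String, not clucene or Lucene filter
code) showing why the lowercasing locale matters for Turkish:

```java
import java.util.Locale;

public class TurkishCase {
    static final Locale TURKISH = new Locale("tr");

    // Locale-insensitive lowercasing, roughly what the old
    // LowerCaseFilter did: 'I' always becomes dotted 'i'.
    static String lowerDefault(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Locale-aware lowercasing driven by the language of the text:
    // 'I' becomes dotless 'ı', and 'İ' becomes plain 'i'.
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        System.out.println(lowerDefault("KIZIL"));   // kizil - dotted i, wrong for Turkish
        System.out.println(lowerTurkish("KIZIL"));   // kızıl - dotless ı, correct
        System.out.println(lowerTurkish("\u0130Z")); // iz - İ pairs with plain i
    }
}
```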
This was not the only issue. Here are some more:
- The success of the filter depends on whether the character is composed
  or decomposed. If it is decomposed, the combining mark is handled
  separately.
- The locale used for lowercasing must be driven by the language of the
  text, not by the user's location.
- The placement of the filter in the analyzer is critical. (See
  LUCENE-1488 above for a discussion.)
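The composed/decomposed pitfall in particular can be shown with the JDK
alone (a sketch, not actual analyzer code): NFD splits İ into a base 'I'
plus a combining dot above, and a character-by-character lowercase pass
run before any normalization mangles the result:

```java
import java.text.Normalizer;
import java.util.Locale;

public class FilterOrder {
    // Naive per-character lowercasing, the way a simple token
    // filter processes text one char at a time.
    static String perCharLower(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            out.append(Character.toLowerCase(s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String composed = "\u0130";  // İ as a single code point
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);
        System.out.println(decomposed.length());      // 2: 'I' + U+0307

        // Per-char filtering lowercases the base 'I' on its own and
        // leaves the combining dot stranded on a dotted 'i':
        System.out.println(perCharLower(decomposed)); // "i" + U+0307, two code points
        // Whole-string, locale-aware lowercasing gets it right:
        System.out.println(composed.toLowerCase(new Locale("tr"))); // "i"
    }
}
```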
> Questions:
> Does sword_icu properly address this in terms of case folding?
>
I don't think that SWORD's ICU library handles case folding; it is used
for transliteration instead.
> How does each front-end application address these issues, e.g. in terms of
> case-insensitive searches, etc?
>
If clucene is used for searches, it is simply wrong for these cases.
SWORD uses the StandardAnalyzer for all texts. This analyzer uses the
LowerCaseFilter, which is not sensitive to the user's or the text's locale.
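To make the search consequence concrete, here is a hypothetical sketch
(plain JDK code, not clucene) of what happens when the index side
lowercases without a locale while a Turkish reader queries with the
correct form:

```java
import java.util.Locale;

public class SearchMismatch {
    public static void main(String[] args) {
        // Indexing side: LowerCaseFilter-style, locale-insensitive.
        String indexedTerm = "ISTANBUL".toLowerCase(Locale.ROOT);    // "istanbul"
        // Query side: the form a Turkish reader would actually type,
        // with dotless ı.
        String queryTerm = "ISTANBUL".toLowerCase(new Locale("tr")); // "ıstanbul"

        // The terms no longer agree, so an exact term lookup misses.
        System.out.println(indexedTerm.equals(queryTerm)); // false
    }
}
```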
As I said earlier, SWORD needs to have an analyzer picked by language.
StandardAnalyzer is not appropriate for many if not most of the modules
at CrossWire.
It should not be too hard for someone (i.e. someone else, not me) to
back-port the Java Lucene Turkish analyzer to clucene, whether
contributed to clucene or put into the SWORD library. I say back-port
because it is part of Java Lucene 3.0, which is significantly different
from Java Lucene 2.9, and clucene is in the 2.x series.
In Him,
DM
> cf. We already have two Turkish Bible modules, and work is about to start
> on a Bible module for Northern Azeri.
>
> Working on the Go Bible for the Azerbaijani translation is how I became
> alerted to this issue.
>
> David
>
>
>