[sword-devel] Dotted and dotless I in Turkish & Azerbaijani, etc.

DM Smith dmsmith at crosswire.org
Sat Feb 20 08:08:22 MST 2010


On 02/20/2010 04:58 AM, David Haslam wrote:
> Please first read this article.
>
> http://en.wikipedia.org/wiki/Dotted_and_dotless_I
>
> Languages such as Turkish and Northern Azeri have both dotted and dotless
> letter I in their Latin-based alphabets.
>
> This has implications for letter case.
> Such alphabets break the familiar case relationship between uppercase I and
> lowercase i.
> Instead they have as upper- and lowercase pairs:
>
> I and ı
> İ and i
>    
This and related issues have been discussed recently for Java 
Lucene. See the following for discussions regarding Java Lucene and 
Turkish (the Gossamer Threads search is durable, so it will also return 
any new/future conversations):
http://www.gossamer-threads.com/lists/engine?list=lucene&do=search_results&search_forum=forum_3&search_string=turkish&search_type=AND
The Jira issue for Java Lucene that corrected this is:
http://issues.apache.org/jira/browse/LUCENE-2102

Another interesting Jira issue for Java Lucene that discusses using ICU 
for normalization:
http://issues.apache.org/jira/browse/LUCENE-1488

The upshot is that Java Lucene previously handled Turkish 
inappropriately. The basic problem is that the lowercase filters were 
not locale sensitive; 'I' was always lowercased to 'i'.
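A minimal illustration of the problem in plain Java (no Lucene involved), 
assuming only java.lang.String and java.util.Locale:

```java
import java.util.Locale;

public class TurkishLowerCase {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr");

        // Root/English-style lowercasing: 'I' -> 'i'
        System.out.println("I".toLowerCase(Locale.ROOT));   // i

        // Turkish lowercasing: 'I' -> dotless 'ı' (U+0131)
        System.out.println("I".toLowerCase(turkish));       // ı

        // Dotted capital 'İ' (U+0130) lowercases to plain 'i' in Turkish
        System.out.println("\u0130".toLowerCase(turkish));  // i
    }
}
```

A filter that always uses the root (or the user's) locale will get every 
one of these wrong for Turkish text.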

This was not the only issue. Here are some more:
The success of the filter depends upon whether the character is composed 
or decomposed. If it is decomposed, the combining mark is handled 
separately.
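For example, 'İ' (U+0130) decomposes under NFD into 'I' plus a combining 
dot above; a filter that works code point by code point then sees a plain 
'I' and never applies the Turkish rule. A quick check with the standard 
java.text.Normalizer:

```java
import java.text.Normalizer;

public class ComposedVsDecomposed {
    public static void main(String[] args) {
        String composed = "\u0130"; // İ as a single code point

        // NFD splits it into 'I' (U+0049) + combining dot above (U+0307)
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println(decomposed.length());          // 2
        System.out.println(decomposed.equals("I\u0307")); // true
    }
}
```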

It is important to choose the locale for lowercasing based upon the 
text, not the user's location.
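In other words, the lowercasing locale should be derived from the 
module's language (e.g. its Lang= conf entry), not from the default 
locale of the machine the front end runs on. A sketch (the lowerCase 
helper here is hypothetical, not existing SWORD API):

```java
import java.util.Locale;

public class TextDrivenCase {
    // Hypothetical helper: lowercase according to the text's language
    // code, which a front end could take from the module's Lang= entry.
    static String lowerCase(String text, String languageCode) {
        return text.toLowerCase(new Locale(languageCode));
    }

    public static void main(String[] args) {
        System.out.println(lowerCase("D\u0130YARBAKIR", "tr")); // diyarbakır
        System.out.println(lowerCase("ISTANBUL", "en"));        // istanbul
    }
}
```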

The placement of the filter in the analyzer is critical. (See 
LUCENE-1488 above for a discussion.)
> Questions:
> Does sword_icu properly address this in terms of case folding?
>    
I don't think that SWORD's ICU library handles case folding, but rather 
transliterations.

> How does each front-end application address these issues, e.g. in terms of
> case-insensitive searches, etc?
>    
If clucene is used for searches, it is simply wrong for these cases. 
SWORD uses the StandardAnalyzer for all texts. This analyzer uses the 
LowerCaseFilter, which is not sensitive to the user's or the text's locale.

As I said earlier, SWORD needs to have an analyzer picked by language. 
StandardAnalyzer is not appropriate for many if not most of the modules 
at CrossWire.

It should not be too hard for someone (i.e. someone else, not me) to 
back-port the Java Lucene Turkish analyzer to clucene, whether 
contributed to clucene or put into the SWORD lib. I say back-port 
because it is part of Java Lucene 3.0, which is significantly different 
than Java Lucene 2.9, and clucene is in the 2.x series.

In Him,
     DM

> cf.  We already have two Turkish Bible modules, and work is about to start
> on a Bible module for Northern Azeri.
>
> Working on the Go Bible for the Azerbaijani translation is how I became
> alerted to this issue.
>
> David




More information about the sword-devel mailing list