[jsword-devel] Accenting in Greek

DM Smith dmsmith at crosswire.org
Wed Sep 19 10:59:15 MST 2012


I think there is an issue for this in Jira, but I don't have time at the moment to look. It is a problem, and a more general one than just Greek and Hebrew: it affects any text that is not plain ASCII.

A single glyph (e.g. an accented/decorated character) can be made up of several Unicode code points (decomposed) or just one (composed). For a search to be valid, the search request and the index have to be normalized to the same form.
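
To make the composed/decomposed point concrete, here is a minimal sketch using only the JDK's java.text.Normalizer. The class and method names are made up for illustration and are not part of JSword; the choice of NFC is also just an assumption, since any form works as long as the query and the index use the same one.

```java
import java.text.Normalizer;

// Hypothetical helper, not JSword code: shows why both sides need the same form.
final class NormalizationSketch {
    // Bring a string to NFC (composed) form so equivalent spellings compare equal.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // True when the query and the indexed term are the same text after normalizing both.
    static boolean sameAfterNormalizing(String query, String indexed) {
        return nfc(query).equals(nfc(indexed));
    }

    public static void main(String[] args) {
        String composed = "\u03AC";         // alpha with tonos as one code point
        String decomposed = "\u03B1\u0301"; // alpha followed by a combining acute accent

        // The raw strings differ even though they render as the same glyph.
        System.out.println(composed.equals(decomposed));
        // After normalizing both sides they match.
        System.out.println(sameAfterNormalizing(composed, decomposed));
    }
}
```

In an index this would presumably be applied once at indexing time and once to each incoming query, rather than at comparison time as shown here.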

When a user enters a search request with accents from a keyboard, they will often create a form that is decomposed. Such a request won't match an accented text well if the index has it as composed.

It needs to be fixed. In the latest Lucene there are some new analyzers that can greatly simplify all this.
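
For the accent-insensitive matching Chris asks about below, one common approach is to decompose the text and then drop the combining marks. The following is a rough sketch with plain JDK calls only; it is not Chris's "unAccent" method and not one of the Lucene analyzers, and the class and method names are hypothetical. As Chris notes, doing this conflates words that differ only in accenting, so it may be better suited to a separate unaccented field than to a replacement for the accented one.

```java
import java.text.Normalizer;

// Hypothetical helper, not JSword or Lucene code.
final class AccentStripSketch {
    // Decompose to NFD, then remove every combining mark (Unicode category M).
    // This drops Greek accents and breathings, and also Hebrew points.
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String accented = "\u03BB\u03CC\u03B3\u03BF\u03C2"; // "logos" with a tonos on the first omicron
        String plain = "\u03BB\u03BF\u03B3\u03BF\u03C2";    // the same word without the accent

        // Both sides reduce to the same unaccented string.
        System.out.println(stripMarks(accented).equals(stripMarks(plain)));
    }
}
```

Whatever is used at indexing time has to be applied to the search request as well, for the same reason as with normalization.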

In Him,
	DM

On Sep 19, 2012, at 1:48 PM, Chris Burrell <chris at burrell.me.uk> wrote:

> Hi all
> 
> As far as I can tell our current GreekAnalyzer takes into account the accenting of the underlying text, rendering searches across both unaccented and accented texts impossible (i.e. copy paste from unaccented text and search an accented text).
> 
> I'm considering adding a new filter to the GreekAnalyzer to strip out the accents. However, I assume this could have undesirable effects since the accenting in Greek can sometimes change the meaning of the word completely.
> 
> Any other ideas as to how we might do this? (same goes for Hebrew pointing/vowels/etc.)
> 
> I already have an "unAccent" method (shout if you want an untested copy). The question is whether we want to include it in the filter, or how else to do the search accurately. Any thoughts welcome.
> 
> Chris