[jsword-devel] Search improvements

Mon May 4 04:47:20 MST 2009

FYI:

Lucene is about to release 2.9. After this release, Lucene will  
require Java 5.0.

The 2.9 release has quite a few performance improvements. (I even
contributed one.) Every little bit helps.

But, there are significant changes on the analyzer and stemming front.  
Analyzers are used to properly tokenize the language. Stemming is a  
kind of analysis that heuristically identifies root words.
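
To make the distinction concrete, here is a minimal sketch in plain Java. It is not Lucene's Analyzer API. The class name ToyAnalyzer and the suffix list are made up for illustration, and the stemmer is far cruder than Snowball. It only shows the two steps: split text into tokens, then reduce each token to a guessed root.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only; this is NOT Lucene's Analyzer API.
class ToyAnalyzer {
    // Hypothetical suffix list for a very crude English stemmer.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    // Tokenize: lowercase, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (part.length() > 0) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Stem: strip the first matching suffix, keeping at least three characters.
    static String stem(String token) {
        for (String suffix : SUFFIXES) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Analyze: tokenize, then stem every token.
    static List<String> analyze(String text) {
        List<String> stems = new ArrayList<>();
        for (String token : tokenize(text)) {
            stems.add(stem(token));
        }
        return stems;
    }
}
```

With this sketch, "walked", "walking" and "walks" all index under "walk", so a search for any one form can find the others. It also shows why a stemmer change matters: the stored terms themselves change.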

This is one area where JSword would benefit, but it requires code changes.

Here is a quick list (I may have missed some):

Chinese: While we already use a Chinese analyzer, there is a new one
that is dictionary based and improves accuracy significantly.

Arabic analyzer: We currently don't use this one. We should add it.

Persian analyzer: This one has just been submitted and is now targeted
for the 2.9 release. For Persian text, it is significantly more accurate
than the Arabic analyzer, which we don't use.

German: There is a new dictionary-based method to break compound words
into searchable parts.
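
As a rough idea of what dictionary-based decompounding does, here is a small plain-Java sketch. It is not the Lucene implementation. The class ToyDecompounder, its minimum part length and the tiny word list are assumptions for illustration, and it ignores linking letters such as the German "s" between parts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only; a real decompounder handles linking elements and more.
class ToyDecompounder {
    private final Set<String> dictionary; // known lowercase words
    private final int minPartLength;      // shortest part we accept

    ToyDecompounder(Set<String> dictionary, int minPartLength) {
        this.dictionary = dictionary;
        this.minPartLength = minPartLength;
    }

    // Returns the dictionary parts that fully cover the word,
    // or the lowercased word alone when no full split exists.
    List<String> split(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        if (splitFrom(lower, 0, parts)) {
            return parts;
        }
        return Collections.singletonList(lower);
    }

    // Backtracking search that tries the longest candidate part first.
    private boolean splitFrom(String word, int start, List<String> parts) {
        if (start == word.length()) {
            return !parts.isEmpty();
        }
        for (int end = word.length(); end - start >= minPartLength; end--) {
            String candidate = word.substring(start, end);
            if (dictionary.contains(candidate)) {
                parts.add(candidate);
                if (splitFrom(word, end, parts)) {
                    return true;
                }
                parts.remove(parts.size() - 1); // undo and try a shorter part
            }
        }
        return false;
    }
}
```

With a dictionary containing "donau", "dampf" and "schiff", the word "Donaudampfschiff" splits into those three parts, so a search for "Schiff" alone could match it. An index would normally keep the original compound as a term too.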

New Snowball stemmers:
	Hungarian
	Romanian
	Turkish

There are some significant improvements regarding the parsing of  
Latin-1 languages.

Improving any language's analysis will invalidate, at least in part,
existing indexes for that language. Making such a change means we
first need to finish the versioning of the index, which I outlined in an
earlier thread.
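
For anyone who missed that thread, the general idea can be sketched as follows. This is only an illustration, not the design from the thread: the class ToyIndexVersion, the "analysis.version." property key and the use of java.util.Properties for index metadata are all assumptions. Each index records the version of the analysis it was built with, and a mismatch means the index should be rebuilt.

```java
import java.util.Properties;

// Toy illustration only; the names and the storage format are hypothetical.
class ToyIndexVersion {
    // Hypothetical key prefix; the language code is appended to it.
    static final String KEY_PREFIX = "analysis.version.";

    // True when the index has no recorded version for the language,
    // an unreadable one, or one that differs from the current version.
    static boolean needsReindex(Properties indexMetadata, String language, int currentVersion) {
        String stored = indexMetadata.getProperty(KEY_PREFIX + language);
        if (stored == null) {
            return true;
        }
        try {
            return Integer.parseInt(stored.trim()) != currentVersion;
        } catch (NumberFormatException e) {
            return true; // treat corrupt metadata as out of date
        }
    }

    // Record the analysis version used when the index was (re)built.
    static void recordVersion(Properties indexMetadata, String language, int version) {
        indexMetadata.setProperty(KEY_PREFIX + language, Integer.toString(version));
    }
}
```

Keeping the version per language means that an improved German analyzer would only force German indexes to be rebuilt, leaving the others alone.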

In Him,
	DM