[jsword-devel] Search improvements
DM Smith
dmsmith at crosswire.org
Mon May 4 04:47:20 MST 2009
FYI:
Lucene is about to release 2.9. After this release, Lucene will
require Java 5.0.
The 2.9 release has quite a few performance improvements. (I even contributed
one.) Every little bit helps.
But, there are significant changes on the analyzer and stemming front.
Analyzers are used to properly tokenize the language. Stemming is a
kind of analysis that heuristically identifies root words.
This is one area where JSword benefits, but it requires code changes.
Here is a quick list (I may have missed some):
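To make the difference between tokenizing and stemming concrete, here is a toy sketch in plain Java. It is not Lucene's analyzer API: the class name, the suffix list and the three-letter minimum are assumptions made up for illustration, and a real stemmer such as Snowball uses far richer, language-specific rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only: this is NOT Lucene's Analyzer API.
final class ToyAnalyzer {

    // Hypothetical English suffix list, checked in order.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s"};

    // Tokenize: lowercase the text and split on runs of non-letter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Stem: strip the first matching suffix, keeping at least three letters.
    static String stem(String token) {
        for (String suffix : SUFFIXES) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Analyze: tokenize, then stem every token.
    static List<String> analyze(String text) {
        List<String> stems = new ArrayList<>();
        for (String token : tokenize(text)) {
            stems.add(stem(token));
        }
        return stems;
    }

    public static void main(String[] args) {
        // prints [he, walk, walk, and, walk]
        System.out.println(analyze("He walked, walking and walks"));
    }
}
```

Because "walked", "walking" and "walks" all reduce to "walk", a search for any one of them can match the others. That is also why changing the rules changes what ends up stored in an index.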
Chinese: While we already use a Chinese analyzer, there is a new one that is
dictionary based and improves accuracy significantly.
Arabic analyzer: We currently don't use this one. We should add it.
Persian analyzer: This one has just been submitted and is now targeted
for the 2.9 release. It is significantly more accurate than the Arabic
analyzer, which we don't use.
German: There is a new dictionary based method to break compound words
into searchable parts.
New Snowball stemmers:
Hungarian
Romanian
Turkish
There are some significant improvements regarding the parsing of
Latin-1 languages.
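The dictionary-based compound splitting mentioned for German can be sketched roughly as follows. This is a simplified illustration in plain Java, not the Lucene implementation: the class name, the tiny word list and the minimum part length are assumptions, and it ignores linking letters such as the "s" in "Arbeitszimmer".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified sketch of dictionary-based compound splitting (not Lucene code).
final class CompoundSplitter {
    private final Set<String> dictionary;
    private final int minPartLength;

    CompoundSplitter(Set<String> dictionary, int minPartLength) {
        this.dictionary = dictionary;
        this.minPartLength = minPartLength;
    }

    // Returns the dictionary words that make up the compound,
    // or the lowercased word itself if no full decomposition exists.
    List<String> split(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        if (decompose(lower, 0, parts)) {
            return parts;
        }
        return Collections.singletonList(lower);
    }

    // Backtracking search that tries the longest dictionary word first.
    private boolean decompose(String word, int start, List<String> parts) {
        if (start == word.length()) {
            return !parts.isEmpty();
        }
        for (int end = word.length(); end >= start + minPartLength; end--) {
            String candidate = word.substring(start, end);
            if (dictionary.contains(candidate)) {
                parts.add(candidate);
                if (decompose(word, end, parts)) {
                    return true;
                }
                parts.remove(parts.size() - 1); // undo and try a shorter prefix
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<>(Arrays.asList("rind", "fleisch", "haus", "tür"));
        CompoundSplitter splitter = new CompoundSplitter(words, 3);
        System.out.println(splitter.split("Rindfleisch")); // prints [rind, fleisch]
    }
}
```

Indexing the parts alongside the compound lets a search for "Fleisch" find verses that only contain "Rindfleisch".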
Improving any language's analysis will invalidate, at least in part,
existing indexes for those languages. Making such a change means first
finishing the versioning of the index, which I outlined in an
earlier thread.
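As a rough illustration of why index versioning matters here, the sketch below shows one simple way such a check could work. It does not reproduce the design from the earlier thread: the property key, the class name and the idea of a single integer analyzer version are all hypothetical.

```java
import java.util.Properties;

// Hypothetical sketch: decide whether an existing index must be rebuilt.
final class IndexVersionCheck {

    // Made-up property key, not an actual JSword setting.
    static final String ANALYZER_VERSION_KEY = "analyzer.version";

    // True when the index was built with a different analyzer version,
    // or when no usable version was recorded at all.
    static boolean needsReindex(Properties indexMetadata, int currentAnalyzerVersion) {
        String stored = indexMetadata.getProperty(ANALYZER_VERSION_KEY);
        if (stored == null) {
            return true; // index predates versioning
        }
        try {
            return Integer.parseInt(stored.trim()) != currentAnalyzerVersion;
        } catch (NumberFormatException e) {
            return true; // unreadable version: rebuild to be safe
        }
    }

    public static void main(String[] args) {
        Properties metadata = new Properties();
        metadata.setProperty(ANALYZER_VERSION_KEY, "1");
        System.out.println(needsReindex(metadata, 2)); // prints true
    }
}
```

With a check like this, bumping the version for one language's analyzer would flag only that language's indexes for rebuilding.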
In Him,
DM