[jsword-devel] Lucene Analyzers patch

Tue Oct 23 21:33:01 MST 2007

Hi,

I have been wanting to improve the Lucene analyzers used during indexing
and search. Attached is the patch (finally!) that selects analyzers based on
the Bible's language. The change summary follows:

1. Introduced AnalyzerFactory, which uses a properties file to instantiate
analyzers based on the book language. AnalyzerFactory is used only for the
"content" field; all other fields (key/strongs/xref/notes) are unaffected.
AnalyzerFactory.properties provides the configuration for stemming, stop
words and the Analyzer class to use, on a per-language basis. By default,
stop words are NOT removed and stemming is done (if available for the book
language).

2. Stemming is done for all languages available through Snowball (the Lucene
snowball package, net.sf.snowball.ext) and Lucene contrib (e.g. GreekAnalyzer
in http://lucene.apache.org/java/2_2_0/api/).
Stemming is done for the Snowball languages: Danish, Dutch, English, Finnish,
French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish.

3. Tokenization corrected for Czech, Greek, Chinese, Japanese and Thai.
Chinese/Japanese/Thai text is now tokenized on every character (SimpleAnalyzer
tokenization was breaking for these languages).

4. Accented characters are normalized (for ISO Latin-1 languages only) in
SimpleLuceneAnalyzer.java. This is the default analyzer, used for every
language that has no other implementation specified in the properties.
It behaves like Lucene's SimpleAnalyzer, plus accented character
normalization.

5. EnglishLuceneAnalyzer.java works like Lucene's SimpleAnalyzer plus stemming
(LowerCaseTokenizer > PorterStemFilter). The stop word filter is off by
default.

6. IndexMetadata.properties specifies the index version. Current BD users who
do not want to reindex should be able to search with no problem. I am not
sure how best to offer the user an index upgrade in the UI.
For index versioning, I came up with the following, based on my knowledge of
the jsword index history:
   1.0 : Original index format. Fields = key, content; Analyzer =
         SimpleAnalyzer
   1.1 : Added fields = strong, heading, xref, note
   1.2 : Added natural language analysis (stemming, CJK tokenization,
         optionally stop words)
Note: I am keeping the version at 1.1 (from BD 1.0.7) by default. If you want
to test this patch, you will have to change the following in
IndexMetadata.properties:
        Installed.Index.Version=1.2
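
A minimal sketch of how the version gate in item 6 could work, assuming the
version is a plain major.minor string. Only the Installed.Index.Version key
comes from the patch; the class and method names are hypothetical and are not
taken from the attached code.

```java
import java.util.Properties;

// Hypothetical helper; only the Installed.Index.Version key is from the patch.
class IndexVersionSketch {
    static final String VERSION_KEY = "Installed.Index.Version";
    static final String DEFAULT_VERSION = "1.1"; // default kept from BD 1.0.7

    /** Compares two major.minor version strings numerically, so 1.10 > 1.2. */
    static int compare(String a, String b) {
        String[] pa = a.trim().split("\\.");
        String[] pb = b.trim().split("\\.");
        for (int i = 0; i < 2; i++) {
            int na = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int nb = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (na != nb) {
                return na < nb ? -1 : 1;
            }
        }
        return 0;
    }

    /** True when the installed index was built with language analysis (1.2+). */
    static boolean usesLanguageAnalyzers(Properties metadata) {
        String installed = metadata.getProperty(VERSION_KEY, DEFAULT_VERSION);
        return compare(installed, "1.2") >= 0;
    }
}
```

With a gate like this, an existing 1.1 index would keep being searched with
the old analyzer, and only a 1.2 index would go through the new analyzers.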

=========================================================
Testing Done:
-Junit tests for AnalyzerFactory and language analyzers
-Tested BD search for all major language categories.
-Tested that BD 1.0.7 index is searchable with this patch when
Installed.Index.Version=1.1 in IndexMetadata.properties.
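
For anyone testing the Chinese/Japanese/Thai case from item 3, here is a rough
illustration of what "tokenized on every character" means, written in plain
Java without Lucene. It is not the tokenizer from the patch, and the class
name is hypothetical. A tokenizer that splits only on non-letters would
presumably return an unspaced sentence as one long token, which is why
per-character tokens help.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only; intended for text in scripts written without spaces.
class PerCharacterTokenizerSketch {
    /** Emits one token per letter or digit code point; the rest are separators. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                tokens.add(new String(Character.toChars(codePoint)));
            }
            // Advance by 1 or 2 chars so supplementary characters stay whole.
            i += Character.charCount(codePoint);
        }
        return tokens;
    }
}
```

A three-character phrase followed by an ideographic full stop yields three
single-character tokens, so a search for any one of them can match.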

Related Jira Issues:
JS-21 Add the ability to search by word stems : Done for lucene analysis
supported languages
JS-18 Dont index accents : Done for latin-1 languages
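
To illustrate the JS-18 behaviour (item 4), here is a sketch of accent folding
followed by SimpleAnalyzer-style lower-cased letter tokens. It uses
java.text.Normalizer from the JDK rather than whatever filter the patch uses,
so treat it as an approximation: it strips combining marks only and does not
expand ligatures or characters such as the German sharp s. The class name is
hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Approximation using the JDK; not the analyzer shipped in the patch.
class AccentFoldingSketch {
    /** Decomposes accented characters and drops the combining marks. */
    static String foldAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Folds accents, lower-cases, and splits on anything that is not a letter. */
    static List<String> analyze(String text) {
        String folded = foldAccents(text).toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < folded.length(); i++) {
            char c = folded.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Because the same folding runs at index time and at query time, a search typed
without accents can match accented text in the book.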

ChangeList:
-lucene package
- new analysis package
-New jars: lucene-analyzers-2.2.0.jar , lucene-snowball-2.2.0.jar
-Junit tests
-Commons ant script

=========================================================

I will appreciate all comments/reviews, especially testing of search in
Bibles in multiple languages. To test this patch:
1. In IndexMetadata.properties, change to Installed.Index.Version=1.2
2. Reindex the Bible in BD (deleting the index first), then search
3. Changing the log level of org.crosswire.jsword.index.lucene.LuceneIndex
to FINE prints the parsed query for every search in BibleDesktop
Looking forward to hearing your feedback,
Sijo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jsword.newanalyzers.patch
Type: application/octet-stream
Size: 126426 bytes
Desc: not available
Url : http://www.crosswire.org/pipermail/jsword-devel/attachments/20071024/103eacb4/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: common.antscript.patch
Type: application/octet-stream
Size: 524 bytes
Desc: not available
Url : http://www.crosswire.org/pipermail/jsword-devel/attachments/20071024/103eacb4/attachment-0003.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lucene-snowball-2.2.0.jar
Type: application/java-archive
Size: 90882 bytes
Desc: not available
Url : http://www.crosswire.org/pipermail/jsword-devel/attachments/20071024/103eacb4/attachment-0002.jar 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lucene-analyzers-2.2.0.jar
Type: application/java-archive
Size: 72468 bytes
Desc: not available
Url : http://www.crosswire.org/pipermail/jsword-devel/attachments/20071024/103eacb4/attachment-0003.jar