[jsword-devel] Lucene Analyzers patch

Fri Oct 26 11:51:50 MST 2007

Sijo,
    Sorry for not replying earlier. I have applied the patch, modified
it slightly to clean up "checkstyle" complaints, and checked it in. (It
has built cleanly on the server and is part of the nightly download.)
There are a couple of tests that fail; I am not sure why. I have not had
much time to look at it, but I like what I saw. Hopefully, I'll be able
to review it carefully and make suggestions. :)

Many, many thanks!

In Him,
DM

Sijo Cherian wrote:
> Hi,
>
> I have been wanting to improve the Lucene analyzers used during
> indexing and search. Attached is the patch (finally!) that selects
> analyzers based on the Bible's language. The change summary follows:
>
> 1. Introduced AnalyzerFactory, which uses a property file to
> instantiate analyzers based on the book language. AnalyzerFactory is
> used only for the "content" field; all other fields, such as
> key/strongs/xref/notes, are unaffected.
> AnalyzerFactory.properties provides the configuration for stemming,
> stop words, and the Analyzer class to use (on a per-language basis).
> By default, stop words are NOT removed and stemming is done (if
> available for the book language).
>
> 2. Stemming is done for all languages available through Snowball
> (Lucene snowball package net.sf.snowball.ext) and Lucene contrib (e.g.
> GreekAnalyzer in http://lucene.apache.org/java/2_2_0/api/).
> Stemming is done for the Snowball languages (Danish, Dutch, English,
> Finnish, French, German, Italian, Norwegian, Portuguese, Russian,
> Spanish, Swedish).
>
> 3. Tokenization corrected for Czech, Greek, Chinese, Japanese and Thai.
> Chinese/Japanese/Thai now get tokenized on every character
> (SimpleAnalyzer tokenization was breaking for these languages).
>
> 4. Accented characters are normalized (for ISO Latin-1 languages only)
> in SimpleLuceneAnalyzer.java. This is the default analyzer, used for
> all languages unless another implementation is specified in the
> properties.
> This default analyzer is similar to Lucene's SimpleAnalyzer, with
> accented character normalization added.
>
> 5. EnglishLuceneAnalyzer.java works like Lucene's SimpleAnalyzer plus
> stemming (LowerCaseTokenizer > PorterStemFilter). The stop word filter
> is off by default.
>
> 6. IndexMetadata.properties specifies the index version. Current BD
> users who do not want to reindex should be able to search with no
> problem. I am not sure how best to present the user with an option in
> the UI for upgrading the index.
> For index versioning, I came up with the following, based on my
> knowledge of JSword index history:
>    1.0 : Original index format. Uses: fields = key, content;
> Analyzer = SimpleAnalyzer
>    1.1 : Added fields = strong, heading, xref, note
>    1.2 : Added natural language analysis (stemming, CJK tokenization,
> optionally stop words)
> Note: I am keeping the version as 1.1 (from BD 1.0.7) by default. If
> you want to test this patch you will have to change the following in
> IndexMetadata.properties:
>         Installed.Index.Version=1.2
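
For readers following item 1, here is a minimal sketch of how a per-language lookup in a properties file could work. It is not the code from the patch: the class name, the key layout (Language.Analyzer, Language.Stem, Language.StopWord) and the sample values are all assumptions made for illustration. The defaults mirror what the summary states (stemming on, stop word removal off).

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch; the real AnalyzerFactory.properties layout may differ.
class AnalyzerConfigSketch {
    // Assumed key layout, inlined so the example is self-contained.
    static final String SAMPLE =
        "Default.Analyzer=SimpleLuceneAnalyzer\n"
        + "English.Analyzer=EnglishLuceneAnalyzer\n"
        + "English.StopWord=false\n"
        + "English.Stem=true\n";

    // Parse properties text; wrap the checked exception for brevity.
    static Properties load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return p;
    }

    // Analyzer name for a language, falling back to the default entry.
    static String analyzerFor(Properties p, String language) {
        return p.getProperty(language + ".Analyzer",
                             p.getProperty("Default.Analyzer"));
    }

    // Per-language boolean option, with a default when the key is absent.
    static boolean flag(Properties p, String language, String name, boolean dflt) {
        String v = p.getProperty(language + "." + name);
        return v == null ? dflt : Boolean.parseBoolean(v.trim());
    }

    public static void main(String[] args) {
        Properties p = load(SAMPLE);
        System.out.println(analyzerFor(p, "English"));
        System.out.println(analyzerFor(p, "Hebrew"));
        System.out.println(flag(p, "German", "Stem", true));
    }
}
```

A real factory would go on to instantiate the named class; the sketch stops at the lookup so that it does not depend on Lucene.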
>
> =========================================================  
> Testing Done:
> - JUnit tests for AnalyzerFactory and the language analyzers
> - Tested BD search for all major language categories.
> - Tested that a BD 1.0.7 index is searchable with this patch when
> Installed.Index.Version=1.1 in IndexMetadata.properties.
>
> Related Jira Issues:
> JS-21 Add the ability to search by word stems : Done for lucene 
> analysis supported languages
> JS-18 Dont index accents : Done for latin-1 languages
>
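
On the accent handling (item 4 and JS-18), the idea can be sketched with the JDK alone. This is an illustration, not the patch's SimpleLuceneAnalyzer: it assumes Java 6 or later for java.text.Normalizer, and the class and method names are hypothetical. It decomposes each character, drops the combining marks, and lower-cases the result so that an accented and an unaccented spelling produce the same term.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical sketch of accent folding; not the code from the patch.
class AccentFoldSketch {
    static String fold(String s) {
        // NFD splits an accented letter into base letter + combining mark(s).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove the combining marks (Unicode category M), then lower-case.
        // Letters with no canonical decomposition pass through unchanged.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        System.out.println(fold("\u00C9l\u00E9azar"));
        System.out.println(fold("ni\u00F1o"));
    }
}
```

The same folding has to be applied at index time and at query time, which is why it belongs in the analyzer.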
> ChangeList:
> - lucene package
> - New analysis package
> - New jars: lucene-analyzers-2.2.0.jar, lucene-snowball-2.2.0.jar
> - JUnit tests
> - Commons ant script
>
> =========================================================
>
> I would appreciate any comments/reviews, especially testing of search
> in Bibles of multiple languages. To test this patch:
> 1. In IndexMetadata.properties, change to Installed.Index.Version=1.2
> 2. Reindex the Bible in BD (deleting the existing index first), then
> search
> 3. Changing the logging level of
> org.crosswire.jsword.index.lucene.LuceneIndex to FINE will print the
> parsed query for every search in BibleDesktop
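
For step 3, one way to raise that logger to FINE programmatically is sketched below. It assumes the class logs through java.util.logging under its own class name (FINE is a java.util.logging level, but the wiring is an assumption here); the sketch class name is hypothetical. A handler at FINE is attached as well, because the default console handler only prints INFO and above.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; assumes JSword logs via java.util.logging.
class LuceneIndexLoggingSketch {
    static final String LOGGER_NAME =
        "org.crosswire.jsword.index.lucene.LuceneIndex";

    static Logger enableFine() {
        Logger logger = Logger.getLogger(LOGGER_NAME);
        logger.setLevel(Level.FINE);
        // The handler has its own threshold, so it must be lowered too.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);
        return logger; // keep a reference so the setting is not lost
    }

    public static void main(String[] args) {
        Logger logger = enableFine();
        logger.fine("a FINE message is now printed to the console");
    }
}
```

The equivalent line in a java.util.logging configuration file would be org.crosswire.jsword.index.lucene.LuceneIndex.level = FINE, together with a handler level of FINE.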
>
>
> Looking forward to hearing your feedback,
> Sijo
>