Hi,<br><br>I have been wanting to improve the Lucene analyzers used during indexing and search. Attached is the patch (finally!) that selects analyzers based on the language of the Bible. Here is the change summary:<br><br>1. Introduced AnalyzerFactory, which uses a properties file to instantiate analyzers based on the book language. AnalyzerFactory is used only for the "content" field; all other fields, such as key/strongs/xref/notes, are unaffected.
<br>AnalyzerFactory.properties provides the configuration for stemming, stop words and the Analyzer class to use (on a per-language basis). By default, stop words are NOT removed and stemming is done (if available for the book language).
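<br><br>To make the per-language lookup concrete, here is a minimal sketch of how such a properties file could be read. It is not the code from the patch: the key names (Default.Analyzer, <Language>.Analyzer, <Language>.Stem, <Language>.StopWord) and the class AnalyzerConfigSketch are hypothetical, chosen only to mirror the defaults described above (stemming on, stop word removal off).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch only: key names and this class are hypothetical, not the patch's real format.
class AnalyzerConfigSketch {
    final String analyzerClass;    // fully qualified Analyzer class name to instantiate
    final boolean stemming;        // defaults to true, as described in the post
    final boolean removeStopWords; // defaults to false, as described in the post

    AnalyzerConfigSketch(String analyzerClass, boolean stemming, boolean removeStopWords) {
        this.analyzerClass = analyzerClass;
        this.stemming = stemming;
        this.removeStopWords = removeStopWords;
    }

    // Resolve the settings for one book language, falling back to the default analyzer.
    static AnalyzerConfigSketch forLanguage(Properties props, String language) {
        String fallback = props.getProperty("Default.Analyzer");
        String cls = props.getProperty(language + ".Analyzer", fallback);
        boolean stem = Boolean.parseBoolean(props.getProperty(language + ".Stem", "true"));
        boolean stop = Boolean.parseBoolean(props.getProperty(language + ".StopWord", "false"));
        return new AnalyzerConfigSketch(cls, stem, stop);
    }

    // Parse properties from a string so the sketch needs no file on disk.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

A real factory would presumably go one step further and instantiate the resolved class reflectively, e.g. with Class.forName(name).getDeclaredConstructor().newInstance().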
<br><br>2. Stemming is done for all languages available through Snowball (Lucene snowball package net.sf.snowball.ext) and Lucene contrib (e.g. GreekAnalyzer in <a href="http://lucene.apache.org/java/2_2_0/api/">http://lucene.apache.org/java/2_2_0/api/</a>).<br>Stemming done for: Snowball languages (Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish)<br><br>3. Tokenization corrected for: Czech, Greek, Chinese, Japanese and Thai
<br>Chinese/Japanese/Thai text is now tokenized on every character (SimpleAnalyzer tokenization was breaking for these languages).<br><br>4. Accented characters are normalized (for ISO Latin-1 languages only) in SimpleLuceneAnalyzer.java. This is the default analyzer used for all languages if another implementation is not specified in the properties.<br>This default analyzer is similar to Lucene's SimpleAnalyzer, with added accented-character normalization.<br>
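<br>As an illustration of the accent normalization in item 4, here is a small standalone sketch. It uses java.text.Normalizer from the JDK as a stand-in and is not how SimpleLuceneAnalyzer is implemented in the patch; the class and method names are made up for the example. Note that this approach only strips combining marks, so letters without a Unicode decomposition (for example the German sharp s) pass through unchanged.

```java
import java.text.Normalizer;
import java.util.Locale;

// Sketch only: a JDK-based stand-in for accent folding, not the patch's analyzer code.
class AccentFoldSketch {
    // Lower-case the text and drop diacritics so accented and plain forms index alike.
    static String fold(String text) {
        // NFD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks; removing them leaves the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }
}
```

With this, AccentFoldSketch.fold("\u00C9criture") returns "ecriture", so a search typed without the accent matches the accented word in the text.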
<br>5. <br>EnglishLuceneAnalyzer.java works like Lucene's SimpleAnalyzer plus stemming (LowerCaseTokenizer > PorterStemFilter). The stop word filter is off by default. <br><br>6.<br>IndexMetadata.properties specifies the index version. Current BD users who do not want to reindex should be able to search with no problem. I am not sure how best to offer the user an option in the UI for upgrading the index.
<br>For index versioning, I came up with the following, based on my knowledge of the JSword index history:<br> 1.0 : Original index format. Uses: fields = key,content; Analyzer = SimpleAnalyzer<br> 1.1 : Added fields = strong, heading, xref, note
<br> 1.2 : Added natural language analysis (stemming, CJK tokenization, optionally stop words)<br>Note: I am keeping the version at 1.1 (from BD 1.0.7) by default. If you want to test this patch, you will have to change the following in IndexMetadata.properties:<br> Installed.Index.Version=1.2<br><br>=========================================================<br>Testing Done:<br>-JUnit tests for AnalyzerFactory and language analyzers<br>-Tested BD search for all major language categories.
<br>-Tested that a BD 1.0.7 index is searchable with this patch when Installed.Index.Version=1.1 in IndexMetadata.properties.<br><br>Related Jira Issues:<br>JS-21 Add the ability to search by word stems : Done for languages supported by Lucene analysis
<br>JS-18 Dont index accents : Done for Latin-1 languages<br><br>ChangeList:<br>-lucene package<br>-new analysis package<br>-New jars: lucene-analyzers-2.2.0.jar, lucene-snowball-2.2.0.jar<br>-JUnit tests<br>-Commons ant script
<br><br>=========================================================<br><br>I would appreciate all comments/reviews, especially testing of the search with Bibles in multiple languages. To test this patch:<br>1. In IndexMetadata.properties, change to Installed.Index.Version=1.2<br>2. Reindex the Bible in BD (by deleting the index first), then search<br>3. Changing the logging level of org.crosswire.jsword.index.lucene.LuceneIndex to FINE will print the parsed query for every search in BibleDesktop
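<br><br>For step 3, one way to raise the log level from code is sketched below. It assumes the logger is backed by java.util.logging (the level name FINE suggests so, but that is an assumption on my part); the class name QueryLoggingSketch is made up for the example. Only the logger name comes from the step above.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: assumes java.util.logging is the logging backend.
class QueryLoggingSketch {
    // Logger name as given in step 3 above.
    static final String LOGGER_NAME = "org.crosswire.jsword.index.lucene.LuceneIndex";

    // Returns the logger so the caller can keep a strong reference to it;
    // the LogManager only holds loggers weakly.
    static Logger enableFine() {
        Logger logger = Logger.getLogger(LOGGER_NAME);
        logger.setLevel(Level.FINE);
        // The default console handler only passes INFO and above,
        // so attach a handler that lets FINE records through.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);
        return logger;
    }
}
```

The same effect can usually be had without code by adding org.crosswire.jsword.index.lucene.LuceneIndex.level=FINE and java.util.logging.ConsoleHandler.level=FINE to the logging.properties file in use.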
<br><br><br>Looking forward to hearing your feedback,<br>Sijo<br><br>