[sword-devel] indexed search discrepancy

Sun Aug 30 07:17:56 MST 2009

On Aug 29, 2009, at 10:42 PM, Matthew Talbert wrote:

>
>
>> If backward compatibility is ok to be broken, I suggest changing from
>> StandardAnalyzer to SimpleAnalyzer. It does not have stopwords to begin
>> with and will index the text without the silly transformations that the
>> StandardAnalyzer does.
>
> Just out of curiosity, what are the silly transformations?

See: http://www.gossamer-threads.com/lists/lucene/java-user/80838

Basically, the StandardAnalyzer has a tokenizer that recognizes
complex patterns to determine word boundaries. By and large, the
patterns it looks for (e-mail addresses, host names, ...) won't be
found in the Bible, though they may turn up in commentaries and gen
books. So there is a cost to running an expensive analyzer that
generally does nothing and occasionally does something unexpected.
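
To make the difference concrete, here is a rough sketch in Python. It is
not Lucene code: the regular expression is a toy stand-in that I made up
for illustration, not the grammar the StandardAnalyzer actually uses, and
both function names are hypothetical. It only shows the kind of thing a
pattern-aware tokenizer does compared with one that splits on letters.

```python
import re

# Toy pattern: an e-mail address, a dotted host name, or a plain word.
# Illustration only; this is NOT the grammar StandardAnalyzer uses.
PATTERN_AWARE = re.compile(
    r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"  # e-mail address
    r"|[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"                 # host name
    r"|[A-Za-z0-9]+"                                      # plain word
)


def pattern_aware_tokens(text):
    """Tokenize with the toy pattern above and lowercase each token."""
    return [m.group(0).lower() for m in PATTERN_AWARE.finditer(text)]


def letter_only_tokens(text):
    """Split at every non-letter (ASCII only here) and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


if __name__ == "__main__":
    for sample in ("Write to user@example.org",
                   "created the heaven.And the earth"):
        print(pattern_aware_tokens(sample))
        print(letter_only_tokens(sample))
```

On ordinary verse text the two give the same tokens. The second sample
has a missing space after a period; the toy pattern takes "heaven.And"
for a host name and keeps it as one token, so a search for "heaven"
would miss it. That is the sort of surprise I mean.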

The SimpleAnalyzer merely looks for word boundaries that are  
appropriate for English. It is not appropriate for languages that have  
different punctuation or word boundaries. There are a bunch of  
contributed analyzers for different languages (e.g. Thai, Chinese)  
that are more appropriate for them. In the upcoming Lucene 3.0 release  
there will be analyzers for more languages, including Farsi. These  
could be ported from Java to C++ if they are valuable to SWORD.
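
As a sketch of why letter-based splitting suits English but not every
language, here is a small Python approximation of what the
SimpleAnalyzer does (split at non-letters, then lowercase). The function
name is hypothetical and Python's idea of a "letter" is only assumed to
be close to Java's, so treat it as an illustration rather than a port.

```python
from itertools import groupby


def simple_tokens(text):
    """Keep each maximal run of letters as one token, lowercased."""
    return ["".join(run).lower()
            for is_letter, run in groupby(text, key=str.isalpha)
            if is_letter]


if __name__ == "__main__":
    # English: punctuation and spaces mark the word boundaries.
    print(simple_tokens("In the beginning, God created."))
    # Chinese: no spaces between words, so the whole phrase is one token.
    print(simple_tokens("太初有道"))
```

The English sample splits into five words, but the Chinese phrase comes
back as a single token because nothing in it is a non-letter. A search
for one word inside that phrase would not match, which is why such
languages need their own analyzers.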

Another area that contributors to JSword have found useful is
stemming, which is an option on the JSword analyzers. Stemmers are
available for a number of languages.
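
For anyone unfamiliar with stemming, the idea is that the index and the
query are both reduced to a common stem, so that related word forms
match each other. The Python below is a deliberately naive suffix
stripper that I wrote for illustration; it is not the algorithm JSword
or any real stemmer uses, and the suffix list and function name are
made up.

```python
# Naive English suffixes, longest first. A real stemmer has many more
# rules and exceptions; this list is only for illustration.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Lowercase the word and strip one suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    # Index terms and a query term all reduce to the same stem.
    print([naive_stem(w) for w in ("loved", "loves", "loving")])
    print(naive_stem("Loved") == naive_stem("loving"))
```

With this, a search for "loving" would also find verses containing
"loved" or "loves". The stem itself need not be a real word; it only
has to be the same on both sides.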

In Him,
	DM