[sword-devel] indexed search discrepancy
DM Smith
dmsmith at crosswire.org
Sun Aug 30 07:17:56 MST 2009
On Aug 29, 2009, at 10:42 PM, Matthew Talbert wrote:
>
>
>> If backward compatibility is ok to be broken, I suggest changing from
>> StandardAnalyzer to SimpleAnalyzer. It does not have stopwords to begin
>> with and will index the text without the silly transformations that the
>> StandardAnalyzer does.
>
> Just out of curiosity, what are the silly transformations?
See: http://www.gossamer-threads.com/lists/lucene/java-user/80838
Basically, the StandardAnalyzer has a tokenizer that recognizes
complex patterns (e-mail addresses, host names, ...) when determining
word boundaries. By and large, these patterns won't be found in the
Bible, though they may turn up in commentaries and gen books. So there
is the cost of running an expensive analyzer that generally does
nothing and occasionally does something unexpected.
The SimpleAnalyzer merely looks for word boundaries that are
appropriate for English. It is not appropriate for languages that have
different punctuation or word boundaries. There are a bunch of
contributed analyzers for different languages (e.g. Thai, Chinese)
that are more appropriate for them. In the upcoming Lucene 3.0 release
there will be analyzers for more languages, including Farsi. These
could be ported from Java to C++ if they are valuable to SWORD.
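To make the difference concrete, here is a rough sketch of the kind of
tokenizing the SimpleAnalyzer does: split at every non-letter and
lowercase the result. It is plain Python written for illustration, not
Lucene or SWORD code, and the function name simple_tokens and the
address someone@example.org are made up for the example.

```python
def simple_tokens(text):
    """Split text at every non-letter character and lowercase each token."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            # Letters extend the current token (lowercased as we go).
            current.append(ch.lower())
        elif current:
            # Any non-letter ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# No stopword list is applied, so "the" and "and" are kept.
print(simple_tokens("In the beginning God created the heaven and the earth."))
# ['in', 'the', 'beginning', 'god', 'created', 'the', 'heaven', 'and', 'the', 'earth']

# An e-mail address is simply broken at '@' and '.'; no pattern is recognized.
print(simple_tokens("Write to someone@example.org"))
# ['write', 'to', 'someone', 'example', 'org']
```

Because the rule is only "letters versus everything else", it says
nothing useful about scripts that do not separate words with spaces or
punctuation, which is why the language-specific analyzers matter.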
Another area that contributors to JSword have found useful is stemming.
It is an option on the JSword analyzers, and stemmers are available for
a number of languages.
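For anyone unfamiliar with the idea, stemming reduces inflected forms to
a common stem so that a search for one form finds the others. The
following is only a toy illustration in plain Python. It is not the
algorithm JSword or Lucene uses, and the suffix list and the name
toy_stem are invented for the example. Real stemmers are far more
careful.

```python
# Hypothetical, deliberately tiny suffix list for English.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# All four forms collapse to the same index term.
print([toy_stem(w) for w in ["pray", "prays", "prayed", "praying"]])
# ['pray', 'pray', 'pray', 'pray']

# The length check keeps short words such as "king" intact.
print(toy_stem("king"))
# king
```

A rule this crude mangles plenty of words (it turns "bless" into
"bles"), which is part of why a proper stemmer is needed for each
language.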
In Him,
DM