[sword-devel] indexed search discrepancy
DM Smith
dmsmith at crosswire.org
Sun Aug 30 14:14:20 MST 2009
On Aug 30, 2009, at 4:15 PM, Matthew Talbert <ransom1982 at gmail.com> wrote:
>>> Just out of curiosity, what are the silly transformations?
>>
>> See: http://www.gossamer-threads.com/lists/lucene/java-user/80838
>>
>> Basically, the StandardAnalyzer has a tokenizer that recognizes
>> complex patterns to determine word boundaries. By and large, these
>> transformations (e-mail addresses, host names, ...) won't be found
>> in the Bible. Maybe in commentaries and gen books. But there is a
>> cost of running an expensive analyzer that generally does nothing
>> and occasionally does something unexpected.
>>
>> The SimpleAnalyzer merely looks for word boundaries that are
>> appropriate for English. It is not appropriate for languages that
>> have different punctuation or word boundaries. There are a bunch of
>> contributed analyzers for different languages (e.g. Thai, Chinese)
>> that are more appropriate for them. In the upcoming Lucene 3.0
>> release there will be analyzers for more languages, including
>> Farsi. These could be ported from Java to C++ if they are valuable
>> to SWORD.
>
> But the StandardAnalyzer is no more appropriate for non-English, correct?
It is no more appropriate, and it may well be less so.
> So unless we have the non-English analyzers, then there is no
> value in using the StandardAnalyzer over the simple?
Even with the non-English analyzers there is no value in using the
StandardAnalyzer over the SimpleAnalyzer.
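To make the difference concrete, here is a rough sketch against the
Lucene 2.x Java API (the field name and the sample text are made up,
and the CLucene equivalents should be analogous). StandardAnalyzer
keeps an e-mail address as a single token, while SimpleAnalyzer breaks
it at every non-letter character:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerDemo {
        // Print every token an analyzer produces for the given text.
        static void dump(Analyzer analyzer, String text) throws Exception {
            TokenStream stream =
                analyzer.tokenStream("body", new StringReader(text));
            Token token;
            while ((token = stream.next()) != null) {
                System.out.println(token.termText());
            }
        }

        public static void main(String[] args) throws Exception {
            String text = "Mail dmsmith@crosswire.org about the module.";
            // StandardAnalyzer: mail, dmsmith@crosswire.org, about, module
            // (it also lowercases and drops stop words such as "the").
            dump(new StandardAnalyzer(), text);
            // SimpleAnalyzer: mail, dmsmith, crosswire, org, about, the, module
            dump(new SimpleAnalyzer(), text);
        }
    }

For Bible text neither analyzer will ever hit the e-mail or host-name
rules, so the extra machinery in the StandardAnalyzer buys nothing.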
> clucene is still
> trying to become compatible with Lucene 2 (I think it's largely done,
> but not released yet). If these analyzers are for Lucene 3.0
Most are part of 2.x.
> is it
> possible that it would take substantial work to port them to clucene
> which is still stuck in Lucene 1 compatibility?
I don't think the effort would be much harder than doing an initial
port to the same level. A tokenizer merely takes an input stream,
breaks it up into tokens, and returns one token each time next(...) is
called. What differs between the releases is how next() is
implemented; the algorithm itself is the same. (BTW, I am a Lucene
contributor with respect to tokenizers, so my point is not merely
academic. ;)
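For example, a bare-bones tokenizer in the old next() style looks
roughly like this (the class name and the letter-only boundary rule
are only for illustration; a real port would follow whatever rules the
analyzer being ported uses):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    public class LetterOnlyTokenizer extends Tokenizer {
        private int offset = 0;  // characters read so far

        public LetterOnlyTokenizer(Reader reader) {
            super(reader);
        }

        // Return the next token, or null when the stream is exhausted.
        public Token next() throws IOException {
            StringBuffer term = new StringBuffer();
            int start = offset;
            int c;
            while ((c = input.read()) != -1) {
                offset++;
                if (Character.isLetter((char) c)) {
                    if (term.length() == 0) {
                        start = offset - 1;   // the token begins here
                    }
                    term.append(Character.toLowerCase((char) c));
                } else if (term.length() > 0) {
                    break;                    // a non-letter ends the token
                } else {
                    start = offset;           // still skipping separators
                }
            }
            if (term.length() == 0) {
                return null;
            }
            return new Token(term.toString(), start, start + term.length());
        }
    }

Whether the token is handed back through next(), the reusable
next(Token) from 2.x, or the newer attribute-based incrementToken(),
the loop in the middle stays the same; only the way the token is
returned changes.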
In His Service,
DM