[sword-devel] indexed search discrepancy
DM Smith
dmsmith at crosswire.org
Sun Aug 30 14:14:20 MST 2009
On Aug 30, 2009, at 4:15 PM, Matthew Talbert <ransom1982 at gmail.com> wrote:
>>> Just out of curiosity, what are the silly transformations?
>>
>> See: http://www.gossamer-threads.com/lists/lucene/java-user/80838
>>
>> Basically, the StandardAnalyzer has a tokenizer that recognizes
>> complex patterns to determine word boundaries. By and large, these
>> transformations (e-mail addresses, host names, ...) won't be found
>> in the Bible. Maybe in commentaries and gen books. But there is a
>> cost of running an expensive analyzer that generally does nothing
>> and occasionally does something unexpected.
>>
>> The SimpleAnalyzer merely looks for word boundaries that are
>> appropriate for English. It is not appropriate for languages that
>> have different punctuation or word boundaries. There are a bunch of
>> contributed analyzers for different languages (e.g. Thai, Chinese)
>> that are more appropriate for them. In the upcoming Lucene 3.0
>> release there will be analyzers for more languages, including
>> Farsi. These could be ported from Java to C++ if they are valuable
>> to SWORD.
>
> But the StandardAnalyzer is no more appropriate for non-English, correct?
It is no more appropriate, and it may well be less so.
> So unless we have the non-English analyzers, then there is no
> value in using the StandardAnalyzer over the simple?
Even with the non-English analyzers there is no value in using the
StandardAnalyzer over the SimpleAnalyzer.
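To make the difference concrete, here is a rough sketch against the
Lucene 2.x Java API (the field name and the sample text are made up,
and the CLucene equivalents should be analogous). StandardAnalyzer
keeps an e-mail address as a single token, while SimpleAnalyzer breaks
it at every non-letter character:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerDemo {
        // Print every token an analyzer produces for the given text.
        static void dump(Analyzer analyzer, String text) throws Exception {
            TokenStream stream =
                analyzer.tokenStream("body", new StringReader(text));
            Token token;
            while ((token = stream.next()) != null) {
                System.out.println(token.termText());
            }
        }

        public static void main(String[] args) throws Exception {
            String text = "Mail dmsmith@crosswire.org about the module.";
            // StandardAnalyzer: mail, dmsmith@crosswire.org, about, module
            // (it also lowercases and drops stop words such as "the").
            dump(new StandardAnalyzer(), text);
            // SimpleAnalyzer: mail, dmsmith, crosswire, org, about, the, module
            dump(new SimpleAnalyzer(), text);
        }
    }

For Bible text neither analyzer will ever hit the e-mail or host-name
rules, so the extra machinery in the StandardAnalyzer buys nothing.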
> clucene is still
> trying to become compatible with Lucene 2 (I think it's largely done,
> but not released yet). If these analyzers are for Lucene 3.0
Most are part of 2.x.
> is it
> possible that it would take substantial work to port them to clucene
> which is still stuck in Lucene 1 compatibility?
I don't think the effort would be much harder than doing an initial
port to the same level. A tokenizer merely takes an input stream,
breaks it up into tokens, and returns one token each time next(...) is
called. What differs between the releases is how next() is
implemented; the algorithm itself is the same. (BTW, I am a Lucene
contributor with respect to tokenizers, so my point is not merely
academic. ;)
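For example, a bare-bones tokenizer in the old next() style looks
roughly like this (the class name and the letter-only boundary rule
are only for illustration; a real port would follow whatever rules the
analyzer being ported uses):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;

    public class LetterOnlyTokenizer extends Tokenizer {
        private int offset = 0;  // characters read so far

        public LetterOnlyTokenizer(Reader reader) {
            super(reader);
        }

        // Return the next token, or null when the stream is exhausted.
        public Token next() throws IOException {
            StringBuffer term = new StringBuffer();
            int start = offset;
            int c;
            while ((c = input.read()) != -1) {
                offset++;
                if (Character.isLetter((char) c)) {
                    if (term.length() == 0) {
                        start = offset - 1;   // the token begins here
                    }
                    term.append(Character.toLowerCase((char) c));
                } else if (term.length() > 0) {
                    break;                    // a non-letter ends the token
                } else {
                    start = offset;           // still skipping separators
                }
            }
            if (term.length() == 0) {
                return null;
            }
            return new Token(term.toString(), start, start + term.length());
        }
    }

Whether the token is handed back through next(), the reusable
next(Token) from 2.x, or the newer attribute-based incrementToken(),
the loop in the middle stays the same; only the way the token is
returned changes.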
In His Service,
DM