[sword-devel] indexed search discrepancy

Sun Aug 30 13:57:52 MST 2009

On Aug 30, 2009, at 4:07 PM, Matthew Talbert <ransom1982 at gmail.com>  
wrote:

>> I had submitted a patch that did this and it was rejected because it did
>> not preserve backward compatibility without providing a versioning system
>> for each generated index.
>
> If by backward compatibility, you mean that old indexes will still
> work as they always have, then backwards compatibility is being
> preserved (this is how I would interpret it).

This is what I meant. The analyzer is used to tokenize both the text
going into the index and the search request. If the two are not tokenized
the same way, there will be mismatches.

Some examples, using an old index built w/o stopwords and an engine that
preserves them. In the following examples IN is a stopword.

Search a phrase w/ a stop word. "in Christ" will look for all docs
containing both "in" and "Christ" with the first immediately preceding
the second. Since "in" was never put in the old index, nothing will be
found.

Search for the same but not as a phrase. The default action is to find  
all verses that contain either word. This will find all verses with  
Christ and none with In. This is the same as searching for IN OR CHRIST.

If the default is overridden to mean AND or the search is IN AND  
CHRIST then no verses will be found.
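
To make the three cases concrete, here is a toy model in Python. It is not Lucene or CLucene code: the "analyzer" is just lowercasing plus optional stopword removal, the three sample docs are made up, and all the function names are hypothetical. It only shows what happens when the index and the query are tokenized differently.

```python
# Toy model only: one "doc" per verse, analyzer = lowercase + optional
# stopword removal. None of these names come from Lucene or SWORD.
STOPWORDS = {"in"}

def analyze(text, drop_stopwords):
    """Tokenize text; optionally drop stopwords, as the old analyzer did."""
    tokens = text.lower().split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

# Made-up sample docs, for illustration only.
docs = {
    "v1": "that I may be found in Christ",
    "v2": "Christ died for us",
    "v3": "abide in me",
}

# Old index: built by an analyzer that dropped stopwords.
old_index = {ref: analyze(text, drop_stopwords=True) for ref, text in docs.items()}

def search_or(index, query, drop_stopwords):
    """Docs containing at least one query term."""
    terms = analyze(query, drop_stopwords)
    return sorted(r for r, toks in index.items() if any(t in toks for t in terms))

def search_and(index, query, drop_stopwords):
    """Docs containing every query term."""
    terms = analyze(query, drop_stopwords)
    if not terms:
        return []
    return sorted(r for r, toks in index.items() if all(t in toks for t in terms))

def search_phrase(index, query, drop_stopwords):
    """Docs containing the query terms adjacent and in order."""
    terms = analyze(query, drop_stopwords)
    n = len(terms)
    if n == 0:
        return []
    return sorted(
        r for r, toks in index.items()
        if any(toks[i:i + n] == terms for i in range(len(toks) - n + 1))
    )

# New engine keeps stopwords in the query (drop_stopwords=False),
# but searches the old index, which never stored "in".
print(search_phrase(old_index, "in Christ", drop_stopwords=False))
print(search_or(old_index, "in Christ", drop_stopwords=False))
print(search_and(old_index, "in Christ", drop_stopwords=False))
```

With the mismatched analyzer the phrase and AND searches come back empty and the OR search returns only the docs with "Christ". Passing drop_stopwords=True for the query (i.e. the same analyzer that built the index) makes the phrase search behave like a search for "Christ" alone.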

> But new indexes will
> obviously be different than the old ones. If this is what you mean,
> then we really can't change anything in the indexing until some
> versioning scheme is implemented, correct? The recent Hebrew changes
> broke both of these principles: old indexes are unusable (will return
> 0 results for modules that have Hebrew vowels), and new indexes are
> different than the old ones.

IMHO, bugs need to be fixed but in a way that does not compromise good
indexes. Changing the limit is one of those changes. It does not harm
indexes that never hit the limit. The tough part is distinguishing
between indexes that hit the limit and those that did not, and helping
the user fix the problem.

> The changes to the size of the fields
> allowed will do the same thing, although old indexes will still be
> usable (if you call returning 30% of the actual hits usable). I agree
> with the need for versioning (I mentioned it first in this thread :)
> ), but to not fix bugs because of that seems silly.

Agreed. Just need to be careful to preserve BC insofar as possible.
(BTW, you were first in this thread to mention versioning but there
were earlier threads to discuss it. :)

>
>> As to using a simple incrementing number to represent the version of the
>> index, this may not be adequate. It is sufficient if the user has no
>> control over the index and indexes that do not match the version number
>> of the engine are ignored/discarded/automatically upgraded... by the
>> front-end or engine.
>
> I believe we should follow the principle of "do the simplest thing
> that will possibly work". All we need at the moment is a simple
> version number. Everything without version numbers will be presumed to
> be older. In my opinion, if the version number is older than the
> (index) version of the library, then the library should just return
> false when asked if the module has fast search framework (I forget the
> function name). Then the front-end can do whatever it needs in that
> situation. This also has the advantage of not needing a new API.
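
A minimal sketch of the simple-number approach described above, in Python rather than the library's C++. Everything named here is an assumption for illustration: the version file name, the engine's version constant, and has_search_framework(), which only stands in for the library function whose real name is not given in this thread.

```python
import os

# Hypothetical values, not taken from SWORD.
ENGINE_INDEX_VERSION = 2          # index version this build of the library writes
VERSION_FILE = "index.version"    # assumed file stored next to the index

def read_index_version(index_dir):
    """Return the stored index version; 0 if missing or unreadable.

    An index without a version number is presumed to be older.
    """
    path = os.path.join(index_dir, VERSION_FILE)
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0

def has_search_framework(index_dir):
    """Stand-in for the library's 'has fast search' check.

    Returns False when the index is older than the library expects, so the
    front-end can decide what to do (rebuild, warn, fall back). What to do
    with an index newer than the library is not discussed in the thread;
    this sketch simply accepts it.
    """
    if not os.path.isdir(index_dir):
        return False
    return read_index_version(index_dir) >= ENGINE_INDEX_VERSION
```

The point of doing it this way is that the front-end needs no new API: it already has to handle a module with no index.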

I suggest we plan for the future and implement for the present. A
simple number is not sufficient for the future. A versioned list of
features would be. An ini file w/ a list of features would work well,
e.g.
[index]
lucene=1.4.3
StandardAnalyzer=2
Notes=1
Headings=1
...
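
Reading such a file and comparing it to what the engine expects could look roughly like the following Python sketch. The feature names and values are just the ones from the example above; ENGINE_FEATURES and both function names are hypothetical, and a real implementation would have to decide what each mismatch means (rebuild, ignore, warn).

```python
import configparser

# What this (hypothetical) engine build would write into a new index.
ENGINE_FEATURES = {
    "lucene": "1.4.3",
    "StandardAnalyzer": "2",
    "Notes": "1",
    "Headings": "1",
}

def read_index_features(ini_text):
    """Parse the [index] section into a {feature: version} dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. "StandardAnalyzer"
    parser.read_string(ini_text)
    if not parser.has_section("index"):
        return {}  # no feature list at all: treat as an old index
    return dict(parser.items("index"))

def mismatched_features(index_features, engine_features):
    """Names of features whose version is missing or differs in the index."""
    return sorted(
        name for name, version in engine_features.items()
        if index_features.get(name) != version
    )

# An index built by an older engine: older analyzer, no Headings entry.
sample = """
[index]
lucene=1.4.3
StandardAnalyzer=1
Notes=1
"""

print(mismatched_features(read_index_features(sample), ENGINE_FEATURES))
```

For the sample this reports StandardAnalyzer and Headings as the features that differ, which is more than a single number could tell the front-end.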

>
>> Give the user any control over the index or provide the front-end any
>> indication of what is in the index and it is not sufficient. Further,
>> once we get to analyzers per language each feature needs a version
>> number as well.
>>
>> Very messy.
>
> Yes, but we're not there today. Considering that currently none of the
> non-English analyzers are ported to C++, to not do something now, or
> to design a complicated system based on functionality that may never
> arrive, seems backwards.
>
>> The solution we have for BibleDesktop/JSword is to just let the user
>> know that if search does not perform as expected to delete the index and
>> rebuild it. Not at all a good solution, but we've not had any complaints.
>
> The best solution is not always the most technically correct solution.
> As above, many times it's the simplest solution that is best.
>
> Matthew

That's why JSword hasn't tackled it yet (we have the beginnings of an  
implementation) and why I submitted a patch to SWORD that didn't have  
versioning.

But it was rejected. Maybe this time is different.

In Christ,
DM