[sword-devel] indexed search discrepancy
DM Smith
dmsmith at crosswire.org
Sun Aug 30 13:57:52 MST 2009
On Aug 30, 2009, at 4:07 PM, Matthew Talbert <ransom1982 at gmail.com>
wrote:
>> I had submitted a patch that did this and it was rejected because it
>> did not preserve backward compatibility without providing a versioning
>> system for each generated index.
>
> If by backward compatibility, you mean that old indexes will still
> work as they always have, then backwards compatibility is being
> preserved (this is how I would interpret it).
This is what I meant. The analyzer is used to tokenize both the text
going into the index and the search request. If both are not tokenized
the same there will be mismatches.
Some examples, using an old index built w/o stopwords and an engine
that preserves them. In the following examples IN is a stopword.
Search for a phrase w/ a stopword. "in Christ" will look for all verses
containing both "in" and "Christ" with the first immediately preceding
the second. Since "in" was never put into the old index, no verses will
be found.
Search for the same but not as a phrase. The default action is to find
all verses that contain either word. This will find all verses with
Christ and none with In. This is the same as searching for IN OR CHRIST.
If the default is overridden to mean AND, or the search is IN AND
CHRIST, then no verses will be found.
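A rough sketch of the three cases above, in Python rather than against the real Lucene/CLucene API. Everything here is made up for illustration: the toy analyzer, the STOPWORDS set, the three sample "verses" and the search helpers. The only point is that the index was built with stopwords removed while the query keeps them.

```python
STOPWORDS = {"in"}  # assumption: IN is the only stopword, as in the example

def analyze(text, drop_stopwords):
    """Toy analyzer: lowercase, split on whitespace, optionally drop stopwords."""
    tokens = [w.strip('.,;:"').lower() for w in text.split()]
    tokens = [t for t in tokens if t]
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def build_index(verses, drop_stopwords):
    """Map token -> {verse ref -> [positions]}."""
    index = {}
    for ref, text in verses.items():
        for pos, tok in enumerate(analyze(text, drop_stopwords)):
            index.setdefault(tok, {}).setdefault(ref, []).append(pos)
    return index

def search_or(index, terms):
    hits = set()
    for t in terms:
        hits |= set(index.get(t, {}))
    return hits

def search_and(index, terms):
    sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_phrase(index, terms):
    """Verses where the terms occur at consecutive positions."""
    hits = set()
    for ref in search_and(index, terms):
        starts = index[terms[0]][ref]
        if any(all(p + i in index[t][ref] for i, t in enumerate(terms))
               for p in starts):
            hits.add(ref)
    return hits

# Hypothetical sample text, not real module content.
VERSES = {
    "A": "We are one body in Christ",
    "B": "Christ died for us",
    "C": "In the beginning",
}

old_index = build_index(VERSES, drop_stopwords=True)   # old index: no stopwords
query = analyze("in Christ", drop_stopwords=False)     # engine keeps stopwords

print(search_phrase(old_index, query))  # phrase search
print(search_or(old_index, query))      # default OR
print(search_and(old_index, query))     # AND
```

With the old index, the phrase and AND searches come back empty and the OR search only returns the verses containing Christ. Rebuilding the index with the same analyzer as the query (drop_stopwords=False on both sides) makes verse A match in all three cases.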
> But new indexes will
> obviously be different than the old ones. If this is what you mean,
> then we really can't change anything in the indexing until some
> versioning scheme is implemented, correct? The recent Hebrew changes
> broke both of these principles: old indexes are unusable (will return
> 0 results for modules that have Hebrew vowels), and new indexes are
> different than the old ones.
IMHO, bugs need to be fixed but in a way that does not compromise good
indexes. Changing the limit is one of those changes. It does not harm
indexes that never hit the limit. The tough part is distinguishing
between the two and helping the user fix the problem.
> The changes to the size of the fields
> allowed will do the same thing, although old indexes will still be
> usable (if you call returning 30% of the actual hits usable). I agree
> with the need for versioning (I mentioned it first in this thread :)
> ), but to not fix bugs because of that seems silly.
Agreed. We just need to be careful to preserve BC insofar as possible.
(BTW, you were first in this thread to mention versioning but there
were earlier threads to discuss it. :)
>
>> As to using a simple incrementing number to represent the version of
>> the index, this may not be adequate. It is sufficient if the user has no
>> control over the index and indexes that do not match the version number
>> of the engine are ignored/discarded/automatically upgraded... by the
>> front-end or engine.
>
> I believe we should follow the principle of "do the simplest thing
> that will possibly work". All we need at the moment is a simple
> version number. Everything without version numbers will be presumed to
> be older. In my opinion, if the version number is older than the
> (index) version of the library, then the library should just return
> false when asked if the module has fast search framework (I forget the
> function name). Then the front-end can do whatever it needs in that
> situation. This also has the advantage of not needing a new API.
I suggest we plan for the future and implement for the present. A
simple number is not sufficient for the future. A versioned list of
features would be. An ini file w/ a list of features would work well,
e.g.:
[index]
lucene=1.4.3
StandardAnalyzer=2
Notes=1
Headings=1
...
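To make the idea concrete, here is a small sketch of how an engine might compare such a feature list against what it would itself produce. It is Python for brevity, not SWORD or JSword code. ENGINE_FEATURES, the function names and the rule "any mismatch means the index is not usable" are all assumptions for illustration, and the elided "..." features are left out rather than guessed.

```python
import configparser

# Assumption: the feature versions the running engine would write today.
ENGINE_FEATURES = {
    "lucene": "1.4.3",
    "StandardAnalyzer": "2",
    "Notes": "1",
    "Headings": "1",
}

def read_index_features(text):
    """Parse the [index] section of a feature file into a dict."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case (StandardAnalyzer, Notes, ...)
    parser.read_string(text)
    if not parser.has_section("index"):
        return {}  # no feature list at all: treat as an older index
    return dict(parser["index"])

def mismatched_features(index_features, engine_features):
    """Names of engine features the index lacks or has at another version."""
    return sorted(name for name, version in engine_features.items()
                  if index_features.get(name) != version)

def index_is_usable(text):
    """Hypothetical policy: usable only if every engine feature matches."""
    return not mismatched_features(read_index_features(text), ENGINE_FEATURES)
```

A front-end could call mismatched_features() to tell the user which part of the index is out of date, instead of only getting a yes/no answer as it would with a single version number.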
>
>> Give the user any control over the index or provide the front-end any
>> indication of what is in the index and it is not sufficient. Further,
>> once we get to analyzers per language each feature needs a version
>> number as well.
>>
>> Very messy.
>
> Yes, but we're not there today. Considering that currently none of the
> non-English analyzers are ported to C++, to not do something now, or
> to design a complicated system based on functionality that may never
> arrive, seems backwards.
>
>> The solution we have for BibleDesktop/JSword is to just let the user
>> know that if search does not perform as expected to delete the index
>> and rebuild it. Not at all a good solution, but we've not had any
>> complaints.
>
> The best solution is not always the most technically correct solution.
> As above, many times it's the simplest solution that is best.
>
> Matthew
That's why JSword hasn't tackled it yet (we have the beginnings of an
implementation) and why I submitted a patch to SWORD that didn't have
versioning.
But it was rejected. Maybe this time is different.
In Christ,
DM