[bt-devel] change in search algo

Martin Gruner mg.pub at gmx.net
Wed Nov 1 02:25:32 MST 2006


Hello everyone,

unfortunately I have to correct myself on this. The Whitespace Analyzer is too 
simple. It does not even strip punctuation, which is why ".... elapse." is 
not found when searching for "elapse", but only for "elapse*". I changed the 
search algo back to use the Standard Analyzer, but now I managed to tell it 
not to use stop words. This should work as expected. Please report your 
success or failure. The code is in CVS and will be in 1.6.2.
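
To illustrate the difference, here is a minimal Python sketch. It is not 
Lucene or clucene code, and the function names, the sample sentence and the 
three-word stop list are made up for the example. It only mimics the two 
behaviours described above: splitting on whitespace alone, versus stripping 
punctuation with stop words switched off.

```python
import re
from fnmatch import fnmatchcase

# Illustrative subset only; NOT the real default stop word list.
STOP_WORDS = frozenset({"the", "they", "then"})


def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to the words."""
    return text.split()


def standard_like_tokens(text, stop_words=frozenset()):
    """Lowercase, keep runs of letters/digits, then drop any stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in stop_words]


def matches(query, tokens):
    """True if any token matches the query; '*' acts as a wildcard."""
    return any(fnmatchcase(token, query) for token in tokens)


verse = "Then the days would elapse."           # made-up sample text
ws = whitespace_tokens(verse)                    # last token is "elapse."
std = standard_like_tokens(verse)                # no stop words applied
std_stop = standard_like_tokens(verse, STOP_WORDS)

print(matches("elapse", ws))     # False: the token is "elapse." with the dot
print(matches("elapse*", ws))    # True: the wildcard swallows the dot
print(matches("elapse", std))    # True: punctuation was stripped
print(matches("the", std), matches("the", std_stop))  # True False
```

With an empty stop word set, "the" stays searchable and "elapse" is found 
without a wildcard, which is the combination I was after.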

A side note: BibleTime seems to find all hits correctly now. We had an issue 
with superfluous hits, but that was not clucene's fault. The search reported 
"elapse" in JFB Dan 8:27, where the word does not occur. So I dumped the 
module with mod2imp and found that Dan 9:0 does contain "elapse", and that 
text apparently ends up attached to the previous verse. So clucene was right, 
but BibleTime was wrong. We need to finally fix the Chapter 0 and Verse 0 
issue.

Joachim, which places in the code would need to be changed for this? I want 
0:0 to be prepended to 1:1 and X:0 to X:1 for every book of the Bible. It 
must not be appended to the previous chapter.
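
To make the intended mapping concrete, here is a rough Python sketch. It is 
not BibleTime or Sword code; the function name, the dict layout and the 
placeholder texts are assumptions made only for this example. Each book is 
modelled as a dict from (chapter, verse) to text.

```python
def fold_intro_entries(entries):
    """Merge chapter 0 / verse 0 text into the first real verse after it.

    entries maps (chapter, verse) -> text for a single book.
    Returns a new dict without any chapter-0 or verse-0 keys.
    """
    folded = {}
    # Sorting puts 0:0 first and each X:0 directly before X:1, so intro
    # text always reaches the target key before the verse text does.
    for (chapter, verse), text in sorted(entries.items()):
        if chapter == 0:
            target = (1, 1)          # book intro goes in front of 1:1
        elif verse == 0:
            target = (chapter, 1)    # chapter intro goes in front of X:1
        else:
            target = (chapter, verse)
        if target in folded:
            folded[target] = folded[target] + " " + text
        else:
            folded[target] = text
    return folded


# Placeholder texts, not the real module content.
book = {
    (0, 0): "Book intro.",
    (1, 1): "First verse.",
    (8, 27): "Last verse of chapter 8.",
    (9, 0): "Chapter 9 intro mentioning elapse.",
    (9, 1): "First verse of chapter 9.",
}
folded = fold_intro_entries(book)
print(folded[(8, 27)])   # unchanged, nothing appended from 9:0
print(folded[(9, 1)])    # chapter intro followed by the verse text
```

The point is that 8:27 stays untouched and a hit in the 9:0 text is reported 
for 9:1.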

Another side note: Luke (http://www.getopt.org/luke/), the Lucene Index 
Toolbox (a Java program with Web Start), comes in very handy when debugging 
Lucene indexes. It showed me that the index for Dan 8:27 contained more text 
than expected. It can try to reconstruct a Document even if its full text was 
not stored in the index.

Hope everyone is well,

mg



On Saturday, 21 October 2006 21:48, Martin Gruner wrote:
> Hi friends,
>
> today I changed BibleTime's (CVS) search implementation from using the
> StandardAnalyzer to using the WhitespaceAnalyzer. The difference is that
> the StandardAnalyzer applies a set of default English stop words to the
> text being indexed and the queries. That means words like "the", "they" and
> "then" were not found, because they are assumed to produce too many
> results. Within BibleTime, this does not seem acceptable to me, so I
> changed it. The new analyzer just splits the query into words at
> whitespace. Everything will be indexed and can be queried. This means the
> index will be slightly bigger, but everything can be found.
>
> Is this ok, or would somebody disagree? Please let me know.
>
> mg
>
>
> P.S. I also improved our own search highlighting a bit to handle "*" more
> correctly. The best solution, however, would be to use clucene for that as
> well...
>
> _______________________________________________
> bt-devel mailing list
> bt-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/bt-devel


