[bt-devel] change in search algo
Martin Gruner
mg.pub at gmx.net
Wed Nov 1 02:25:32 MST 2006
Hello everyone,
unfortunately I have to correct myself on this. The Whitespace Analyzer is too
simple. It does not even strip punctuation, which is why ".... elapse." is
not found when searching for "elapse", but only for "elapse*". I changed the
search algo back to use the Standard Analyzer, but now I managed to tell it
not to use stop words. This should work as expected. Please report your
success or failure. The code is in CVS and will be in 1.6.2.
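
To illustrate the difference described above, here is a minimal toy sketch in Python. It is not CLucene's or BibleTime's code; the function names (whitespace_tokens, word_tokens, matches) and the sample text are made up for illustration, and the real StandardAnalyzer does considerably more than this regex.

```python
import re

def whitespace_tokens(text):
    """Toy stand-in for a whitespace analyzer: split on whitespace only,
    so punctuation stays attached to the words."""
    return text.split()

def word_tokens(text, stop_words=frozenset()):
    """Toy stand-in for a standard-style analyzer: lowercase, strip
    punctuation, and drop stop words (none by default)."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    return [w for w in words if w not in stop_words]

def matches(term, tokens):
    """Exact term match, or prefix match if the term ends in '*'."""
    if term.endswith("*"):
        prefix = term[:-1]
        return any(t.startswith(prefix) for t in tokens)
    return term in tokens

# Made-up sample text, not taken from any module.
verse = "the days .... elapse."

ws = whitespace_tokens(verse)   # ['the', 'days', '....', 'elapse.']
std = word_tokens(verse)        # ['the', 'days', 'elapse']

print(matches("elapse", ws))    # False: the token is "elapse."
print(matches("elapse*", ws))   # True: prefix match still works
print(matches("elapse", std))   # True: punctuation was stripped
print(matches("the", std))      # True: no stop words are applied
```

With a non-empty stop word set, e.g. word_tokens(verse, {"the"}), the last query would fail again, which is the behaviour we wanted to get rid of.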
A side note. BibleTime seems to find all hits correctly now. We had an issue
with superfluous hits, but that was not clucene's fault. It found "elapse" in
JFB Dan 8:27, where it does not occur. So I dumped the module with mod2imp,
and found that Dan 9:0 does contain "elapse". So clucene was right, but
BibleTime was wrong. We need to finally fix the Chapter 0 and Verse 0 issue.
Joachim, which places in the code would need to be changed for this? I want
the text of 0:0 to be prepended to 1:1, and the text of X:0 to X:1, for every
book of the Bible. It must not be appended to the previous chapter.
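
The mapping I have in mind could be sketched roughly like this in Python. This is only an illustration under assumptions: the dict-of-(chapter, verse) layout and the name fold_intro_entries are hypothetical and do not reflect how BibleTime or Sword store entries, and the sample text is a placeholder, not the real JFB content.

```python
def fold_intro_entries(verses):
    """Fold intro entries of one book into the following verse.

    `verses` maps (chapter, verse) to text. The text of 0:0 is prepended
    to 1:1 and the text of X:0 is prepended to X:1. Nothing is ever
    appended to the last verse of the previous chapter.
    """
    # Keep all regular verses as they are.
    folded = {key: text for key, text in verses.items() if key[1] != 0}

    # Handle intros from last to first, so that 0:0 ends up in front of
    # 1:0 when both are folded into 1:1.
    intros = sorted((key for key in verses if key[1] == 0), reverse=True)
    for chapter, verse in intros:
        target = (max(chapter, 1), 1)   # 0:0 -> 1:1, X:0 -> X:1
        intro = verses[(chapter, verse)]
        existing = folded.get(target, "")
        folded[target] = f"{intro} {existing}".strip()
    return folded

# Placeholder text only.
daniel = {
    (8, 27): "Last verse of chapter 8.",
    (9, 0): "Chapter intro mentioning elapse.",
    (9, 1): "First verse of chapter 9.",
}

result = fold_intro_entries(daniel)
print(result[(8, 27)])  # Last verse of chapter 8.
print(result[(9, 1)])   # Chapter intro mentioning elapse. First verse of chapter 9.
```

With this mapping, a search for "elapse" would report Dan 9:1 instead of Dan 8:27.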
Another side note: Luke (http://www.getopt.org/luke/), the Lucene Index
Toolbox (a Java program with Web Start), comes in very handy when debugging
Lucene indexes. It showed me that the index for Dan 8:27 contained more text
than expected. It can try to reconstruct a Document even if its full text was
not stored in the index.
Hope everyone is well,
mg
On Saturday, 21 October 2006 21:48, Martin Gruner wrote:
> Hi friends,
>
> today I changed BibleTime's (CVS) search implementation from using the
> StandardAnalyzer to using the WhitespaceAnalyzer. The difference is that
> the StandardAnalyzer applies a set of default English stop words to the
> text being indexed and the queries. That means words like "the", "they" and
> "then" were not found, because they are assumed to produce too many
> results. Within BibleTime, this does not seem acceptable to me, so I changed
> it. The new analyzer just splits the query into words according to the
> whitespace. Everything will be indexed and can be queried. This means the
> index will be slightly bigger, but everything can be found.
>
> Is this ok, or would somebody disagree? Please let me know.
>
> mg
>
>
> P.S. I also improved our own search highlighting a bit to handle "*" more
> correctly. The best solution, however, would be to use clucene for that as
> well...
>