[jsword-devel] Lucene search bug
DM Smith
dmsmith555 at yahoo.com
Tue Jan 25 21:55:38 MST 2005
I had noted earlier that when I searched on "bread" in the KJV, I only
got about 20 hits.
I have been looking into what is happening.
In doing so I found a bug which at first I thought might have been
related. Seems that the call
BookData data = book.getData(subkey);
String text = data.getPlainText();
returns the verse reference butted up against the verse text, as in:
Gen 1.1In the beginning God created the heavens and the earth.....
Turns out that the document is something like:
<div>
<title>Gen 1.1</title>
<verse>In the beginning...</verse>
</div>
(this is leaving out attributes and other details)
getPlainText() concatenates the text of every child of the div element.
It seems to me that it should do so only for verse text, but the code
does not distinguish verse text from the text of a title, note,
footnote or other non-verse element.
How should it be? (In my copy, I have it skipping the title element.)
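Here is a minimal sketch of the "skip the title" idea, so the question
is concrete. It is not the JSword code: it uses the JDK's own DOM
parser, and the class and method names (VerseTextSketch, collect,
plainTextWithoutTitles) are made up for illustration. It assumes the
simplified document shown above, without attributes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JSword.
class VerseTextSketch {

    // Append the text found under node, ignoring any <title> subtree.
    static void collect(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equals(node.getNodeName())) {
            return; // skip the verse reference
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    // Parse the XML string and return its text with titles left out.
    static String plainTextWithoutTitles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        collect(doc.getDocumentElement(), out);
        return out.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div><title>Gen 1.1</title>"
                + "<verse>In the beginning...</verse></div>";
        System.out.println(plainTextWithoutTitles(xml)); // In the beginning...
    }
}
```

The same filter could be extended to notes and footnotes by adding
their element names to the check, if that is the behavior we want.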
Anyway, enough with that digression from the indexing problem. I set a
breakpoint that triggered when the verse contained "bread" and found
that the data was in fact reaching the indexer.
Looking at the verses, it seemed that each contained "bread" more than
once. This sent me down the wrong path of checking whether words were
only being indexed when they occurred multiple times in a verse.
I then ran a bunch of searches on common words (Lord, God, Jesus, bread,
...) and none of them came back with more than 21 verses. Also, after I
removed the leading verse reference and then deleted and regenerated the
index, the same searches returned a different set of 20 verses.
I think what is happening is that the search is not returning an
exhaustive answer, but is trying to come up with the top 20.
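
One way to confirm this would be to count the matching verses with a
plain linear scan and compare that number with the number of hits the
index returns. The sketch below shows only the scan. The class and
method names are made up, the three verses are a tiny hand-typed
sample, and the index search call is deliberately left out because I
have not shown it here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical cross-check, not part of JSword or Lucene.
class ExhaustiveCheckSketch {

    // Return the references of all verses that contain word as a whole token.
    static List<String> versesContaining(Map<String, String> verses, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<String> refs = new ArrayList<>();
        for (Map.Entry<String, String> e : verses.entrySet()) {
            String text = e.getValue().toLowerCase(Locale.ROOT);
            // Split on non-letters so punctuation does not hide a match.
            for (String token : text.split("[^a-z]+")) {
                if (token.equals(needle)) {
                    refs.add(e.getKey());
                    break; // count each verse once
                }
            }
        }
        return refs;
    }

    public static void main(String[] args) {
        // Small illustrative sample; a real check would walk the whole book.
        Map<String, String> verses = new LinkedHashMap<>();
        verses.put("Gen 1.1", "In the beginning God created the heaven and the earth.");
        verses.put("Gen 3.19", "In the sweat of thy face shalt thou eat bread,");
        verses.put("Matt 4.4", "Man shall not live by bread alone,");

        List<String> refs = versesContaining(verses, "bread");
        System.out.println(refs.size() + " verses: " + refs);
    }
}
```

If the scan over the whole KJV finds far more verses than the search
does, that would support the idea that the search is capping its
results rather than the indexer dropping words.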