[sword-devel] Searching and Lucene thoughts
Will Thimbleby
will at thimbleby.net
Wed Mar 2 16:48:22 MST 2005
On 2 Mar 2005, at 12:45 am, DM Smith wrote:
> Can we enumerate what Lucene does not support that we want for
> Biblical searching?
>
> The only thing I saw was that it did not find adjacent documents. For
> example, find all verses containing Moses within 5 verses of Aaron.
>
> As long as we build the index from first verse to last verse, the
> index that lucene returns is the number lucene assigned to the verse
> when the verse was added. We cannot reliably use this to figure out
> what verse is returned (e.g. 3 may or may not mean Genesis 1:3. For
> example, in a NT only module it would mean Matthew 1:3), for this
> reason we have stored the OSIS reference in the index along with the
> verse. However, we can be certain (because Lucene guarantees it) that
> index 25 and index 26 are two verses that were added one after the
> other.
>
> To do proximity searching, we probably have to parse the search
> request for a special w/in conjunction, take each part and do
> separate queries, and via post-processing, put the results together.
>
> Has anyone thought of another way?
Here are some things Accordance does. It all seems overcomplicated to
me; I can't see how some of the features would ever be used for
anything other than tedious academic research.
It can search within: verse, chapter, clause, sentence, paragraph, book
You can specify tags for: stem, aspect, person, gender, number, state
Examples:
creat* <FOLLOWED BY> <WITHIN 10 WORDS> earth <NOT> made
[VERB perfect] @~~~ (hebrew chars)
The only thing, AFAIK, that Lucene won't do for us with a bit of work is
multi-document searching. Searching across verses is confusing;
the only constructs that make sense are proximity constructs. Looking
at the source for Lucene, I *might* actually be able to do this. I'll
get back to you on this.
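
A minimal sketch of the post-processing idea for "Moses within 5 verses of
Aaron", assuming each term's hits are already available as a
java.util.BitSet keyed by index position, and that verses were indexed in
canonical order with no gaps. ProximitySketch and within() are
hypothetical names, not part of Lucene, and the sketch ignores book and
chapter boundaries.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: combines two hit sets by position.
final class ProximitySketch {

    /** Returns the ids in a that lie within window positions of some id in b. */
    static BitSet within(BitSet a, BitSet b, int window) {
        BitSet result = new BitSet();
        for (int id = a.nextSetBit(0); id >= 0; id = a.nextSetBit(id + 1)) {
            // First hit of b at or after the start of the window around id.
            int nearest = b.nextSetBit(Math.max(0, id - window));
            if (nearest >= 0 && nearest <= id + window) {
                result.set(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet moses = new BitSet();
        moses.set(10);
        moses.set(40);
        moses.set(100);
        BitSet aaron = new BitSet();
        aaron.set(14);
        aaron.set(46);
        System.out.println(within(moses, aaron, 5)); // prints {10}
    }
}
```

Each term is still searched separately; only the combination step is
sketched here.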
>> Troy: you asked for my code to access index order, I can give you
>> java code, but clucene doesn't support it yet. There seem to be many
>> areas where clucene is lagging far behind lucene. For example,
>> sorting, which is essential to do in lucene for fast searching.
>>
> I would be interested in the Java code, if you don't mind.
I don't access it as such; I just pass the index-order sort to the
searcher, e.g. s.search(query, Sort.INDEXORDER). I'm not sure how to
access the id itself.
> <snip/>
>> Restricting of searches:
>> Again another area that is essential for speed to do in lucene. I
>> haven't figured this one out yet, but I'm thinking I will write a
>> custom lucene filter. Which would be much faster if I stored the
>> verse as an index, and then produced a set of numerical ranges. For
>> searching in the previous results, you should (I've been told) simply
>> AND the searches together. I don't support these yet, and it is
>> probably quite some work. It would probably only take 10s of
>> searching time to retrofit it on top of lucene, but that is 10s on top
>> of nothing.
>
> The search speed of lucene is fast enough that restricting the search
> is not necessary. Using the BitSet does not add appreciable time. It
> is easy enough to create a mask and AND that with the search results
> to get the restricted answer set.
How do you use your BitSet? At the moment I like that I don't
access the document information at all until it is displayed. This
means I can do live searching (as the user types), even for large
searches like "and".
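
For reference, the mask-and-AND restriction discussed above could look
roughly like this. It is only a sketch, assuming the hits are already in a
java.util.BitSet keyed by index position and that the restriction can be
expressed as inclusive ranges of those positions. RestrictSketch, maskOf()
and restrict() are hypothetical names, not Lucene API.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: restricts hits to position ranges.
final class RestrictSketch {

    /** Builds a mask with every position in each inclusive {start, end} range set. */
    static BitSet maskOf(int[][] ranges) {
        BitSet mask = new BitSet();
        for (int[] range : ranges) {
            mask.set(range[0], range[1] + 1); // BitSet.set(from, to) excludes 'to'
        }
        return mask;
    }

    /** Returns the hits that fall inside the mask, leaving both inputs unchanged. */
    static BitSet restrict(BitSet hits, BitSet mask) {
        BitSet out = (BitSet) hits.clone();
        out.and(mask);
        return out;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        hits.set(3);
        hits.set(25);
        hits.set(26);
        hits.set(900);
        BitSet mask = maskOf(new int[][] {{20, 30}});
        System.out.println(restrict(hits, mask)); // prints {25, 26}
    }
}
```

Searching within previous results is the same operation with the earlier
result set used as the mask.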
>> Other stuff:
>> Fuzzy searches are neat "abraham~" finds abram and abraham;
>> "hezikia~" finds hezekiah. Really useful for bad spellers and all
>> those ridiculously impossible to spell bible names.
>> To highlight searches, you can get lucene to give you a list of
>> words for a search. You can then highlight all of these words in the
>> verse.
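
Lucene's fuzzy matching is based on Levenshtein edit distance. The sketch
below is not Lucene's code and does not reproduce how it turns a distance
into a similarity score; it only shows, with a hypothetical FuzzySketch
class, how few edits separate the example spellings.

```java
// Hypothetical illustration, not Lucene code: plain Levenshtein edit distance.
final class FuzzySketch {

    /** Number of single-character inserts, deletes or substitutions turning a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // building b's prefix from an empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting all of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("abraham", "abram"));    // prints 2
        System.out.println(editDistance("hezikia", "hezekiah")); // prints 2
    }
}
```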
>
> I saw your other post on fuzzy match and would like to know how you
> got the words that were hit out of lucene.
Have a look in lucene/contrib/highlighter/ ... /QueryTermExtractor.java
I just cut the useful bits from it.
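
Once you have the list of matched words, the highlighting step itself can
be quite small. This is a sketch under assumptions, not the
QueryTermExtractor code: HighlightSketch and highlight() are hypothetical
names, the terms are assumed to be plain words exactly as they appear in
the verse text, and the <b> markup is only a placeholder for whatever the
display uses.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: wraps matched words in <b> tags.
final class HighlightSketch {

    /** Returns verse with every whole-word, case-insensitive match of a term wrapped in <b>. */
    static String highlight(String verse, Collection<String> terms) {
        if (terms.isEmpty()) {
            return verse;
        }
        StringBuilder alternation = new StringBuilder();
        for (String term : terms) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(term)); // treat each term literally
        }
        Pattern pattern = Pattern.compile("\\b(" + alternation + ")\\b",
                                          Pattern.CASE_INSENSITIVE);
        return pattern.matcher(verse).replaceAll("<b>$1</b>");
    }

    public static void main(String[] args) {
        System.out.println(highlight("Jesus wept.", Arrays.asList("jesus", "wept")));
        // prints <b>Jesus</b> <b>wept</b>.
    }
}
```

If the index stores stemmed or otherwise normalised terms, they would need
to be mapped back to the surface words before this step.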
>> IMO rarely do people want to do OR searches, so I changed the default
>> to AND in the lucene version used by MacSword. This means >>jesus
>> wept<< is ANDed, >>jesus OR wept<< is ORed, and >>"jesus wept"<< is
>> the phrase. Other than that the lucene syntax makes sense.
>
> On a project that I did we found that people wanted to do phrase
> searching even more than AND and AND more than OR, unless they were
> doing "natural language" querying.
> It might be nice to set it as a preference.
I thought that would be the case; however, from a syntax point of view I
found having phrases as the default confusing. I think >"jesus is" +god< is
clearer than >jesus is +god<, but there is possibly a better way. I
don't think having a preference is a good idea though. I like having
one syntax: it is one less thing for the user, and it simplifies my
search windows.