[sword-devel] Searching and Lucene thoughts

Wed Mar 2 16:48:22 MST 2005

On 2 Mar 2005, at 12:45 am, DM Smith wrote:
> Can we enumerate what Lucene does not support that we want for 
> Biblical searching?
>
> The only thing I saw was that it did not find adjacent documents. For 
> example, find all verses containing Moses within 5 verses of Aaron.
>
> As long as we build the index from first verse to last verse, the 
> index that lucene returns is the number lucene assigned to the verse 
> when the verse was added. We cannot reliably use this to figure out 
> which verse is returned: 3 may or may not mean Genesis 1:3 (in an 
> NT-only module it would mean Matthew 1:3). For this reason we have 
> stored the OSIS reference in the index along with the verse. 
> However, we can be certain (because lucene guarantees it) that 
> index 25 and index 26 are two verses that were added one after the 
> other.
>
> To do proximity searching, we probably have to parse the search 
> request for a special w/in conjunction, take each part and do 
> separate queries, and, via post-processing, put the results together.
>
> Has anyone thought of another way?

Here are some things Accordance does. It all seems overcomplicated to 
me (I can't see how some of the features would ever be used other 
than for tedious academic research):

It can search within: verse, chapter, clause, sentence, paragraph, book
You can specify tags for: stem, aspect, person, gender, number, state
Examples:
creat* <FOLLOWED BY> <WITHIN 10 WORDS> earth <NOT> made
[VERB perfect] @~~~ (hebrew chars)

The only thing, afaik, that lucene won't do for us with a bit of work 
is multi-document searching. Searching across verses is confusing; 
the only constructs that make sense are proximity constructs. Looking 
at the source for lucene, I *might* actually be able to do this. I'll 
get back to you on this.
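
In the meantime, the post-processing approach DM describes can be 
sketched roughly as follows. This is only an illustration in plain Java, 
not lucene code: it assumes each sub-query has already been run and its 
hits are available as ascending index positions (the order in which the 
verses were added). The class name, method name and sample numbers are 
all made up.

```java
import java.util.ArrayList;
import java.util.List;

class ProximitySketch {
    // Both arrays hold index positions of matching verses, in ascending order.
    // Returns every hit from `first` that has some hit from `second`
    // no more than `window` verses away (before or after).
    static List<Integer> within(int[] first, int[] second, int window) {
        List<Integer> out = new ArrayList<Integer>();
        int j = 0;
        for (int a : first) {
            // Skip hits in `second` that lie too far before `a`.
            while (j < second.length && second[j] < a - window) {
                j++;
            }
            // second[j] is now the earliest candidate; check it is not too far after.
            if (j < second.length && second[j] <= a + window) {
                out.add(a);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] moses = {3, 40, 100};      // hypothetical hits for "Moses"
        int[] aaron = {7, 60, 104, 300}; // hypothetical hits for "Aaron"
        System.out.println(within(moses, aaron, 5)); // prints [3, 100]
    }
}
```

Because both lists are sorted, one pass over each is enough. The stored 
OSIS reference would still be needed to turn the surviving positions 
back into verse names.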

>> Troy: you asked for my code to access index order, I can give you 
>> java code, but clucene doesn't support it yet. There seem to be many 
>> areas where clucene is lagging far behind lucene. For example, 
>> sorting, which is essential to do within lucene for fast searching.
>>
> I would be interested in the Java code, if you don't mind.

I don't access it as such; I just pass the index sorter to the 
searcher, e.g. s.search(query, Sort.INDEXORDER). I'm not sure how to 
access the id itself.

> <snip/>
>> Restricting of searches:
>> Again, another area that is essential for speed to do in lucene. I 
>> haven't figured this one out yet, but I'm thinking I will write a 
>> custom lucene filter, which would be much faster if I stored the 
>> verse as an index and then produced a set of numerical ranges. For 
>> searching in the previous results, you should (I've been told) simply 
>> AND the searches together. I don't support these yet, and it is 
>> probably quite some work. It would probably only take 10s of 
>> searching time to retrofit it on top of lucene, but that is 10s on 
>> top of nothing.
>
> The search speed of lucene is fast enough that restricting the search 
> is not necessary. Using the BitSet does not add appreciable time. It 
> is easy enough to create a mask and AND that with the search results 
> to get the restricted answer set.

How do you use your BitSet? At the moment I like that I don't access 
the document information at all until it is displayed. This means I 
can do live searching (as the user types), even for large searches 
like "and".
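
For reference, here is my rough understanding of the mask-and-AND idea, 
combined with the numerical ranges I mentioned. It is a sketch using only 
java.util.BitSet, under the assumption that the search hits are already 
available as a BitSet of index positions; the class, the method names and 
the ranges are invented for illustration and are not part of lucene.

```java
import java.util.BitSet;

class RestrictSketch {
    // Builds a mask with a bit set for every index position in [from, to].
    static BitSet rangeMask(int from, int to) {
        BitSet mask = new BitSet();
        mask.set(from, to + 1); // the upper bound of BitSet.set is exclusive
        return mask;
    }

    // Returns only the hits that fall inside the mask; inputs are not modified.
    static BitSet restrict(BitSet hits, BitSet mask) {
        BitSet result = (BitSet) hits.clone();
        result.and(mask);
        return result;
    }

    public static void main(String[] args) {
        BitSet hits = new BitSet();
        for (int i : new int[] {2, 25, 26, 900}) {
            hits.set(i); // hypothetical search results
        }
        BitSet mask = rangeMask(20, 30);  // hypothetical restriction range
        mask.or(rangeMask(850, 1000));    // a second range, ORed into the mask
        System.out.println(restrict(hits, mask)); // prints {25, 26, 900}
    }
}
```

Searching within previous results would be the same operation, with the 
earlier result set used as the mask.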

>> Other stuff:
>> Fuzzy searches are neat: "abraham~" finds abram and abraham; 
>> "hezikia~" finds hezekiah. Really useful for bad spellers and all 
>> those ridiculously impossible to spell bible names.
>>     To highlight searches, you can get lucene to give you a list of 
>> words for a search. You can then highlight all of these words in the 
>> verse.
>
> I saw your other post on fuzzy match and would like to know how you 
> got the words that were hit out of lucene.

Have a look in lucene/contrib/highlighter/ ... /QueryTermExtractor.java; 
I just cut the useful bits from it.
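
Once you have the list of words, the highlighting step itself is simple. 
A minimal sketch in plain Java, assuming the matched words have already 
been extracted from the query by whatever means; the class and method 
are hypothetical, and the "*" markers stand in for whatever markup the 
display actually uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class HighlightSketch {
    // Wraps every whole-word, case-insensitive occurrence of a term in '*'.
    static String highlight(String verse, List<String> terms) {
        if (terms.isEmpty()) {
            return verse;
        }
        StringBuilder alternatives = new StringBuilder();
        for (String term : terms) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(term)); // treat the term literally
        }
        Pattern p = Pattern.compile("\\b(" + alternatives + ")\\b",
                                    Pattern.CASE_INSENSITIVE);
        return p.matcher(verse).replaceAll("*$1*");
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("jesus", "wept");
        System.out.println(highlight("Jesus wept.", terms)); // prints *Jesus* *wept*.
    }
}
```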

>> IMO rarely do people want to do OR searches, so I changed the default 
>> to AND in the lucene version used by MacSword. This means >>jesus 
>> wept<< is ANDed, >>jesus OR wept<< is ORed, and >>"jesus wept"<< is 
>> the phrase. Other than that the lucene syntax makes sense.
>
> On a project that I did, we found that people wanted to do phrase 
> searching even more than AND, and AND more than OR, unless they were 
> doing "natural language" querying.
> It might be nice to set it as a preference.

I thought that would be the case; however, from a syntax point of view 
I found having phrases as the default confusing. I think 
>"jesus is" +god< is clearer than >jesus is +god<, but there is 
possibly a better way. I don't think having a preference is a good 
idea though. I like having one syntax: it is one less thing for the 
user and it simplifies my search windows.
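
To make the three cases concrete, here is a toy matcher in plain Java. 
It is not lucene's query parser and ignores operators such as + and ~; 
it only assumes that bare words are ANDed, that the token OR switches the 
whole query to "any term", and that a quoted string is a phrase. Matching 
is a crude lower-cased substring test, and every name in it is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefaultAndSketch {
    // A token is either a double-quoted phrase or a run of non-space characters.
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static boolean matches(String verse, String query) {
        String text = verse.toLowerCase(Locale.ROOT);
        List<String> terms = new ArrayList<String>();
        boolean anyOr = false;
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            if (m.group(1) != null) {
                terms.add(m.group(1).toLowerCase(Locale.ROOT)); // phrase, kept whole
            } else if (m.group(2).equals("OR")) {
                anyOr = true; // toy rule: one OR makes the whole query an OR
            } else {
                terms.add(m.group(2).toLowerCase(Locale.ROOT));
            }
        }
        int found = 0;
        for (String term : terms) {
            if (text.contains(term)) {
                found++;
            }
        }
        return anyOr ? found > 0 : found == terms.size();
    }

    public static void main(String[] args) {
        System.out.println(matches("jesus said", "jesus wept"));          // false
        System.out.println(matches("jesus said", "jesus OR wept"));       // true
        System.out.println(matches("jesus then wept", "\"jesus wept\"")); // false
        System.out.println(matches("jesus wept", "\"jesus wept\""));      // true
    }
}
```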