[jsword-devel] Searching

Joe Walker jsword-devel@crosswire.org
Mon, 28 Apr 2003 05:05:07 +0100


Eric Galluzzo wrote:

> Well, I have basically none; however, couldn't we just use something
> like Lucene (http://jakarta.apache.org/lucene/) to do all the hard work
> for us?  Then we could take advantage of all their expertise, as well as
> all those fancy queries that they support (e.g. X within four words of
> Y).  It's pretty extensible, so we could do things like filtering by
> book, chapter, and verse just by adding fields to the index.

Yes, in fact I refactored stuff a while ago to make the search engine 
pluggable, and Lucene is up there, but I've not finished coding the 
Lucene plug-in.

The current search engine does fancy things like "within 5 verses of"
and has some primitive stemming functionality, although (like Lucene's,
I think) it only works with English.
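
To illustrate the "within 5 verses of" idea, here is a minimal sketch. It is
not the actual JSword code: it assumes each verse is identified by a plain
integer ordinal, and the class and method names (VerseProximity, within) are
made up for this example. It uses modern Java collections for brevity.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper for this sketch; not part of JSword. */
final class VerseProximity {
    private VerseProximity() {}

    /**
     * Returns the verse ordinals in {@code a} that lie within {@code range}
     * verses (before or after) of at least one ordinal in {@code b}.
     */
    static SortedSet<Integer> within(SortedSet<Integer> a, SortedSet<Integer> b, int range) {
        SortedSet<Integer> result = new TreeSet<>();
        for (int verse : a) {
            // subSet is half-open, so add 1 to include verse + range itself.
            if (!b.subSet(verse - range, verse + range + 1).isEmpty()) {
                result.add(verse);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed inputs: ordinals of the verses matching two separate words.
        SortedSet<Integer> first = new TreeSet<>(Arrays.asList(10, 40, 100));
        SortedSet<Integer> second = new TreeSet<>(Arrays.asList(13, 60, 105));
        System.out.println(within(first, second, 5)); // prints [10, 100]
    }
}
```

Keeping the matches in sorted sets means each proximity check is a range
lookup rather than a scan of the whole second result.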

Comparison:
LuceneSearchEngine
+ Supports meta-data searches (so we could tag verses with dates and
     then do a search for "all Josephs alive at the time of Jesus")
+ Fast Indexing
+ Supports "phrase" and "find all words" searching

SerSearchEngine (our engine using Serialized index files)
+ Lightweight
+ Has query parser separated from indexer

> And if we get some nice tagged Greek texts, we might even be able to
> support "fancy" searches like the nicer Bible packages do that say "find
> me an aorist subjunctive 'baptizo' which has a 'de' right before it, and
> which is within five words of any form of the word 'sozo'."  Of course,
> if we don't have tagged Greek texts, we might be able to do this by a
> fancy stemmer, but that sounds complicated.... ;)  I'm not sure if
> Lucene actually supports all this stuff, but I do know it supports the
> "within X words of" operator, and we could probably extend it as needed.

If you want to have a go at enabling Lucene please do.
org.crosswire.jsword.book.search is the place to start looking. There is
a LuceneSearchEngine and a SerSearchEngine, both of which implement
SearchEngine. The Parser and Index interfaces are used to separate the
query parser from the indexer, but you should be able to ignore them both.
Finally, the Search class (in org.crosswire.jsword.book) is basically a
simple wrapper around a search string.
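
For a feel of what "pluggable" means here, a rough sketch follows. It does not
reproduce the real SearchEngine interface: SketchEngine, ScanEngine,
IndexEngine and Words are hypothetical names invented for this example, and
the verses live in a plain in-memory map. The point is only that two engines
with different internals can sit behind one small contract.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical engine contract for this sketch; not the real JSword SearchEngine. */
interface SketchEngine {
    /** Returns the keys of all verses containing every word of the query. */
    List<String> find(String query);
}

/** Shared tokenizer: lower-cases and splits on anything that is not a-z. */
final class Words {
    private Words() {}

    static Set<String> of(String text) {
        Set<String> out = new LinkedHashSet<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }
}

/** Engine 1: no index at all, scans every verse on each query. */
final class ScanEngine implements SketchEngine {
    private final Map<String, String> verses;

    ScanEngine(Map<String, String> verses) {
        this.verses = verses;
    }

    @Override
    public List<String> find(String query) {
        Set<String> wanted = Words.of(query);
        List<String> hits = new ArrayList<>();
        if (wanted.isEmpty()) {
            return hits; // an empty query matches nothing
        }
        for (Map.Entry<String, String> verse : verses.entrySet()) {
            if (Words.of(verse.getValue()).containsAll(wanted)) {
                hits.add(verse.getKey());
            }
        }
        return hits;
    }
}

/** Engine 2: builds a word -> verse-keys index up front, then intersects. */
final class IndexEngine implements SketchEngine {
    private final Map<String, Set<String>> index = new HashMap<>();

    IndexEngine(Map<String, String> verses) {
        for (Map.Entry<String, String> verse : verses.entrySet()) {
            for (String word : Words.of(verse.getValue())) {
                index.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(verse.getKey());
            }
        }
    }

    @Override
    public List<String> find(String query) {
        Set<String> hits = null;
        for (String word : Words.of(query)) {
            Set<String> postings = index.getOrDefault(word, Collections.emptySet());
            if (hits == null) {
                hits = new LinkedHashSet<>(postings);
            } else {
                hits.retainAll(postings);
            }
        }
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }
}

final class PluggableSearchDemo {
    public static void main(String[] args) {
        Map<String, String> verses = new LinkedHashMap<>();
        verses.put("Gen 1:1", "In the beginning God created the heaven and the earth.");
        verses.put("John 1:1", "In the beginning was the Word, and the Word was with God, and the Word was God.");
        verses.put("John 11:35", "Jesus wept.");
        SketchEngine[] engines = {new ScanEngine(verses), new IndexEngine(verses)};
        for (SketchEngine engine : engines) {
            System.out.println(engine.find("beginning god")); // prints [Gen 1:1, John 1:1]
        }
    }
}
```

Callers only see the interface, so swapping the scanning engine for the
indexed one (or, by analogy, a serialized-index engine for a Lucene-backed
one) should not require changes anywhere else.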

Joe.