[jsword-devel] ser vs. lucene

Joe Walker joseph.walker at gmail.com
Mon Aug 30 12:31:23 MST 2004


I'm trying to summarise some of the facts in the ser vs. lucene debate.

Recap: ser is the search engine that I wrote a while ago. It comes in 2
parts: a word index and a search command interpreter. The word index
design is simple: it is a serialized (hence the name) hashmap where
the key is a word and the value is a key (list of matches).
Lucene is an Apache project that implements a search engine. It is
very widely used including in Eclipse, Jira, Scarab, SnipSnap and so
on. See http://jakarta.apache.org/lucene/
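
To make the ser word index design concrete, here is a minimal sketch of a
serialized hashmap from word to list of matches. It is not the actual ser
code: the class name WordIndexSketch, the use of plain integers as verse
ordinals, and the crude a-z tokenizer are all assumptions made for
illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch, not the real ser classes.
final class WordIndexSketch {
    // Build word -> sorted set of verse ordinals (position in the list).
    static Map<String, TreeSet<Integer>> build(List<String> verses) {
        Map<String, TreeSet<Integer>> index = new HashMap<>();
        for (int i = 0; i < verses.size(); i++) {
            // Assumption: words are runs of the letters a-z only.
            for (String word : verses.get(i).toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(i);
            }
        }
        return index;
    }

    // Serialize the whole map in one go using standard Java serialization.
    static byte[] save(Map<String, TreeSet<Integer>> index) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new HashMap<>(index));
        } catch (IOException ex) {
            throw new IllegalStateException(ex);
        }
        return bytes.toByteArray();
    }

    // Read the map back; the unchecked cast is safe for data written by save().
    @SuppressWarnings("unchecked")
    static Map<String, TreeSet<Integer>> load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Map<String, TreeSet<Integer>>) in.readObject();
        } catch (IOException | ClassNotFoundException ex) {
            throw new IllegalStateException(ex);
        }
    }
}
```

The whole map has to be read into memory before the first lookup, which is
part of why the design is simple.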

Benefits of Ser:
- Faster indexing (around 5 mins for an index compared to 6 for Lucene)
- Smaller by 240Kb
- Includes best match functionality (see below)

Benefits of Lucene:
- Faster searching (I think, I've not quantified this though)
- i18n, in that it has "stop-words" for languages other than English.
(A stop word is a word like "a" or "the" that isn't worth
including in an index.)
- Ability to store meta-data in the index (see below)
- More common search query language (see below)


Best Match
Ser has a best match function where you type in a verse as you think
it goes and it will try to find it for you. The algorithm simply stems
all the words (chops off common endings, so loves becomes love) and
then searches for all words that start with the root word, and returns
the verses that include the most hits. We could probably extract the
best match logic from ser and allow it to be applied to lucene.
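
The best match steps described above can be sketched roughly as follows.
This is not the ser implementation: the class name BestMatchSketch, the
short list of endings, the sorted-map index and the integer verse ordinals
are assumptions, and each query word counts as at most one hit per verse.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;

// Hypothetical sketch of stem + prefix search + rank by hit count.
final class BestMatchSketch {
    // Assumption: a tiny list of common endings; real stemming is richer.
    private static final String[] ENDINGS = {"ing", "ed", "s"};

    // Chop off the first matching ending, keeping at least 3 letters.
    static String stem(String word) {
        for (String ending : ENDINGS) {
            if (word.endsWith(ending) && word.length() > ending.length() + 2) {
                return word.substring(0, word.length() - ending.length());
            }
        }
        return word;
    }

    // Return up to 'limit' verse ordinals, most hits first.
    static List<Integer> bestMatch(NavigableMap<String, Set<Integer>> index,
                                   String query, int limit) {
        Map<Integer, Integer> hits = new HashMap<>();
        for (String word : query.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (word.isEmpty()) {
                continue;
            }
            String root = stem(word);
            // All indexed words that start with the root word.
            Set<Integer> verses = new HashSet<>();
            for (Set<Integer> matches
                    : index.subMap(root, root + Character.MAX_VALUE).values()) {
                verses.addAll(matches);
            }
            for (Integer verse : verses) {
                hits.merge(verse, 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(hits.keySet());
        // Most hits first; ties broken by verse order.
        ranked.sort((a, b) -> {
            int byHits = Integer.compare(hits.get(b), hits.get(a));
            return byHits != 0 ? byHits : Integer.compare(a, b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

Since this only needs a way to list the indexed words that start with a
given root and the verses for each, the same logic could in principle sit
on top of a different index.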

Meta-Data:
Lucene will allow us to have several separate indexes in the same unit
so we could include extra information like notes and marginal
references in the index. This means that with some extra data we could
do interesting searches like:
- Give me all the verses that include marginal references to this
verse. (which could be very useful since so many references are
frustratingly one way)
- Find all the verses that have notes with the text "some manuscripts omit"
- Find all the verses that happened within 1 lifetime of verse X.

Search Language:
A failing of ser is that the query parser does not understand "quoted
searches" where you want to ensure that the verse contains the given
words in the exact order given. It wouldn't be too hard to extend the
Ser query parser to be able to do quoted searches.
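
As a rough illustration of what a quoted search has to check, here is a
small sketch that splits a query into quoted phrases and bare words and
tests them against the text of one verse. It is not the Ser query parser:
the class name QuotedSearchSketch, the query syntax and the a-z tokenizer
are assumptions, and it checks verse text directly rather than using an
index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of quoted (exact order) matching.
final class QuotedSearchSketch {
    // Either a "quoted phrase" (group 1) or a bare token (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]+)\"|(\\S+)");

    // Assumption: words are runs of the letters a-z, compared in lower case.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    // True if every bare word occurs in the verse and every quoted phrase
    // occurs as consecutive words in the given order.
    static boolean matches(String query, String verse) {
        List<String> verseWords = words(verse);
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            String term = m.group(1) != null ? m.group(1) : m.group(2);
            List<String> termWords = words(term);
            if (!termWords.isEmpty()
                    && Collections.indexOfSubList(verseWords, termWords) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

A word index without positions could narrow the candidates to verses that
contain all the words, and a check like this would then confirm the order.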


I'll express opinions in a bit. I have no emotional attachment to
Ser so if you want to say "use Lucene because the ser code is rubbish"
then I won't mind!

Joe.

