[jsword-devel] ser vs. lucene
DM Smith
dmsmith555 at yahoo.com
Mon Aug 30 13:01:26 MST 2004
I think that an evaluation should include what features we have and what
features we want.
Some of the things I would like to see would be:
Robust query language (and, or, not, within, after, ...)
Phrase searching
Stop word reduction as a preference of the user on a per-Bible basis.
Internationalization of connectors (which ser has since it uses symbols)
Natural language connectors (a mapping of unquoted words to symbols on a
per-language basis. For example, in English an unquoted and means &,
while a quoted "AND" is searched for as the word itself)
Natural language search (for which I think match is a good initial
implementation)
Stem searching such as depluralization.
Thesaurus/synonym/equivalence searching.
Normalization of accented text and search request, with disambiguation
using original text and original search request.
Transliterated search for Hebrew and Greek.
Highlighting of hit words.
Wild-carding of words.
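To make the natural language connector idea above concrete, here is a
minimal Java sketch. Everything in it is an assumption for illustration:
the class name ConnectorMapper, the english() table, and the "|" symbol
for "or" are hypothetical (only and == & comes from the list above). It
maps unquoted connector words to symbols and leaves quoted text alone.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical per-language mapping of connector words to query symbols. */
class ConnectorMapper {
    // A token is either a double-quoted run or a run of non-whitespace.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]*\"|\\S+");

    private final Map<String, String> connectors;

    ConnectorMapper(Map<String, String> wordsToSymbols) {
        this.connectors = new HashMap<>(wordsToSymbols);
    }

    /** Illustrative English table; "|" is a placeholder symbol. */
    static ConnectorMapper english() {
        Map<String, String> m = new HashMap<>();
        m.put("and", "&");
        m.put("or", "|");
        return new ConnectorMapper(m);
    }

    /** Replaces unquoted connector words with symbols; quoted tokens pass through. */
    String translate(String query) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            String token = m.group();
            String symbol = token.startsWith("\"")
                    ? null
                    : connectors.get(token.toLowerCase(Locale.ROOT));
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(symbol != null ? symbol : token);
        }
        return out.toString();
    }
}
```

A table for another language would just be a different map passed to the
constructor, so the symbolic query language underneath stays the same.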
Some of this depends on how we feed text to the serialization/indexing
engine. If the index has a location for each word roughly equivalent to
(ordinal verse number, position of word in verse) then we can do it,
either using the engine's query language or by layering our own on top
of it.
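As a rough sketch of the kind of index described above, the following
Java stores (ordinal verse number, position of word in verse) for every
word and uses it for phrase searching. The names PositionalIndex,
Posting, addVerse and phrase are hypothetical, and tokenizing is plain
whitespace splitting with no punctuation handling; this is neither ser's
nor Lucene's actual structure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical positional index: word -> (verse ordinal, word position). */
class PositionalIndex {
    /** One occurrence of a word; both fields are zero-based. */
    static final class Posting {
        final int verse;
        final int position;

        Posting(int verse, int position) {
            this.verse = verse;
            this.position = position;
        }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();

    /** Splits on whitespace and records where each lower-cased word occurs. */
    void addVerse(int verse, String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        for (int i = 0; i < words.length; i++) {
            postings.computeIfAbsent(words[i], k -> new ArrayList<>())
                    .add(new Posting(verse, i));
        }
    }

    /** Ordinals of verses that contain the words as a consecutive phrase. */
    List<Integer> phrase(String... words) {
        List<Integer> hits = new ArrayList<>();
        if (words.length == 0) {
            return hits;
        }
        for (Posting start : lookup(words[0])) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                match = occurs(words[i], start.verse, start.position + i);
            }
            if (match && !hits.contains(start.verse)) {
                hits.add(start.verse);
            }
        }
        return hits;
    }

    private List<Posting> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.<Posting>emptyList());
    }

    private boolean occurs(String word, int verse, int position) {
        for (Posting p : lookup(word)) {
            if (p.verse == verse && p.position == position) {
                return true;
            }
        }
        return false;
    }
}
```

Connectors such as within and after could be layered on the same data by
comparing the positions of two words in the same verse instead of
requiring them to be adjacent.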
If Lucene has wild-carding of words and robust location information, I
think we could write match on top of Lucene.
Joe Walker wrote:
>I'm trying to summarise some of the facts in the ser vs. lucene debate
>
>Recap: ser is the search engine that I wrote a while ago. It comes in 2
>parts: a word index and a search command interpreter. The word index
>design is simple: it is a serialized (hence the name) hashmap where
>the key is a word and the value is a key (list of matches).
>Lucene is an Apache project that implements a search engine. It is
>very widely used including in Eclipse, Jira, Scarab, SnipSnap and so
>on. See http://jakarta.apache.org/lucene/
>
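For readers unfamiliar with the design in the recap, a serialized
hashmap index can be sketched in a few lines of Java. This is only an
illustration under assumptions: the class SerializedIndexSketch is
hypothetical, a list of verse ordinals stands in for the key (list of
matches), and the bytes stay in memory rather than going to ser's real
file format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;

/** Hypothetical word index saved and restored with standard Java serialization. */
class SerializedIndexSketch {
    /** Serializes the whole word -> verse-ordinal map in one go. */
    static byte[] save(HashMap<String, ArrayList<Integer>> index) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(index);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the map back; the cast is unchecked because of type erasure. */
    @SuppressWarnings("unchecked")
    static HashMap<String, ArrayList<Integer>> load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (HashMap<String, ArrayList<Integer>>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A single-word search is then one map lookup, which is why the design is
simple; it also means the index records which verses a word is in but
not where in the verse it occurs.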
>Benefits of Ser:
>- Faster indexing (around 5 mins for an index compared to 6 for Lucene)
>- Smaller by 240Kb
>
>
Did you mean the class size or the resulting index size?
>- Includes best match functionality (see below)
>
>Benefits of Lucene:
>- Faster searching (I think, I've not quantified this though)
>- i18n in that it has "stop-words" for languages other than English.
>(A stop word is like "a", "the" and so on that it isn't worth
>including in an index)
>- Ability to store meta-data in the index (see below)
>- More common search query language (see below)
>
>
>Best Match
>Ser has a best match function where you type in a verse as you think
>it goes and it will try to find it for you. The algorithm simply stems
>all the words (chops off common endings, so loves becomes love) and
>then searches for all words that start with the root word, and returns
>the verses that include the most hits. We could probably extract the
>best match logic from ser and allow it to be applied to lucene.
>
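The best match steps described above (stem each word, find every index
word that starts with the root, rank verses by hits) can be sketched in
Java as follows. This is not ser's code: BestMatchSketch and its methods
are hypothetical, the stemmer is a naive stand-in that only chops "ing",
"ed" and "s", and a verse's score is the number of query words it
matches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical best match: stem, prefix-search, rank verses by hit count. */
class BestMatchSketch {
    // Sorted map so all words sharing a prefix sit next to each other.
    private final TreeMap<String, Set<Integer>> index = new TreeMap<>();

    void addVerse(int verse, String text) {
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) {
                index.computeIfAbsent(w, k -> new TreeSet<>()).add(verse);
            }
        }
    }

    /** Naive stand-in stemmer: chops one common English ending, if any. */
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Verse ordinals, best first; ties are broken by verse order. */
    List<Integer> bestMatch(String query) {
        Map<Integer, Integer> score = new HashMap<>();
        for (String w : query.toLowerCase().split("[^a-z]+")) {
            if (w.isEmpty()) {
                continue;
            }
            String root = stem(w);
            Set<Integer> verses = new TreeSet<>();
            // Walk the index words from the root until the prefix stops matching.
            for (Map.Entry<String, Set<Integer>> e : index.tailMap(root, true).entrySet()) {
                if (!e.getKey().startsWith(root)) {
                    break;
                }
                verses.addAll(e.getValue());
            }
            for (int v : verses) {
                score.merge(v, 1, Integer::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(score.keySet());
        ranked.sort((a, b) -> {
            int byScore = Integer.compare(score.get(b), score.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }
}
```

Because the sketch only needs a word-to-verses lookup with prefix
matching, the same ranking logic could in principle sit on top of any
index that offers wild-carding of words.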
>Meta-Data:
>Lucene will allow us to have several separate indexes in the same unit
>so we could include extra information like notes and marginal
>references in the index. This means that with some extra data we could
>do interesting searches like:
>- Give me all the verses that include marginal references to this
>verse. (which could be very useful since so many references are
>frustratingly one way)
>- Find all the verses that have notes with the text "some manuscripts omit"
>- Find all the verses that happened within 1 lifetime of verse X.
>
>Search Language:
>A failing of ser is that the query parser does not understand "quoted
>searches" where you want to ensure that the verse contains the given
>words in the exact order given. It wouldn't be too hard to extend the
>Ser query parser to be able to do quoted searches.
>
>
>I'll express opinions in a bit. I have no emotional attachment to
>Ser so if you want to say "use Lucene because the ser code is rubbish"
>then I won't mind!
>
>Joe.