[jsword-devel] search in jsword
sijo.cherian at gmail.com
Sun Oct 8 20:03:22 MST 2006
I started working on improving search by normalizing accented (diacritic)
characters in the query.
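A minimal sketch of such accent folding, using only the JDK's java.text.Normalizer (the class name DiacriticFolder is illustrative, not existing jsword code):

```java
import java.text.Normalizer;

public class DiacriticFolder {
    /** Decompose to NFD, then strip combining marks (Unicode category M). */
    public static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Applying the same folding at index time and query time means "café" and "cafe" hit the same term.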
I am planning to use the Lucene analysis contrib package for multi-lingual support.
I am starting with a new AnalyzerFactory (and resource properties file) that supplies
the appropriate Analyzer instance based on the Bible's language. The same analyzer
is used for indexing and for query parsing. Luckily our queries always have a
Bible in context, so we have a language at query time.
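The factory could be as simple as a language-code-to-analyzer lookup backed by the resource properties. The sketch below returns analyzer class names so it stands alone without Lucene on the classpath; the class names reflect the Lucene 2.x core/contrib packages, and the factory itself is hypothetical, not existing jsword API:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: pick an analyzer per Bible language code. */
public class AnalyzerFactory {
    private static final Map<String, String> ANALYZERS = new HashMap<>();
    static {
        // Mappings would come from the resource properties file.
        ANALYZERS.put("en", "org.apache.lucene.analysis.snowball.SnowballAnalyzer");
        ANALYZERS.put("de", "org.apache.lucene.analysis.de.GermanAnalyzer");
        ANALYZERS.put("el", "org.apache.lucene.analysis.el.GreekAnalyzer");
        ANALYZERS.put("zh", "org.apache.lucene.analysis.cn.ChineseAnalyzer");
    }

    /** Fall back to SimpleAnalyzer when no language-specific analyzer is known. */
    public static String analyzerFor(String languageCode) {
        return ANALYZERS.getOrDefault(languageCode,
                "org.apache.lucene.analysis.SimpleAnalyzer");
    }
}
```

Because the same factory is consulted at index time and at query-parse time, the two sides always agree on tokenization, stop words, and stemming.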
All corrections / suggestions are welcome and appreciated!
-- Tokenization Analysis --
- Current tokenization is based on SimpleAnalyzer (tokens are split on non-letter
characters). It breaks for Chinese/Korean Bibles.
Question I asked myself: Is removing stop words useful for us?
Common words skew the results or return overwhelming result sets.
Example: occurrences in the KJV of the following words
Occurrences in the MKJV
During both indexing & query parsing, the stop words are removed. So queries
containing stop-word terms will return fewer results,
and hits are influenced by the other, unique terms.
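To illustrate the effect, here is a toy version of what a stop-word filter does at both index and query time (the stop list is a tiny sample; real lists such as Lucene's are larger):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordDemo {
    // A few English stop words for illustration only.
    private static final Set<String> STOP_WORDS =
            Set.of("the", "of", "and", "in", "a", "unto");

    /** Lower-case, split on non-letters, drop stop words. */
    public static List<String> filter(String query) {
        return Arrays.stream(query.toLowerCase().split("[^\\p{L}]+"))
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }
}
```

A query like "in the beginning" reduces to the single term "beginning", so the hit list is driven entirely by the unique word rather than swamped by common ones.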
Is stemming useful for us?
- It will be useful in many Latin-script languages to treat accented & corresponding
unaccented characters as the same.
- Along with stop-word removal, it saves index space & search time.
- KJV examples where stemming can benefit:
+sin +lord +sight
Returns 15 results, but misses verses with 'sins'/'sinned', e.g. Deuteronomy
9:18, 2 Kings 24:3.
Returns 6 results. Misses harps/singers/singing as in 1 Kings 10:12, 1 Chr
13:8, 15:16, 2 Chr 5:12, 9:11, Nehemiah 12:27.
- If stemming is done by default, we can provide an exact-search operator, eg
In that case we can do a post-retrieval filter, or index both stemmed &
unstemmed content (roughly double the index space needed).
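For illustration, a toy suffix stripper shows how 'sin', 'sins', and 'sinned' collapse to one index term; a real index would use Porter, Lovins, or Snowball, which handle far more cases correctly:

```java
public class ToyStemmer {
    /** Toy suffix stripping for illustration only; not a real stemmer. */
    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ning", "ing", "ned", "ed", "s"}) {
            // Keep a minimum stem length so short words survive intact.
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }
}
```

When both the indexed text and the query pass through the same stemmer, a query term "sin" also matches verses containing "sins" or "sinned", closing the gaps in the KJV examples above.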
-- Available functionality in the Lucene Analyzer jar --
1. Both stop words & stemming:
English (Porter or Lovins)
Note: there is another German stemming option (based on Joerg Caumanns' paper).
Manfred, you may be interested to look at the stop words & stemming in it. I
am guessing that the Snowball implementation
2. Stop words only, no stemming (from the Lucene analysis contrib package):
Greek (I do not know if this applies to Modern or Ancient Greek Bibles)
3. Stemming only [volunteers can contribute stop words]
4. Tokenization only (Lucene analysis contrib package):
ChineseAnalyzer (single-character tokenization): better suited than CJKAnalyzer
(which produces overlapping two-character tokens)
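The difference between the two is easy to sketch in plain Java (unigrams as in ChineseAnalyzer, overlapping bigrams as in CJKAnalyzer):

```java
import java.util.ArrayList;
import java.util.List;

public class CjkTokenDemo {
    /** One token per character: the ChineseAnalyzer approach. */
    public static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    /** Overlapping two-character tokens: the CJKAnalyzer approach. */
    public static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }
}
```

For a three-character string the unigram tokenizer emits three single-character terms, while the bigram tokenizer emits two overlapping pairs; single-character terms match any phrasing of the query, which is why they suit our verse-sized documents better.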
-- Questions I have now --
- Changing analyzers would require current users to reindex all Bibles that
they have already indexed. How do we manage that? Should we have a version
framework for indexes, so BD can force a reindex when a mismatch is found?
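One possible shape for such a version framework, assuming a small properties file stored next to each index (the file name, property key, and class are hypothetical, not existing jsword code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical sketch: stamp each index with the indexing-schema version
 *  and force a rebuild when it no longer matches the application's. */
public class IndexVersionCheck {
    public static final String CURRENT_VERSION = "2"; // bumped when analyzers change

    public static boolean needsReindex(Path indexDir) throws IOException {
        Path meta = indexDir.resolve("index.properties");
        if (!Files.exists(meta)) {
            return true; // legacy index with no version stamp
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(meta)) {
            props.load(in);
        }
        return !CURRENT_VERSION.equals(props.getProperty("index.version"));
    }
}
```

On startup BD would call this per installed index and prompt (or silently schedule) a reindex wherever it returns true, so old SimpleAnalyzer indexes never mix with the new per-language ones.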
- Can we improve our search & indexing independently of Sword?
- In the long term, do we have any requirements & UI framework support to extend
search to non-Bible books?