[jsword-devel] search in jsword
Sijo Cherian
sijo.cherian at gmail.com
Sun Oct 8 20:03:22 MST 2006
All,
I have started working on improving search by normalizing accented
characters (diacritics) in queries.
I am planning to use the Lucene analysis contrib package for multilingual
support (http://lucene.apache.org/java/docs/api/index.html).
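A minimal sketch of the accent normalization idea, using only the JDK (java.text.Normalizer, Java 6+) rather than any Lucene filter. The class and method names are hypothetical. Letters with no canonical decomposition (for example 'ø' or 'ß') pass through unchanged, so this is only a first approximation.

```java
import java.text.Normalizer;

final class AccentFolding {
    // Decompose to NFD so an accented letter becomes base letter + combining
    // mark, then drop every combining mark (Unicode category M).
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "J\u00e9sus" is "Jesus" with e-acute as the second letter
        System.out.println(fold("J\u00e9sus")); // prints "Jesus"
    }
}
```

The same fold() would have to run on both the indexed text and the query, otherwise accented and unaccented forms still would not match.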
I am starting with a new AnalyzerFactory (and resource properties file) that
supplies the appropriate Analyzer instance based on the Bible's language. The
same analyzer is used for indexing and for query parsing. Luckily our queries
always have a Bible in context, so we have a language at query time.
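A rough sketch of the lookup such a factory could do, assuming the resource properties map a language name to an analyzer class name. The class AnalyzerLookup, the property keys, and the fallback choice are assumptions for illustration; the sketch only resolves the class name and leaves instantiating the Lucene Analyzer out, so it runs without Lucene on the classpath.

```java
import java.util.Properties;

final class AnalyzerLookup {
    // Fallback: the SimpleAnalyzer that the index uses today.
    static final String DEFAULT_ANALYZER = "org.apache.lucene.analysis.SimpleAnalyzer";

    private final Properties mapping;

    AnalyzerLookup(Properties mapping) {
        this.mapping = mapping;
    }

    // Returns the analyzer class name configured for a language,
    // or the default when the language is unknown or missing.
    String analyzerClassFor(String language) {
        if (language == null) {
            return DEFAULT_ANALYZER;
        }
        return mapping.getProperty(language, DEFAULT_ANALYZER);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical key; the value is the GermanAnalyzer class in Lucene contrib.
        props.setProperty("German", "org.apache.lucene.analysis.de.GermanAnalyzer");
        AnalyzerLookup lookup = new AnalyzerLookup(props);
        System.out.println(lookup.analyzerClassFor("German"));
        System.out.println(lookup.analyzerClassFor("Korean")); // falls back to the default
    }
}
```

Because indexing and query parsing both go through the same lookup with the same Bible language, they cannot end up with different analyzers.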
All corrections / suggestions are welcome and appreciated!
-- Tokenization Analysis --
- Current tokenization is based on SimpleAnalyzer (it splits on non-letter
characters). It breaks for Chinese/Korean Bibles.
-- Stopword Analysis --
Question I asked myself: is removing stop words useful for us?
Common words skew the results or return an overwhelming number of hits.
Example: occurrences of the following words in the KJV
the     24100
of      18200
be       5500
you      2000
your     1300
unto     7360
Occurrences in the MKJV
you      8350
your     4600
shall    6400
Stop words are removed both during indexing and during query parsing. So
queries containing stop words will return smaller result sets, and the hits
are determined by the other, more distinctive terms.
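A small sketch of the effect, assuming a plain letter-based tokenizer and a tiny stop list taken from the words counted above. The stop list is illustrative only, not a proposed final list, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordFilter {
    // Illustrative stop list only, built from the high-frequency words above.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "of", "be", "you", "your", "unto", "shall"));

    // Lowercase, split on non-letters, and drop stop words.
    // The same method is meant to be applied to verse text and to query text.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("the sight of the LORD")); // prints [sight, lord]
    }
}
```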
-- Stemming Analysis --
Basics: http://www.comp.lancs.ac.uk/computing/research/stemming/general/
Is stemming useful for us?
- It will be useful in many Latin-script languages, to treat accented and the
corresponding unaccented characters as the same.
- Along with stopword removal, it saves index space and search time.
- KJV examples where stemming can help:
Query:
+sin +lord +sight
Returns 15 results, but misses verses with 'sins'/'sinned', e.g. Deuteronomy
9:18, 2 Kings 24:3.
Query:
+harp +sing
Returns 6 results, but misses harps/singers/singing, as in 1 Kings 10:12,
1 Chr 13:8, 15:16, 2 Chr 5:12, 9:11, Nehemiah 12:27.
- If stemming is done by default, we can provide an exact-search operator,
e.g. exact(singers).
In that case we can either do a post-retrieval filter, or index both stemmed
and unstemmed content (which needs roughly double the space).
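A sketch of the post-retrieval filter option, assuming the stemmed search has already returned candidate verse texts and exact(word) then keeps only those that contain the unstemmed word as a whole token. The class name, method signature, and the sample strings are made up for illustration; they are not real verse text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ExactFilter {
    // Keeps only the candidates that contain `word` as a whole,
    // case-insensitive token (no stemming applied).
    static List<String> exact(String word, List<String> candidates) {
        String wanted = word.toLowerCase(Locale.ROOT);
        List<String> kept = new ArrayList<>();
        for (String verse : candidates) {
            for (String token : verse.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
                if (token.equals(wanted)) {
                    kept.add(verse);
                    break; // one match is enough for this verse
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for hits of a stemmed search on "sing".
        List<String> hits = Arrays.asList("the singers sang", "they sing with harps");
        System.out.println(exact("singers", hits)); // prints [the singers sang]
    }
}
```

The filter costs one extra pass over the hits but no extra index space, which is the trade-off against indexing both forms.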
-- Available functionality in the Lucene analyzer jar --
1. Both Stopwords & Stemming
Snowball based:
English (Porter or Lovins)
German
French
Dutch
Russian
Note: Another German stemming option (based on Joerg Caumanns' paper):
org.apache.lucene.analysis.de.GermanAnalyzer
Manfred, you may be interested in looking at the stopwords & stemming in it.
I am guessing that the Snowball implementation is preferable.
2. Stopwords only, no Stemming (from the Lucene analysis contrib package)
Czech
Greek (I do not know whether it applies to a Modern or an Ancient Greek Bible)
3. Only Stemming [Volunteers can contribute stopwords]
Snowball based:
Spanish
Portuguese
Italian
Swedish
Norwegian
Danish
Finnish
4. Only tokenization (Lucene analysis contrib package)
ChineseAnalyzer (character tokenization): better suited than CJKAnalyzer
(which tokenizes into overlapping pairs of Chinese characters)
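To make the difference between the two tokenization styles concrete, here is a simplified JDK-only sketch. It is not the Lucene implementation of either analyzer; the class and method names are hypothetical, and the bigram version simply pairs consecutive letters without treating punctuation as a boundary.

```java
import java.util.ArrayList;
import java.util.List;

final class CjkTokenizers {
    // One token per letter (character tokenization).
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                out.add(new String(Character.toChars(cp)));
            }
            i += Character.charCount(cp);
        }
        return out;
    }

    // Overlapping pairs of consecutive letters (bigram tokenization).
    static List<String> bigrams(String text) {
        List<String> letters = unigrams(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < letters.size(); i++) {
            out.add(letters.get(i) + letters.get(i + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        String phrase = "\u795e\u611b\u4e16\u4eba"; // four Chinese characters
        System.out.println(unigrams(phrase).size()); // prints 4
        System.out.println(bigrams(phrase).size());  // prints 3
    }
}
```

With character tokens a single-character query term can match directly, whereas with bigrams every indexed token is two characters long.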
-- Questions I have now --
- Changing analyzers would require current users to reindex every Bible that
they have already indexed. How do we manage that? Should we have a version
framework for indexes, so that BD can force a reindex when a mismatch is found?
- Can we improve our search & index independently of Sword?
- In the long term, do we have any requirements & UI framework support for
extending search to non-Bible books?
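For the first question, a minimal sketch of what a version check could look like, assuming each index stores a small properties file next to it. The key name, the version number, and the class are all hypothetical; nothing like this exists in the code yet.

```java
import java.util.Properties;

final class IndexVersionCheck {
    // Hypothetical metadata key and current version; bump the version
    // whenever the analyzer for any language changes.
    static final String VERSION_KEY = "index.analyzer.version";
    static final int CURRENT_VERSION = 2;

    // True when the stored index was built with a different (or unknown)
    // analyzer version and therefore has to be rebuilt.
    static boolean needsReindex(Properties indexMetadata) {
        String stored = indexMetadata.getProperty(VERSION_KEY);
        if (stored == null) {
            return true; // index predates versioning
        }
        try {
            return Integer.parseInt(stored.trim()) != CURRENT_VERSION;
        } catch (NumberFormatException e) {
            return true; // unreadable version: safest to rebuild
        }
    }

    public static void main(String[] args) {
        Properties meta = new Properties();
        System.out.println(needsReindex(meta)); // prints true (no version stored)
        meta.setProperty(VERSION_KEY, "2");
        System.out.println(needsReindex(meta)); // prints false
    }
}
```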
Best,
Sijo