[jsword-devel] search in jsword

Sun Oct 8 20:03:22 MST 2006

All,

I have started working on improving search by normalizing accented
characters (diacritics) in queries.
I plan to use the Lucene analysis contrib package for multi-lingual
support (http://lucene.apache.org/java/docs/api/index.html).
I am starting with a new AnalyzerFactory (and resource prop) that supplies
the appropriate Analyzer instance based on the Bible's language. The same
analyzer is used for indexing and for query parsing. Luckily our queries
always have a Bible in context, so we have a language at query time.
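
To make the accent normalization concrete, here is a minimal sketch using only
the JDK (java.text.Normalizer), not the Lucene contrib classes. The class name
AccentFolder and the method fold are made up for illustration. The assumption
is that the same folding would be applied to both indexed text and query text.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of JSword or Lucene.
final class AccentFolder {
    private AccentFolder() {}

    /** Decomposes each character (NFD) and then drops the combining marks. */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The escapes below are e-acute characters.
        System.out.println(fold("caf\u00e9 Jos\u00e9")); // prints "cafe Jose"
    }
}
```

Folding this way only helps if the index was built with the same folding, which
is why a single analyzer per language for both indexing and querying matters.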

All corrections / suggestions are welcome and appreciated!

-- Tokenization Analysis --
- Current tokenization is based on SimpleAnalyzer (which splits on
non-letter characters). It breaks for Chinese/Korean Bibles.

-- StopWord Analysis --
Question I asked myself: Is removing stop words useful for us?

Common words skew the results or return an overwhelming number of results.

Example: occurrences in KJV of the following words
the     24100
of      18200
be       5500
you      2000
your     1300
unto     7360

Occurrences in MKJV
you      8350
your     4600
shall    6400
Stop words are removed both during indexing and during query parsing. So
queries containing stop-word terms will return smaller result sets, and the
hits are determined by the other, more distinctive terms.
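
A small sketch of the stop-word step, assuming tokens are already lowercased.
The word list here is just the set of frequent words counted above, not a
curated list, and the class name StopWordFilterDemo is made up. The same filter
would run on indexed text and on the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical filter, for illustration only.
final class StopWordFilterDemo {
    private StopWordFilterDemo() {}

    // Illustrative list taken from the KJV/MKJV counts in this mail.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "of", "be", "you", "your", "unto", "shall"));

    /** Returns the tokens that are not stop words, in their original order. */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("the", "sins", "of", "your", "fathers");
        System.out.println(removeStopWords(query)); // prints [sins, fathers]
    }
}
```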

-- Stemming Analysis --
Basics: http://www.comp.lancs.ac.uk/computing/research/stemming/general/

Is stemming useful for us?

- It will be useful in many Latin-based languages, to treat accented and
corresponding unaccented characters as the same.
- Along with stop-word removal, it saves index space and search time.
- KJV examples where stemming can help:
Query:
+sin +lord +sight
Returns 15 results, but misses verses with 'sins'/'sinned', e.g. Deuteronomy
9:18, 2 Kings 24:3

Query:
+harp +sing
Returns 6 results. Misses harps/singers/singing, as in 1 Kings 10:12, 1 Chr
13:8, 15:16, 2 Chr 5:12, 9:11, Nehemiah 12:27

- If stemming is done by default, we can provide an exact-search operator,
e.g. exact(singers).
In that case we can either apply a post-retrieval filter or index both
stemmed and unstemmed content (which needs double the space).
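
To show the effect on the two example queries, here is a deliberately naive
suffix stripper. It is NOT the Porter, Lovins or Snowball algorithm, just a toy
with four suffixes and a made-up class name (ToyStemmer), written so the
examples above can be followed in code.

```java
import java.util.Locale;

// Toy suffix stripper for illustration; a real stemmer has many more rules.
final class ToyStemmer {
    private ToyStemmer() {}

    private static final String[] SUFFIXES = {"ers", "ing", "ed", "s"};
    private static final int MIN_STEM_LENGTH = 3;

    /** Strips at most one suffix, then undoubles a final consonant after -ed/-ing. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (w.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                w = w.substring(0, stemLength);
                boolean mayDouble = suffix.equals("ed") || suffix.equals("ing");
                if (mayDouble && endsWithDoubledConsonant(w)) {
                    w = w.substring(0, w.length() - 1); // sinn -> sin
                }
                break;
            }
        }
        return w;
    }

    private static boolean endsWithDoubledConsonant(String w) {
        int n = w.length();
        if (n <= MIN_STEM_LENGTH) {
            return false;
        }
        char last = w.charAt(n - 1);
        return last == w.charAt(n - 2) && "aeiou".indexOf(last) < 0;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"sin", "sins", "sinned", "singers", "singing"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

With this toy, 'sins' and 'sinned' both reduce to 'sin', and 'singers' and
'singing' reduce to 'sing', so the queries above would also match those verses
as long as the index was built with the same stemmer.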

-- Available functionality in Lucene Analyzer jar --

1. Both Stopwords & Stemming
Snowball based:
English (Porter or Lovins)
German
French
Dutch
Russian

Note: Another German stemming option (based on Joerg Caumanns' paper):
org.apache.lucene.analysis.de.GermanAnalyzer
Manfred, you may be interested in looking at its stop words and stemming. I
am guessing that the Snowball implementation is preferable.

2. Stopwords only, no Stemming (from Lucene analysis contrib package)
Czech
Greek (I do not know whether it applies to the Modern or Ancient Greek Bible)

3. Only Stemming  [Volunteers can contribute stopwords]
Snowball based:
 Spanish
 Portuguese
 Italian
 Swedish
 Norwegian
 Danish
 Finnish

4. Only tokenization (Lucene analysis contrib package)
ChineseAnalyzer (character tokenization): better suited than CJKAnalyzer
(which tokenizes into overlapping pairs of Chinese characters)
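
The difference between the two tokenization styles in item 4 can be sketched
without Lucene. The class and method names below are made up; this only
imitates the single-character and overlapping-pair output described above and
is not the analyzers' actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of unigram vs. overlapping bigram tokenization.
final class CjkTokenizationDemo {
    private CjkTokenizationDemo() {}

    /** One token per character (code point). */
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            tokens.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);
        }
        return tokens;
    }

    /** One token per pair of adjacent characters, each pair overlapping the next. */
    static List<String> overlappingBigrams(String text) {
        List<String> single = unigrams(text);
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < single.size(); i++) {
            tokens.add(single.get(i) + single.get(i + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\u5929\u5730\u4eba"; // three Chinese characters
        System.out.println(unigrams(text).size());           // prints 3
        System.out.println(overlappingBigrams(text).size()); // prints 2
    }
}
```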

-- Questions I have now --

- Changing analyzers would require current users to reindex every Bible
they have already indexed. How do we manage that? Should we have a version
framework for indexes, so that BD can force a reindex when a mismatch is found?

- Can we improve our search and index independently of Sword?

- In the long term, do we have any requirements and UI framework support to
extend search to non-Bible books?
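
On the index-versioning question, one possible shape is a small properties
file stored next to each index. Everything named here (the file name, the key,
the version value, the class IndexVersionCheck) is an assumption for
discussion, not existing JSword or BD code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical sketch of a per-index version stamp.
final class IndexVersionCheck {
    private IndexVersionCheck() {}

    static final String CURRENT_VERSION = "2";                  // assumed value
    static final String FILE_NAME = "index.version.properties"; // assumed name
    static final String KEY = "index.version";                  // assumed key

    /** Stamps the index directory with the version of the current analyzers. */
    static void writeVersion(Path indexDir) {
        Properties props = new Properties();
        props.setProperty(KEY, CURRENT_VERSION);
        try (OutputStream out = Files.newOutputStream(indexDir.resolve(FILE_NAME))) {
            props.store(out, "analyzer version used to build this index");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True when the stamp is missing or differs from the current version. */
    static boolean needsReindex(Path indexDir) {
        Path file = indexDir.resolve(FILE_NAME);
        if (!Files.isRegularFile(file)) {
            return true; // indexes built before versioning have no stamp
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return !CURRENT_VERSION.equals(props.getProperty(KEY));
    }
}
```

The front end could call needsReindex when a book is opened for search and
offer to rebuild the index, then call writeVersion once the rebuild finishes.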

Best,
Sijo