[jsword-devel] search in jsword
DM Smith
dmsmith555 at yahoo.com
Mon Oct 9 08:00:05 MST 2006
Sijo Cherian wrote:
> All,
>
> I started working on improving search by normalizing accented
> characters (diacritics) in the query.
Welcome Sijo! I am glad to have your contributions!
> Planning to use the Lucene analysis contrib package for multi-lingual
> support (http://lucene.apache.org/java/docs/api/index.html).
> I am starting with a new AnalyzerFactory (and resource prop) that
> supplies an appropriate instance of Analyzer based on the Bible's
> language. The same analyzer
> is used for indexing and for query parsing.
Yes, Lucene requires that the same analyzer be used for indexing and
searching.
And your implementation falls in with the JSword architecture!
In the resource prop, how about a default entry that is used if we have
a miss (either the language is not found or reflection fails) and that
supplies the current analyzer? That default analyzer must be one we can
"assert" will always work.
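To make the fallback concrete, here is a minimal sketch of what such a
factory might look like. The class name, the property keys, and the
"Default" entry are only assumptions for illustration, not the actual
JSword resource format:

    import java.util.Properties;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;

    /**
     * Sketch: look up an Analyzer class by language in a resource property
     * file and fall back to the current SimpleAnalyzer on any miss.
     */
    public final class AnalyzerFactory {
        private final Properties mapping;

        public AnalyzerFactory(Properties mapping) {
            this.mapping = mapping;
        }

        public Analyzer createAnalyzer(String language) {
            // "Default" is an assumed key naming the fallback analyzer class.
            String className = mapping.getProperty(language, mapping.getProperty("Default"));
            if (className != null) {
                try {
                    return (Analyzer) Class.forName(className).newInstance();
                } catch (Exception ex) {
                    // language not found or reflection failed: fall through
                }
            }
            // The analyzer JSword uses today; it must always work.
            return new SimpleAnalyzer();
        }
    }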
> Luckily our queries always have a Bible in context, so we have a
> language at query time.
>
> All corrections / suggestions are welcome and appreciated!
>
> -- Tokenization Analysis --
> - Current tokenization is based on SimpleAnalyzer (non-letter-based
> tokenization). It breaks for Chinese/Korean Bibles.
I'm sure it also breaks on Thai.
The value of SimpleAnalyzer over StandardAnalyzer is that it does not
throw away stop words, does not strip 's from the end of words, and does
not remove '.' from possible acronyms.
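A quick way to see the difference is to dump the tokens each analyzer
produces. A rough sketch against the Lucene 1.x/2.x API (the "content"
field name is just a placeholder):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    /** Prints the tokens an analyzer produces so two analyzers can be compared. */
    public class AnalyzerDump {
        static void dump(Analyzer analyzer, String text) throws Exception {
            TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
            for (Token t = stream.next(); t != null; t = stream.next()) {
                System.out.print("[" + t.termText() + "] ");
            }
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String verse = "In the beginning God created the heaven and the earth.";
            dump(new SimpleAnalyzer(), verse);   // keeps "in", "the", "and"
            dump(new StandardAnalyzer(), verse); // drops English stop words
        }
    }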
>
> -- StopWord Analysis --
> Question I asked myself: Is removing stop words useful for us?
From a theological perspective, stop words are frequently part of a
significant theological phrase, such as "in Him". We might make using
stop words a preference, but for those who use JSword for study, as
opposed to lookup, it is important.
When including stop words it is often good to do a prioritized search.
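One way to read "prioritized search" in Lucene terms is to keep the
stop-word clause but make it optional and down-boosted, so that verses
matching the whole phrase still rank first. A sketch only; the "content"
field name and the boost value are illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class PrioritizedSearchExample {
        /** Builds a query for "in him": the distinctive term is required,
         *  the stop word is optional and down-weighted. */
        static Query buildInHimQuery() {
            BooleanQuery query = new BooleanQuery();

            query.add(new TermQuery(new Term("content", "him")), BooleanClause.Occur.MUST);

            Query in = new TermQuery(new Term("content", "in"));
            in.setBoost(0.1f); // de-emphasize rather than discard the common word
            query.add(in, BooleanClause.Occur.SHOULD);

            return query;
        }
    }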
>
> Common words skew the results or return an overwhelming number of hits.
>
> Example Occurrences in the KJV of the following words:
> the 24100
> of 18200
> be 5500
> you 2000
> your 1300
> unto 7360
>
> Occurrences in the MKJV:
> you 8350
> your 4600
> shall 6400
>
> Both during indexing & query parsing, the stop words are removed. So
> queries containing stop-word terms will return fewer results,
> and hits are influenced by the other, more distinctive terms.
>
>
> -- Stemming Analysis --
> Basics:
> http://www.comp.lancs.ac.uk/computing/research/stemming/general/
>
> Is Stemming useful for us?
It is on our wish list. See www.crosswire.org/bugs and look under
JSword. You will see that you are on the right track!
(You might have to sign up for access to the "bugs" database.)
http://www.crosswire.org/bugs/browse/JS-18 Don't index accents
(actually, we may want to do both)
http://www.crosswire.org/bugs/browse/JS-19 Implement ICU4J (normalize
the UTF-8 representation; I'm not sure whether it should be NFD or something else)
http://www.crosswire.org/bugs/browse/JS-20 Add the ability to search
transliterations of Greek and Hebrew
http://www.crosswire.org/bugs/browse/JS-21 Add the ability to search
by word stems
>
> - In many Latin-script languages, it will be useful to treat accented &
> the corresponding unaccented characters as the same.
> - Along with stopword removal, it saves index space & search time.
> - KJV examples where stemming can benefit:
> Query:
> +sin +lord +sight
> Returns 15 results, but misses verses with 'sins'/'sinned', e.g.
> Deuteronomy 9:18, 2 Kings 24:3
While stemming is very useful, we already support wild card searching,
so +sin* will find sins and sinned.
What a wildcard does not find is "sung" when "sings" is searched, so
stemming is still useful.
We also support fuzzy searching. This proves marginally useful when
searching for words whose spelling is inconsistent (like place names) or
unknown. The problem with it is that it frequently produces surprising
results. So stemming is useful here too.
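For reference, a rough sketch of what the wildcard and fuzzy forms look
like through Lucene's QueryParser. The index path and the "content"
field name are placeholders, not JSword's actual layout:

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class WildcardFuzzyExample {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/kjv/index"); // placeholder
            QueryParser parser = new QueryParser("content", new SimpleAnalyzer());

            Query wildcard = parser.parse("+sin* +lord +sight"); // sin, sins, sinned, ...
            Query fuzzy = parser.parse("sidon~");                // fuzzy match for uncertain spellings

            Hits hits = searcher.search(wildcard);
            System.out.println(hits.length() + " wildcard hits");
            System.out.println(searcher.search(fuzzy).length() + " fuzzy hits");

            searcher.close();
        }
    }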
>
> Query:
> +harp +sing
> Returns 6 results. Misses harps/singers/singing as in 1 Kings 10:12,
> 1 Chr 13:8, 15:16, 2 Chr 5:12, 9:11, Nehemiah 12:27
>
> - If stemming is done by default, we can provide an exact-search
> operator, e.g. exact(singers).
> In that case we can do a post-retrieval filter, or index both stemmed
> & unstemmed content (double the index space needed).
>
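If the double-index route were taken, Lucene's PerFieldAnalyzerWrapper
would let the main field be stemmed while a second field stays unstemmed
for an exact() operator. A rough sketch; the field names ("content",
"content_exact") and the index path are hypothetical:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class DualFieldIndexExample {
        public static void main(String[] args) throws Exception {
            // Stem the main field, leave the "exact" field untouched.
            PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
            analyzer.addAnalyzer("content_exact", new SimpleAnalyzer());

            IndexWriter writer = new IndexWriter("/tmp/kjv-index", analyzer, true); // placeholder

            String verseText = "sample verse text";
            Document doc = new Document();
            doc.add(new Field("content", verseText, Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("content_exact", verseText, Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);

            writer.close();
        }
    }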
> -- Available functionality in Lucene Analyzer jar --
>
> 1. Both Stopwords & Stemming
> Snowball based:
> English (Porter or Lovins)
> German
> French
> Dutch
> Russian
>
> Note: Another German stemming option (based on Joerg Caumanns' paper):
> org.apache.lucene.analysis.de.GermanAnalyzer
> Manfred, you may be interested to look at the stopwords & stemming in
> it. I am guessing that the Snowball implementation
> is preferable.
>
> 2. Stopwords only, no Stemming (from the Lucene analysis contrib package)
> Czech
> Greek (I do not know whether it applies to Modern or Ancient Greek Bibles)
>
> 3. Only Stemming [Volunteers can contribute stopwords]
> Snowball based:
> Spanish
> Portuguese
> Italian
> Swedish
> Norwegian
> Danish
> Finnish
>
> 4. Only tokenization (Lucene analysis contrib package)
> ChineseAnalyzer (single-character tokenization): Better suited than
> CJKAnalyzer (which produces overlapping two-character Chinese tokens)
>
Great analysis! Thanks!
If I recall correctly, it is possible to supply a stopword list to most
analyzers, overriding the one that is built in, which would allow us to
turn off stopwords if so desired.
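If I have the contrib constructors right, the override looks roughly
like this; an empty array effectively disables stop-word removal while
keeping the rest of the analysis (stemming, etc.):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class StopWordOverrideExample {
        static Analyzer englishNoStopWords() {
            // Snowball stemming kept, stop-word removal disabled.
            return new SnowballAnalyzer("English", new String[0]);
        }

        static Analyzer standardNoStopWords() {
            // StandardAnalyzer also accepts a replacement stop-word list.
            return new StandardAnalyzer(new String[0]);
        }
    }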
> -- Questions I have now --
>
> - Changing analyzers would require current users to re-index every Bible
> that they have already indexed. How do we manage that? Should we have
> a version framework for indexes, so that BD can force a re-index when a
> mismatch is found?
Yes, we should have a version framework. I don't think we should force a
re-index but rather offer it. The indexing operation is so compute
intensive that it is good to let people with underpowered hardware
choose a time that is appropriate for them.
Perhaps we should have a resource file for each built index which
records the version info and other metadata about the indexing. Some
things I can think of: the analyzer used, and whether stop words were
used or not. Also the version of Lucene: from one version of Lucene to
the next, the analyzers can change, necessitating a re-index.
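As a sketch of what such a resource file could record (the file name,
keys, and values below are illustrative only, not a proposed format):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Properties;

    /** Hypothetical per-index metadata file; key names are illustrative. */
    public final class IndexMetadata {
        static final String FILE_NAME = "index.properties";
        static final String INDEX_VERSION = "1";

        static void write(File indexDir, String analyzerClass, boolean stopWords) throws IOException {
            Properties props = new Properties();
            props.setProperty("index.version", INDEX_VERSION);
            props.setProperty("analyzer", analyzerClass);
            props.setProperty("stopwords", String.valueOf(stopWords));
            props.setProperty("lucene.version", "2.0.0"); // whatever version built the index
            FileOutputStream out = new FileOutputStream(new File(indexDir, FILE_NAME));
            props.store(out, "index metadata (sketch)");
            out.close();
        }

        /** True when a re-index should be offered (not forced). */
        static boolean reindexNeeded(File indexDir) throws IOException {
            File file = new File(indexDir, FILE_NAME);
            if (!file.exists()) {
                return true; // index predates the version framework
            }
            Properties props = new Properties();
            FileInputStream in = new FileInputStream(file);
            props.load(in);
            in.close();
            return !INDEX_VERSION.equals(props.getProperty("index.version"));
        }
    }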
>
> - Can we improve our search & index independently of Sword?
Troy and I worked together on the Lucene implementation in the Sword C++
API. It differs in that we use the SimpleAnalyzer (no stop words) and
they use the StandardAnalyzer (with stop words and a few other things).
Also, they have a field for Strong's numbers and we don't.
I've been working in the C++ code and I think I would be able to
contribute there too. So while the two may not be identical, it would be
good to add like functionality there too.
Ultimately, it would be good to be able to share indexes and perhaps
have them on the server so that people don't have to wait on an index
operation. Because we knew we would have to solve a versioning problem,
we didn't go there :)
>
> - In the long term, do we have any requirements & UI framework support
> to extend search to non-Bible books?
I don't know if you'd call it a requirement, but we would like to be
able to search all appropriate books simultaneously and present the
results, perhaps as a tree. Our framework for it is in
JSword, not in BibleDesktop. So, no, we don't have a UI framework at
this time.
It would also be good to add the ability to search each commentary,
daily devotion, dictionary, ... as well. Perhaps this would work like a
filter on the picker for that book.
>
> Best,
> Sijo