[jsword-devel] search in jsword
DM Smith
dmsmith555 at yahoo.com
Mon Oct 9 08:00:05 MST 2006
Sijo Cherian wrote:
> All,
>
> I started working on improving search by normalizing accented
> characters (diacritics) in the query.
Welcome Sijo! I am glad to have your contributions!
> Planning to use the Lucene analysis contrib package for multi-lingual
> support (http://lucene.apache.org/java/docs/api/index.html).
> I am starting with a new AnalyzerFactory (and resource prop) that
> supplies an appropriate instance of Analyzer based on the Bible's
> language. The same analyzer
> is used for indexing and for query parsing.
Yes, Lucene requires that the same analyzer be used for indexing and
searching.
And your implementation falls in with the JSword architecture!
In the resource prop, how about a default entry that is used if we have
a miss (either the language is not found or reflection fails) and that
supplies the current analyzer? That default analyzer must be one we can
"assert" will always work.
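To make the fallback concrete, here is a minimal sketch of what such a
factory might look like. The class name, the property keys, and the
"Default" entry are only assumptions for illustration, not the actual
JSword resource format:

    import java.util.Properties;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;

    /**
     * Sketch: look up an Analyzer class by language in a resource property
     * file and fall back to the current SimpleAnalyzer on any miss.
     */
    public final class AnalyzerFactory {
        private final Properties mapping;

        public AnalyzerFactory(Properties mapping) {
            this.mapping = mapping;
        }

        public Analyzer createAnalyzer(String language) {
            // "Default" is an assumed key naming the fallback analyzer class.
            String className = mapping.getProperty(language, mapping.getProperty("Default"));
            if (className != null) {
                try {
                    return (Analyzer) Class.forName(className).newInstance();
                } catch (Exception ex) {
                    // language not found or reflection failed: fall through
                }
            }
            // The analyzer JSword uses today; it must always work.
            return new SimpleAnalyzer();
        }
    }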
> Luckily our queries always have a Bible in context, so we have a
> language at query time.
>
> All corrections / suggestions are welcome and appreciated!
>
> -- Tokenization Analysis --
> - Current tokenization is based on SimpleAnalyzer (non-letter-based
> tokenization). It breaks for Chinese/Korean Bibles.
I'm sure it also breaks on Thai.
The value of SimpleAnalyzer over StandardAnalyzer is that it does not
throw away stop words, does not strip 's from the end of words, and does
not remove '.' from possible acronyms.
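A quick way to see the difference is to dump the tokens each analyzer
produces. A rough sketch against the Lucene 1.x/2.x API (the "content"
field name is just a placeholder):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    /** Prints the tokens an analyzer produces so two analyzers can be compared. */
    public class AnalyzerDump {
        static void dump(Analyzer analyzer, String text) throws Exception {
            TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
            for (Token t = stream.next(); t != null; t = stream.next()) {
                System.out.print("[" + t.termText() + "] ");
            }
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String verse = "In the beginning God created the heaven and the earth.";
            dump(new SimpleAnalyzer(), verse);   // keeps "in", "the", "and"
            dump(new StandardAnalyzer(), verse); // drops English stop words
        }
    }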
>
> -- StopWord Analysis --
> Question I asked myself: Is removing stop words useful for us?
From a theological perspective, stop words are frequently part of a
significant theological phrase, such as "in Him". We might make using
stop words a preference, but for those who use JSword for study, as
opposed to lookup, it is important.
When including stop words it is often good to do a prioritized search.
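One way to read "prioritized search" in Lucene terms is to keep the
stop-word clause but make it optional and down-boosted, so that verses
matching the whole phrase still rank first. A sketch only; the "content"
field name and the boost value are illustrative:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class PrioritizedSearchExample {
        /** Builds a query for "in him": the distinctive term is required,
         *  the stop word is optional and down-weighted. */
        static Query buildInHimQuery() {
            BooleanQuery query = new BooleanQuery();

            query.add(new TermQuery(new Term("content", "him")), BooleanClause.Occur.MUST);

            Query in = new TermQuery(new Term("content", "in"));
            in.setBoost(0.1f); // de-emphasize rather than discard the common word
            query.add(in, BooleanClause.Occur.SHOULD);

            return query;
        }
    }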
>
> Common words skew the results or return an overwhelming number of hits.
>
> Example Occurrences in the KJV of the following words:
> the 24100
> of 18200
> be 5500
> you 2000
> your 1300
> unto 7360
>
> Occurrences in the MKJV:
> you 8350
> your 4600
> shall 6400
>
> Both during indexing & query parsing, the stop words are removed. So
> queries containing stop-word terms will return fewer results,
> and hits are influenced by the other, more distinctive terms.
>
>
> -- Stemming Analysis --
> Basics:
> http://www.comp.lancs.ac.uk/computing/research/stemming/general/
>
> Is Stemming useful for us?
It is on our wish list. See www.crosswire.org/bugs and look under
JSword. You will see that you are on the right track!
(You might have to sign up for access to the "bugs" database.)
http://www.crosswire.org/bugs/browse/JS-18 Don't index accents
(actually, we may want to do both)
http://www.crosswire.org/bugs/browse/JS-19 Implement ICU4J (normalize
the UTF-8 representation; I'm not sure whether it should be NFD or something else)
http://www.crosswire.org/bugs/browse/JS-20 Add the ability to search
transliterations of Greek and Hebrew
http://www.crosswire.org/bugs/browse/JS-21 Add the ability to search
by word stems
>
> - In many Latin-script languages, it will be useful to treat accented &
> the corresponding unaccented characters as the same.
> - Along with stopword removal, it saves index space & search time.
> - KJV examples where stemming can benefit:
> Query:
> +sin +lord +sight
> Returns 15 results, but misses verses with 'sins'/'sinned', e.g.
> Deuteronomy 9:18, 2 Kings 24:3
While stemming is very useful, we already support wild card searching,
so +sin* will find sins and sinned.
What a wildcard does not find is "sung" when "sings" is searched, so
stemming is still useful.
We also support fuzzy searching. This proves marginally useful when
searching for words whose spelling is inconsistent (like place names) or
unknown. The problem with it is that it frequently produces surprising
results. So stemming is useful here too.
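For reference, a rough sketch of what the wildcard and fuzzy forms look
like through Lucene's QueryParser. The index path and the "content"
field name are placeholders, not JSword's actual layout:

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class WildcardFuzzyExample {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/kjv/index"); // placeholder
            QueryParser parser = new QueryParser("content", new SimpleAnalyzer());

            Query wildcard = parser.parse("+sin* +lord +sight"); // sin, sins, sinned, ...
            Query fuzzy = parser.parse("sidon~");                // fuzzy match for uncertain spellings

            Hits hits = searcher.search(wildcard);
            System.out.println(hits.length() + " wildcard hits");
            System.out.println(searcher.search(fuzzy).length() + " fuzzy hits");

            searcher.close();
        }
    }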
>
> Query:
> +harp +sing
> Returns 6 results. Misses harps/singers/singing as in 1 Kings 10:12,
> 1 Chr 13:8, 15:16, 2 Chr 5:12, 9:11, Nehemiah 12:27
>
> - If stemming is done by default, we can provide an exact-search
> operator, e.g. exact(singers).
> In that case we can do a post-retrieval filter, or index both stemmed
> & unstemmed content (double the index space needed).
>
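If the double-index route were taken, Lucene's PerFieldAnalyzerWrapper
would let the main field be stemmed while a second field stays unstemmed
for an exact() operator. A rough sketch; the field names ("content",
"content_exact") and the index path are hypothetical:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class DualFieldIndexExample {
        public static void main(String[] args) throws Exception {
            // Stem the main field, leave the "exact" field untouched.
            PerFieldAnalyzerWrapper analyzer =
                new PerFieldAnalyzerWrapper(new SnowballAnalyzer("English"));
            analyzer.addAnalyzer("content_exact", new SimpleAnalyzer());

            IndexWriter writer = new IndexWriter("/tmp/kjv-index", analyzer, true); // placeholder

            String verseText = "sample verse text";
            Document doc = new Document();
            doc.add(new Field("content", verseText, Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("content_exact", verseText, Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);

            writer.close();
        }
    }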
> -- Available functionality in Lucene Analyzer jar --
>
> 1. Both Stopwords & Stemming
> Snowball based:
> English (Porter or Lovins)
> German
> French
> Dutch
> Russian
>
> Note: Another German stemming option (based on Joerg Caumanns' paper):
> org.apache.lucene.analysis.de.GermanAnalyzer
> Manfred, you may be interested to look at the stopwords & stemming in
> it. I am guessing that the Snowball implementation
> is preferable.
>
> 2. Stopwords only, no Stemming (from the Lucene analysis contrib package)
> Czech
> Greek (I do not know whether it applies to Modern or Ancient Greek Bibles)
>
> 3. Only Stemming [Volunteers can contribute stopwords]
> Snowball based:
> Spanish
> Portuguese
> Italian
> Swedish
> Norwegian
> Danish
> Finnish
>
> 4. Only tokenization (Lucene analysis contrib package)
> ChineseAnalyzer (single-character tokenization): Better suited than
> CJKAnalyzer (which produces overlapping two-character Chinese tokens)
>
Great analysis! Thanks!
If I recall correctly, it is possible to supply a stopword list to most
analyzers, overriding the one that is built in, which would allow us to
turn off stopwords if so desired.
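If I have the contrib constructors right, the override looks roughly
like this; an empty array effectively disables stop-word removal while
keeping the rest of the analysis (stemming, etc.):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class StopWordOverrideExample {
        static Analyzer englishNoStopWords() {
            // Snowball stemming kept, stop-word removal disabled.
            return new SnowballAnalyzer("English", new String[0]);
        }

        static Analyzer standardNoStopWords() {
            // StandardAnalyzer also accepts a replacement stop-word list.
            return new StandardAnalyzer(new String[0]);
        }
    }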
> -- Questions I have now --
>
> - Changing analyzers would require current users to re-index every Bible
> that they have already indexed. How do we manage that? Should we have
> a version framework for indexes, so that BD can force a re-index when a
> mismatch is found?
Yes, we should have a version framework. I don't think we should force a
re-index but rather offer it. The indexing operation is so compute
intensive that it is good to let people with underpowered hardware
choose a time that is appropriate for them.
Perhaps we should have a resource file for each built index which
records the version info and other metadata about the indexing. Some
things I can think of: the analyzer used, and whether stop words were
used or not. Also the version of Lucene: from one version of Lucene to
the next, the analyzers can change, necessitating a re-index.
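As a sketch of what such a resource file could record (the file name,
keys, and values below are illustrative only, not a proposed format):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.Properties;

    /** Hypothetical per-index metadata file; key names are illustrative. */
    public final class IndexMetadata {
        static final String FILE_NAME = "index.properties";
        static final String INDEX_VERSION = "1";

        static void write(File indexDir, String analyzerClass, boolean stopWords) throws IOException {
            Properties props = new Properties();
            props.setProperty("index.version", INDEX_VERSION);
            props.setProperty("analyzer", analyzerClass);
            props.setProperty("stopwords", String.valueOf(stopWords));
            props.setProperty("lucene.version", "2.0.0"); // whatever version built the index
            FileOutputStream out = new FileOutputStream(new File(indexDir, FILE_NAME));
            props.store(out, "index metadata (sketch)");
            out.close();
        }

        /** True when a re-index should be offered (not forced). */
        static boolean reindexNeeded(File indexDir) throws IOException {
            File file = new File(indexDir, FILE_NAME);
            if (!file.exists()) {
                return true; // index predates the version framework
            }
            Properties props = new Properties();
            FileInputStream in = new FileInputStream(file);
            props.load(in);
            in.close();
            return !INDEX_VERSION.equals(props.getProperty("index.version"));
        }
    }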
>
> - Can we improve our search & index independently of Sword?
Troy and I worked together on the Lucene implementation in the Sword C++
API. It differs in that we use the SimpleAnalyzer (no stop words) and
they use the StandardAnalyzer (with stop words and a few other things).
Also, they have a field for Strong's numbers and we don't.
I've been working in the C++ code and I think I would be able to
contribute there too. So while the two may not be identical, it would be
good to add like functionality there too.
Ultimately, it would be good to be able to share indexes and perhaps
have them on the server so that people don't have to wait on an index
operation. Because we knew we would have to solve a versioning problem,
we didn't go there :)
>
> - In the long term, do we have any requirements & UI framework support
> to extend search to non-Bible books?
I don't know if you'd call it a requirement, but we would like to be
able to search all appropriate books simultaneously and present the
results, perhaps as a tree. Our framework for it is in
JSword, not in BibleDesktop. So, no, we don't have a UI framework at
this time.
It would also be good to add the ability to search each commentary,
daily devotion, dictionary, ... as well. Perhaps this would work like a
filter on the picker for that book.
>
> Best,
> Sijo