[jsword-devel] Lucene index help

Sijo Cherian sijo.cherian at gmail.com
Thu Nov 4 12:42:13 MST 2010


Thanks DM.
The multi-variable dependency description of the index was helpful.

On Thu, Nov 4, 2010 at 3:20 PM, DM Smith <dmsmith at crosswire.org> wrote:

>  I thought I'd outline the variables that go into building an index that
> can be reliably searched.
>
> The unit of an index is called a Document. In our case, this is a verse, a
> dictionary entry, ..., anything that SWORD identifies as a keyed piece of
> information.
>
> A document consists of one or more fields having content. We have several:
> content, note, strong, ..., with content being the default. One specifies a
> field search like this:
>     strong:H3068
> If no field is specified, the default is searched.
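>
> A minimal sketch of a Document with fields, in Lucene 3.x terms (the verse
> text and Strong's numbers are merely illustrative; WhitespaceAnalyzer is
> used so tokens like H3068 survive analysis intact):
>
>     import org.apache.lucene.analysis.WhitespaceAnalyzer;
>     import org.apache.lucene.document.Document;
>     import org.apache.lucene.document.Field;
>     import org.apache.lucene.queryParser.ParseException;
>     import org.apache.lucene.queryParser.QueryParser;
>     import org.apache.lucene.search.Query;
>     import org.apache.lucene.util.Version;
>
>     // One Document per keyed item, with separately analyzed fields.
>     Document doc = new Document();
>     doc.add(new Field("content",
>             "And the LORD God formed man of the dust of the ground",
>             Field.Store.NO, Field.Index.ANALYZED));
>     doc.add(new Field("strong", "H3068 H430 H3335 H120",
>             Field.Store.NO, Field.Index.ANALYZED));
>
>     // "content" is the default field, so a bare term searches it, while
>     // "strong:H3068" targets the strong field (parse throws ParseException).
>     QueryParser parser = new QueryParser(Version.LUCENE_30, "content",
>             new WhitespaceAnalyzer());
>     Query byStrongs = parser.parse("strong:H3068");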
>
> It is possible to have fields that are not searchable but merely store
> information. We do this with the key for the item. That way, we know what we
> have gotten back from a search.
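>
> Continuing the sketch above, the key would be stored but not indexed:
>
>     // Retrievable from a search hit, but not itself searchable.
>     doc.add(new Field("key", "Gen.2.7", Field.Store.YES, Field.Index.NO));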
>
> Each field is built independently of the others. Each has its own
> analyzer. Though two fields might use the same analyzer, it is best to
> ignore that. An analyzer consists of a tokenizer, which splits up text into
> a stream of tokens, and filters applied to that stream.
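>
> In code, that structure looks roughly like this minimal sketch against the
> Lucene 2.9/3.0-era API (class names from that era):
>
>     import java.io.Reader;
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.LowerCaseFilter;
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.WhitespaceTokenizer;
>
>     public class SketchAnalyzer extends Analyzer {
>         public TokenStream tokenStream(String fieldName, Reader reader) {
>             // Tokenizer: splits the text into a stream of tokens.
>             TokenStream stream = new WhitespaceTokenizer(reader);
>             // Filter(s): each wraps the stream and transforms its tokens.
>             stream = new LowerCaseFilter(stream);
>             return stream;
>         }
>     }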
>
> Generally we think of tokens as words, but it is a bit more complicated
> than that. The StandardAnalyzer also tries to identify phone numbers,
> acronyms, URLs, and other common English constructs. It uses white space
> as a word delimiter and punctuation as a hint to word boundaries. SWORD
> uses the StandardAnalyzer, but JSword does not; it uses a variation of the
> SimpleAnalyzer, which uses white space and punctuation as boundaries.
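>
> One way to see the difference between analyzers is to dump the token
> stream each produces. A hedged sketch against the 2.9/3.0-era attribute
> API:
>
>     import java.io.StringReader;
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.tokenattributes.TermAttribute;
>
>     // Prints one token per line as the given analyzer sees the text.
>     static void dumpTokens(Analyzer analyzer, String text) throws Exception {
>         TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
>         TermAttribute term = ts.addAttribute(TermAttribute.class);
>         while (ts.incrementToken()) {
>             System.out.println(term.term());
>         }
>         ts.close();
>     }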
>
> One of the complexities is that not all languages use the same rules for
> word boundaries. For example, Chinese uses one glyph per word, and spaces
> are not used to separate words. Thai does not use white space, so it looks
> as if all the letters run together. Ancient Greek manuscripts also ran
> words together, written entirely in capital letters. These need to be
> treated specially. The typical approach with Chinese is to do bi-gram
> searching. That is, if looking for a single glyph, look for just that, but
> if looking for multiple glyphs, break them up into pairs and look for
> those pairs. For Thai, the typical approach is to use a dictionary and
> Unicode tables to break up the string of characters into words.
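>
> A toy illustration of the bi-gram idea in plain Java (not JSword's actual
> code; it assumes each glyph is a single BMP char):
>
>     import java.util.ArrayList;
>     import java.util.List;
>
>     static List<String> bigrams(String glyphs) {
>         List<String> terms = new ArrayList<String>();
>         if (glyphs.length() <= 1) {
>             terms.add(glyphs);  // a single glyph is searched as-is
>         } else {
>             for (int i = 0; i + 1 < glyphs.length(); i++) {
>                 terms.add(glyphs.substring(i, i + 2));  // overlapping pairs
>             }
>         }
>         return terms;
>     }
>
> Lucene's contrib CJKAnalyzer takes essentially this overlapping-pair
> approach.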
>
> Regarding Unicode, it has progressed over the years, with several versions
> released, and each version of Java implements a particular version of
> Unicode.
> Quoting the JLS: (
> http://java.sun.com/docs/books/jls/third_edition/html/lexical.html)
>
> Versions of the Java programming language prior to 1.1 used Unicode version
> 1.1.5. Upgrades to newer versions of the Unicode Standard occurred in JDK
> 1.1 (to Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), J2SE 1.4 (to Unicode 3.0),
> and J2SE 5.0 (to Unicode 4.0).
>
> From what I understand, Java 6 is also Unicode 4.0, and Java 7 is planned
> to be Unicode 5.1.
>
> It turns out that IBM's Java does not have the same implementation as
> Sun's (now Oracle's). I don't know about Android's Java, which seems to be
> a variant of Harmony.
>
> On a practical level, this means that if there is a reliance on a specific
> version and implementation of Unicode, then one needs to stick with that.
> That is, an index built with Java 1.4 may not be the same as one built
> with Java 5. And one built with Java 5 might not be the same as one built
> with Harmony/Android.
>
> This is especially true with Thai. IBM's Java does not have a decent break
> iterator, while Sun's does. Who knows about Harmony? One way to get around
> this is to use ICU4J to do the analysis. Another is a UAX29Tokenizer,
> which tokenizes based on the Unicode tables (5.0, if I remember
> correctly). With Lucene 3.0 and later this is possible, and the code might
> be able to be back-ported to the 2.9.x series.
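>
> For comparison, the JDK route is the break iterator mentioned above. A
> hedged sketch (its quality for Thai depends on whether the implementation
> ships a dictionary-based iterator, which is exactly where Sun, IBM, and
> Harmony differ):
>
>     import java.text.BreakIterator;
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.Locale;
>
>     // Splits text into words using the locale's word-boundary rules.
>     static List<String> breakWords(String text, Locale locale) {
>         List<String> words = new ArrayList<String>();
>         BreakIterator bounds = BreakIterator.getWordInstance(locale);
>         bounds.setText(text);
>         int start = bounds.first();
>         for (int end = bounds.next(); end != BreakIterator.DONE;
>                 start = end, end = bounds.next()) {
>             String word = text.substring(start, end).trim();
>             // Skip the whitespace/punctuation segments between words.
>             if (word.length() > 0
>                     && Character.isLetterOrDigit(word.codePointAt(0))) {
>                 words.add(word);
>             }
>         }
>         return words;
>     }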
>
> Once tokens are identified, they are filtered. There are various filters
> that might be used (note: I'm doing this from memory, so the names may be
> a bit off, but they give the idea):
> LowercaseFilter - lower-cases the letters. Note, some languages don't have
> both upper and lower case.
> FoldingFilter - removes accents, diacriticals, pointing and the like.
> There are some language-specific foldings, such as final sigma to sigma.
> CompoundWordFilter - splits compound words into parts. Very useful for
> German. Typically this is dictionary based.
> StopWordFilter - removes noise words such as a, the, in, ..., as provided
> in a list. From a theological perspective, "In Christ" might be an
> important find.
> *StemmingFilter - converts words into their stem/root form. Note, this is
> language specific, thus the *. Generally this is rule based, but sometimes
> it is dictionary driven.
> Note that the StandardAnalyzer filters stop words, but does no folding or
> stemming.
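>
> Putting such a chain together, a hedged sketch from memory of the
> 2.9/3.0-era classes (constructor details varied between releases, which is
> itself part of the problem described below):
>
>     import java.io.Reader;
>     import org.apache.lucene.analysis.ASCIIFoldingFilter;
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.analysis.LowerCaseFilter;
>     import org.apache.lucene.analysis.PorterStemFilter;
>     import org.apache.lucene.analysis.StopAnalyzer;
>     import org.apache.lucene.analysis.StopFilter;
>     import org.apache.lucene.analysis.TokenStream;
>     import org.apache.lucene.analysis.WhitespaceTokenizer;
>
>     public class EnglishChainAnalyzer extends Analyzer {
>         public TokenStream tokenStream(String fieldName, Reader reader) {
>             TokenStream ts = new WhitespaceTokenizer(reader);
>             ts = new LowerCaseFilter(ts);      // lower-casing
>             ts = new ASCIIFoldingFilter(ts);   // accent/diacritic folding
>             ts = new StopFilter(true, ts,      // stop words, positions kept
>                     StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>             ts = new PorterStemFilter(ts);     // English stemming
>             return ts;
>         }
>     }
>
> Note the order: lower-casing must come before the stop filter, whose word
> list is lower case.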
>
> Unicode normalization (typically done with ICU) generally needs to precede
> filtering. There are two basic forms: composed (NFC) and decomposed (NFD).
> For each of these there is a compatibility 'K' variant, NFKC and NFKD.
> Simplistically, with decomposed, the base character is followed by its
> decorations (e.g. accents). In composed form, they are combined into one.
> Our modules, in order to conserve space, typically do NFC.
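>
> A minimal sketch with the JDK's normalizer (Java 6's java.text.Normalizer;
> ICU4J's com.ibm.icu.text.Normalizer offers the same forms). Here
> rawModuleText is just an illustrative variable name:
>
>     import java.text.Normalizer;
>
>     String nfc = Normalizer.normalize(rawModuleText, Normalizer.Form.NFC);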
>
> Some things to note:
> The order of filters may be important.
> The tables that the filters use are important.
>
> Each release of Lucene has varied one or more of these things. Typically,
> this is due to a bug fix.
>
> The goal is that a token stream for a field be the same for indexing as for
> searching. If they differ, results are unpredictable.
> So to put this together, if any of the following changes, all bets are off:
>     Tokenizer (i.e. which tokenizer is used)
>     The rules that a tokenizer uses to break into tokens.
>     The type associated with each token (e.g. word, number, URL, ...; we
> ignore this, so it doesn't matter)
>     Presence/Absence of a particular filter
>     Order of filters
>     Tables that a filter uses
>     Rules that a filter encodes
>     The version and implementation of Unicode being used (whether via ICU,
> Lucene and/or Java)
>     The version of Lucene (e.g. every version of Lucene has fixed bugs in
> these components.)
> And if we did highlighting:
>     The relative position of each token.
>
> What is planned is to record this along with the index in a manifest. If
> the manifest changes for a field that is exposed through the app, then the
> index should be rebuilt. It may be that JSword will be able to adapt to a
> particular manifest. For example, it is under the programmer's control
> whether stemming and/or stop filters are applied. (Right now, defaults are
> assumed.)
> If an index is built with stemming but not stop words, then the searching
> analyzer can be built to match. It won't matter that the list of stop words
> has changed.
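>
> A purely hypothetical sketch of such a manifest (no format existed at the
> time of writing; all keys and values here are invented for illustration):
>
>     import java.io.FileOutputStream;
>     import java.util.Properties;
>
>     // Record, per field, what went into the index. The search side can
>     // read this back and construct a matching analyzer.
>     Properties manifest = new Properties();
>     manifest.setProperty("lucene.version", "3.0.2");
>     manifest.setProperty("unicode.version", "4.0");
>     manifest.setProperty("content.tokenizer", "WhitespaceTokenizer");
>     manifest.setProperty("content.filters", "LowerCase,PorterStem");
>     // Stored next to the index files (IOException handling omitted).
>     manifest.store(new FileOutputStream("analyzer.properties"),
>                    "index analyzer manifest");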
>
> Having indexes on a server means that we need to keep versions around and
> be able to request them. We need to support people who don't upgrade their
> app to match the latest index.
>
> Hope this is helpful.
>
> In Him,
>     DM
>
>
>
> On 11/04/2010 11:26 AM, Martin Denham wrote:
>
> Does anybody know any reason why a search for 'blessed' does not return
> any search results in ESV but searching for 'bless' works perfectly?
>
> When I download BibleDesktop (JSword) generated indexes to And Bible, I
> have noticed that some searches like 'blessed' stop working, but I can't
> figure out what the problem is and would appreciate some pointers as to
> areas to look at.
>
> I have checked that the correct Analyzer is being used, but I am not sure
> what else to check or whether the 'blessed'/'bless' issue might point to a
> specific problem area.
>
> The plan is to download pre-created indexes to And Bible, and in theory
> those indexes should be generated by JSword, but currently And Bible can
> only use indexes it creates itself or ones that have been created by
> CLucene/SWORD.
>
> All advice, opinions, and comments are appreciated.
>
>  Many thanks
> Martin
>
>
>
>
>
> _______________________________________________
> jsword-devel mailing list
> jsword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/jsword-devel
>
>


-- 
Regards,
Sijo

