[jsword-devel] Lucene index help

DM Smith dmsmith at crosswire.org
Thu Nov 4 12:20:22 MST 2010


I thought I'd outline the variables that go into building an index that 
can be reliably searched.

The unit of an index is called a Document. In our case, this is a verse, 
a dictionary entry, ..., anything that SWORD identifies as a keyed piece 
of information.

A document consists of one or more fields, each holding content. We have 
several: content, note, strong, ..., with content being the default. A 
search is restricted to a particular field by prefixing it with the 
field name, as in:
     strong:H3068
If no field is specified, the default is searched.

It is possible to have fields that are not searchable but merely store 
information. We do this with the key for the item; that way, we know 
what a search result refers to.
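
To make this concrete, here is a minimal sketch against the Lucene 2.9 
API (the verse text, Strong's numbers, and key are made up; the field 
names mirror the ones above):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // Searchable fields: analyzed and indexed, but not stored.
    doc.add(new Field("content", "In the beginning God created ...",
                      Field.Store.NO, Field.Index.ANALYZED));
    doc.add(new Field("strong", "H7225 H1254 H430",
                      Field.Store.NO, Field.Index.ANALYZED));
    // The key is stored but not indexed: it is not searchable, it
    // merely tells us which verse a search hit refers to.
    doc.add(new Field("key", "Gen.1.1",
                      Field.Store.YES, Field.Index.NO));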

Each field is built independently of the others, and each has its own 
analyzer. Though two fields might share the same analyzer, it is best 
to treat them as separate. An analyzer consists of a tokenizer, which 
splits text up into a stream of tokens, and filters that are applied to 
that stream.
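
In Lucene 2.9 terms, an analyzer is little more than that composition. 
A minimal sketch (the class name is mine, and the single filter is 
illustrative, not JSword's actual chain):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class SketchAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // The tokenizer splits the text into a stream of tokens ...
            TokenStream stream = new StandardTokenizer(Version.LUCENE_29, reader);
            // ... and each filter transforms that stream in turn.
            stream = new LowerCaseFilter(stream);
            return stream;
        }
    }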

Generally we think of tokens as words, but it is a bit more complicated 
than that. The StandardAnalyzer also tries to identify phone numbers, 
acronyms, URLs, and other common English constructs. It uses whitespace 
as a word delimiter and punctuation as a hint to word boundaries. SWORD 
uses the StandardAnalyzer, but JSword does not: it uses a variation of 
the SimpleAnalyzer, which treats whitespace and punctuation as 
boundaries.

One of the complexities is that not all languages use the same rules 
for word boundaries. For example, Chinese writes each word as one or 
more glyphs and does not use spaces to separate words. Thai does not 
use whitespace either, so all the letters appear to run together. 
Ancient Greek also ran words together, written entirely in capital 
letters. These need to be treated specially. The typical approach with 
Chinese is bi-gram searching: if looking for a single glyph, look for 
just that; if looking for multiple glyphs, break them into overlapping 
pairs and look for those pairs. For Thai, the typical approach is to 
use a dictionary and the Unicode tables to break the string of 
characters into words.
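
A minimal sketch of just the bi-gram splitting in plain Java (Lucene's 
contrib CJKAnalyzer works along these lines; surrogate pairs are 
ignored for brevity):

    import java.util.ArrayList;
    import java.util.List;

    // A query of N glyphs becomes N-1 overlapping pairs.
    static List<String> bigrams(String glyphs) {
        List<String> grams = new ArrayList<String>();
        if (glyphs.length() == 1) {
            grams.add(glyphs);  // a single glyph is searched for as-is
        } else {
            for (int i = 0; i + 1 < glyphs.length(); i++) {
                grams.add(glyphs.substring(i, i + 2));
            }
        }
        return grams;
    }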

Regarding Unicode: it has progressed over the years, with several 
versions released, and each version of Java implements a particular 
version of Unicode. Quoting the JLS:
(http://java.sun.com/docs/books/jls/third_edition/html/lexical.html)
> Versions of the Java programming language prior to 1.1 used Unicode 
> version 1.1.5. Upgrades to newer versions of the Unicode Standard 
> occurred in JDK 1.1 (to Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), J2SE 
> 1.4 (to Unicode 3.0), and J2SE 5.0 (to Unicode 4.0).
From what I understand, Java 6 also implements Unicode 4.0, and Java 7 
is planned to move to Unicode 5.1.

It turns out that IBM's Java does not have the same implementation as 
Sun's (now Oracle's). I don't know about Android's Java, which seems to 
be a variant of Apache Harmony.

On a practical level, this means that if an index relies on a specific 
version and implementation of Unicode, then one needs to stick with it. 
That is, an index built with Java 1.4 may not be the same as one built 
with Java 5, and one built with Java 5 might not be the same as one 
built with Harmony/Android.

This is especially true for Thai. IBM's Java does not have a decent 
break iterator, while Sun's does; who knows about Harmony? One way 
around this is to use ICU4J for the analysis. Another is a UAX#29 
tokenizer, which tokenizes based on the Unicode tables (version 5.0, if 
I remember correctly). This is possible with Lucene 3.0 and later, and 
the code might be back-portable to the 2.9.x series.
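
For Thai, the ICU4J route might look like this sketch (BreakIterator 
and ULocale here are ICU4J classes; thaiText stands in for any run of 
Thai characters without spaces):

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    // ICU4J's word BreakIterator for Thai is dictionary driven,
    // unlike the JDK's, whose quality varies by vendor.
    String thaiText = "...";  // a run of Thai characters, no spaces
    BreakIterator words = BreakIterator.getWordInstance(new ULocale("th"));
    words.setText(thaiText);
    int start = words.first();
    for (int end = words.next(); end != BreakIterator.DONE;
         start = end, end = words.next()) {
        // each substring is one Thai word, ready for filtering
        System.out.println(thaiText.substring(start, end));
    }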

Once tokens are identified, they are filtered. There are various 
filters that might be used (I'm doing this from memory, so the names 
may be a bit off, but they give the idea):
LowerCaseFilter - lower-cases the letters. Note that some languages 
don't have both upper and lower case.
FoldingFilter - removes accents, diacriticals, pointing, and the like. 
There are also some language-specific foldings, such as final sigma to 
sigma.
CompoundWordFilter - splits compound words into their parts. Very 
useful for German. Typically this is dictionary based.
StopWordFilter - removes noise words such as a, the, in, ..., as 
provided in a list. Bear in mind that from a theological perspective, 
"In Christ" might be an important find.
*StemmingFilter - converts words into their stem/root form. This is 
language specific, hence the *. Generally it is rule based, but 
sometimes dictionary driven.
Note that the StandardAnalyzer filters stop words but does no folding 
or stemming.
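
For reference, the actual Lucene 2.9 class names are LowerCaseFilter, 
ASCIIFoldingFilter, StopFilter, and (for English) PorterStemFilter. An 
illustrative chain, with the caveat that the exact constructors have 
shifted between Lucene releases:

    import org.apache.lucene.analysis.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;

    TokenStream stream = tokenizer;              // whichever tokenizer was chosen
    stream = new LowerCaseFilter(stream);        // lower-case the letters
    stream = new ASCIIFoldingFilter(stream);     // fold accents/diacriticals
    stream = new StopFilter(true, stream,        // drop listed noise words
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    stream = new PorterStemFilter(stream);       // English rule-based stemming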

Unicode normalization (e.g. via ICU) generally needs to precede 
filtering. There are two basic forms: composed (NFC) and decomposed 
(NFD), and each has a compatibility 'K' variant, NFKC and NFKD. 
Simplistically, in decomposed form the base character is followed by 
its decorations (e.g. accents), while in composed form they are 
combined into one character. Our modules typically use NFC to conserve 
space.
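
A quick demonstration with java.text.Normalizer, which arrived in Java 
6 (ICU4J provides the same forms on older JVMs):

    import java.text.Normalizer;

    String composed   = "\u00E9";                   // é as one code point (NFC)
    String decomposed = Normalizer.normalize(
        composed, Normalizer.Form.NFD);             // e + combining accent: two code points
    String recomposed = Normalizer.normalize(
        decomposed, Normalizer.Form.NFC);           // back to the single composed é
    // Tokens must be normalized the same way at index time and at
    // search time, or the character sequences will not match.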

Some things to note:
The order of filters may be important.
The tables that the filters use are important.

Each release of Lucene has varied one or more of these things. 
Typically, this is due to a bug fix.

The goal is that the token stream for a field be the same at indexing 
time as at search time. If they differ, results are unpredictable; a 
concrete illustration follows the list below.
So, to put this together, if any of the following changes, all bets are off:
     The tokenizer (i.e. which tokenizer is used)
     The rules that the tokenizer uses to break text into tokens
     The type associated with each token (e.g. word, number, URL, ...; 
we ignore this, so it doesn't matter)
     Presence/absence of a particular filter
     The order of the filters
     The tables that a filter uses
     The rules that a filter encodes
     The version and implementation of Unicode being used (whether via 
ICU, Lucene, and/or Java)
     The version of Lucene (every release of Lucene has fixed bugs in 
these components)
And if we did highlighting:
     The relative position of each token
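
As that concrete illustration, and quite possibly what Martin describes 
below: suppose an index was built with a stemming filter, but the 
search-time analyzer omits it. Then the two token streams disagree:

    index time,  with stemming:    "blessed" -> token "bless"   (what the index contains)
    search time, without stemming: "blessed" -> token "blessed" (no such token: zero hits)
                                   "bless"   -> token "bless"   (matches: works perfectly)

The index is not corrupt; the analyzers simply no longer agree.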

What is planned is to record all of this in a manifest alongside the 
index. If the manifest changes for a field that is exposed through the 
app, then the index should be rebuilt. It may be that JSword will be 
able to adapt to a particular manifest. For example, it is under the 
programmer's control whether stemming and/or stop filters are applied 
(right now, defaults are assumed). If an index is built with stemming 
but without stop words, then the searching analyzer can be built to 
match, and it won't matter that the list of stop words has changed.
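
Nothing like this exists yet, but as a sketch of the idea, the manifest 
could be as simple as a java.util.Properties file recording the 
variables above (every key and value here is hypothetical):

    import java.util.Properties;

    Properties manifest = new Properties();
    manifest.setProperty("content.tokenizer", "StandardTokenizer");
    manifest.setProperty("content.filters",
                         "LowerCaseFilter,StopFilter,PorterStemFilter");
    manifest.setProperty("content.stopwords", "true");
    manifest.setProperty("content.stemming", "true");
    manifest.setProperty("lucene.version", "2.9.1");
    manifest.setProperty("unicode.version", "4.0");
    // At search time, compare the manifest against what the app
    // expects; if an exposed field's entry differs, rebuild the
    // index or adapt the query-time analyzer to match.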

Keeping indexes on a server means that we need to retain older versions 
and be able to request a specific one: we need to support people who 
don't upgrade their app to match the latest index.

Hope this is helpful.

In Him,
     DM


On 11/04/2010 11:26 AM, Martin Denham wrote:
> Does anybody know any reason why a search for 'blessed' does not 
> return any search results in the ESV, but searching for 'bless' works perfectly?
>
> When I download BibleDesktop (JSword) generated indexes to And Bible, 
> I have noticed that some searches like 'blessed' stop working, but I 
> can't figure out what the problem is and would appreciate some 
> pointers as to where to look.
>
> I have checked that the correct Analyzer is being used but I am not 
> sure what else to check or if the 'blessed'/'bless' issue might point 
> to a specific problem area.
>
> The plan is to download pre-created indexes to And Bible, and in theory 
> those indexes should be generated by JSword, but currently And Bible 
> can only use indexes it creates itself or ones that have been created 
> by CLucene/SWORD.
>
> All advice, opinions, and comments are appreciated.
>
> Many thanks
> Martin


