[jsword-devel] Lucene 2.9 and JSword

Mon Nov 23 10:43:35 MST 2009

On 11/23/2009 11:28 AM, Tonny Kohar wrote:
> Hi,
>
> As I noticed JSword is now using Apache Lucene 2.9,
Yes and no. I've upgraded trunk to use Lucene 2.9.0, but there's a
serious bug in it and we need to update to 2.9.1. (That should be just a
jar replacement.)

We still have a bit more work before we can release based on it. 
Specifically, I'm still working on the changes to the Lucene analyzers 
to bring them up to speed with all the changes to 2.9.

One of the problems with 2.9 is that it will invalidate nearly all prior
indexes. If I'm not mistaken, only simple ASCII text is unaffected. As a
result, we need to put a mechanism in place to handle this. There is a
rudimentary mechanism, but it only handles version-specific features and
does not store any version information along with the index.

Basically we need to create a manifest for each index outlining each 
contributing component and its compatibility version. If from one 
release to the next any component's version is different, then the index 
needs to be invalidated. It may be that searches will work for the most 
part, but they won't be as accurate. Some of the things that might need 
to be included in the manifest:
- The VM's level of Unicode support. Just found out that moving from
  Java 1.4.x to Java 5 may cause problems. Some of the properties of
  characters changed and this affects tokenization.

- Lucene's tokenizer and/or analyzer in use for the index. In most
  cases we use SimpleAnalyzer and this is still OK. However, some use
  StandardAnalyzer and this has subtle changes. The StandardAnalyzer is
  really not appropriate for JSword, so I am changing these to
  SimpleAnalyzer.

- Some indexes can be created with stopword lists. If the list changes,
  then the index can be bad.

And so forth.

So for the next release, I'm thinking of adding version info to newly
created indexes and treating any index that lacks version info as
invalid.

"Invalid" will be a flag which the frontend can ignore.
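
To make the manifest idea concrete, here is a minimal sketch in plain
Java. It is not JSword code: the class name, the method names and the
property keys are all made up for illustration, and using the VM's
"java.specification.version" as a stand-in for its level of Unicode
support is an assumption. The manifest is a java.util.Properties object,
which could be written next to the index with Properties.store and read
back with Properties.load.

```java
import java.util.Properties;

/** Hypothetical sketch of an index manifest; not part of JSword. */
class IndexManifestSketch {

    /** Describes the components that would contribute to an index built now. */
    static Properties describe(String luceneVersion, String analyzer, String stopwordListVersion) {
        Properties manifest = new Properties();
        // Assumption: the Java spec version is a usable proxy for Unicode support.
        manifest.setProperty("java.specification.version",
                System.getProperty("java.specification.version"));
        manifest.setProperty("lucene.version", luceneVersion);
        manifest.setProperty("analyzer", analyzer);
        manifest.setProperty("stopwords.version", stopwordListVersion);
        return manifest;
    }

    /**
     * An index is treated as valid only if it has a manifest and every
     * component recorded for the current build matches the stored value.
     */
    static boolean isCompatible(Properties stored, Properties current) {
        if (stored == null || stored.isEmpty()) {
            return false; // no version info: treat as invalid
        }
        for (String key : current.stringPropertyNames()) {
            if (!current.getProperty(key).equals(stored.getProperty(key))) {
                return false; // a contributing component changed
            }
        }
        return true;
    }
}
```

The result of isCompatible is only a flag, so a frontend would remain
free to ignore it and keep searching an old index.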

>   and I was reading
> an interesting article regarding Lucene 2.9, especially the "Term
> Vector based highlighter", as follows:
>
> "Term vector-based highlighter: a new term highlighter implementation
> based on term vectors (essentially a view of terms, offsets, and
> positions in a document's field). It supports features like N-Gram
> fields and phrase-unit highlighting with slops and yields good
> performance on large documents. The downside is that it requires a lot
> more disk space due to stored term vectors."
>    
Currently, AFAICT, we don't store term vectors. Adding them would also
invalidate existing indexes. I don't have an issue with offering them as
an option to the user. For most users, even those with limited space,
the extra disk usage is not an issue because they only index a few
Bibles.

> Then I was wondering whether this new feature from Lucene 2.9 can be
> used to provide JSword search highlighting?
>    

Sounds good to me.

> The reason I ask is that I do not know much about Lucene 2.9, and it
> seems easy enough (correct me if I am wrong; the hard work has been
> done by Lucene itself): just add the word/term offsets to the index,
> retrieve them during search, and apply the highlight to the output
> html/xml.
>    
> The questions are:
> - is my assumption correct?
>    
I don't know, but hope so.
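
For what it's worth, the last step described in the question (applying
the highlight once the offsets are known) can be sketched in plain Java.
This is only an illustration and not JSword or Lucene code: the class and
method names are hypothetical, and it assumes the search has already
produced sorted, non-overlapping [start, end) char offsets for the
matched terms. How those offsets would be read back from Lucene is not
shown here.

```java
/** Hypothetical sketch: wraps already-known match offsets in markup. */
class OffsetHighlightSketch {

    /**
     * @param text  the verse text as a Java String
     * @param spans sorted, non-overlapping {start, end} char offsets
     *              (start inclusive, end exclusive)
     * @param open  markup to put before each match, e.g. "<b>"
     * @param close markup to put after each match, e.g. "</b>"
     */
    static String highlight(String text, int[][] spans, String open, String close) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int[] span : spans) {
            out.append(text, pos, span[0]);      // text before the match
            out.append(open)
               .append(text, span[0], span[1])   // the match itself
               .append(close);
            pos = span[1];
        }
        out.append(text, pos, text.length());    // remaining tail
        return out.toString();
    }
}
```

The sketch does no HTML/XML escaping, so real output code would need to
escape the text pieces before adding markup.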

> - can it be used for languages other than English?
>    
All languages, but it may require a correct tokenizer/analyzer pair.
Each release of Lucene improves this, and that is what I'm trying to
bring into JSword now.

> - does UTF-8 (or the text encoding used by crosswire module) allow
> offset/byte counting ?
>    
The encoding of a CrossWire module is either cp1252 (a MS variant of
Latin-1) or UTF-8. Once JSword reads it into a String, it is UTF-16.
This is the "char" of Java. So one would not do byte counting but char
counting. The only gotcha that I am aware of is the problem of surrogate
pairs, where two chars need to be treated as one character. (That's
about my full knowledge of surrogate pairs :)

The Lucene folks are very conscious of the surrogate pair issue, and the
offsets and lengths that they give account for them.
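
A small plain-Java illustration of the char counting point, using only
standard String and Character methods. The class and method names are
made up for the example. The sample string contains U+10330 (GOTHIC
LETTER AHSA), which lies outside the Basic Multilingual Plane and so
takes two chars in a Java String.

```java
/** Hypothetical sketch: char counts vs. code point counts in a String. */
class SurrogatePairSketch {

    /** Number of UTF-16 chars; this is the unit that char offsets count in. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points; a surrogate pair counts once. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** True if a char offset falls between the two halves of a surrogate pair. */
    static boolean splitsSurrogatePair(String s, int index) {
        return index > 0 && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        String s = "a\uD800\uDF30b"; // 'a', U+10330 as a surrogate pair, 'b'
        System.out.println(charCount(s));
        System.out.println(codePointCount(s));
        System.out.println(splitsSurrogatePair(s, 2));
    }
}
```

A highlighter that cuts a String at an offset for which
splitsSurrogatePair returns true would break the character in two, which
is why offsets need to respect pair boundaries.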

Hope this helps.

Many thanks for all your efforts!

In Him,
     DM