[jsword-devel] Lucene 2.9 and JSword
DM Smith
dmsmith at crosswire.org
Mon Nov 23 10:43:35 MST 2009
On 11/23/2009 11:28 AM, Tonny Kohar wrote:
> Hi,
>
> As I noticed JSword is now using Apache Lucene 2.9,
Yes and no. I've upgraded trunk to use Lucene 2.9.0. There's a serious
bug in it and we need to update to 2.9.1. (Should be just a jar replacement)
We still have a bit more work before we can release based on it.
Specifically, I'm still working on the changes to the Lucene analyzers
to bring them up to date with all the changes in 2.9.
One of the problems with 2.9 is that it will invalidate nearly all prior
indexes. If I'm not mistaken only simple ASCII text is unaffected. As a
result, we need to put a mechanism in place to handle this. There is a
rudimentary mechanism, but it only handles version-specific features and
does not store any version information along with the index.
Basically we need to create a manifest for each index outlining each
contributing component and its compatibility version. If from one
release to the next any component's version is different, then the index
needs to be invalidated. It may be that searches will work for the most
part, but they won't be as accurate. Some of the things that might need
to be included in the manifest:
- The VM's level of Unicode support. Just found out that moving from Java
  1.4.x to Java 5 may cause problems. Some of the properties of characters
  changed and this affects tokenization.
- Lucene's tokenizer and/or analyzer in use for the index. In most cases
  we use SimpleAnalyzer and this is still OK. However, some use
  StandardAnalyzer and this has subtle changes. The StandardAnalyzer is
  really not appropriate for JSword, so I am changing these to SimpleAnalyzer.
- Some indexes can be created with stopword lists. If the list changes,
  then the index can be bad.
- And so forth.
So for the next release, I'm thinking of adding version info to newly
created indexes and treating any index without version info as invalid.
"Invalid" will be a flag which the frontend can choose to ignore.
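To make the manifest idea a bit more concrete, here is a minimal sketch in
plain Java. Nothing like this exists in JSword yet: the class name
IndexManifest, its methods, and the component keys are all made up for
illustration. It assumes the manifest is a flat set of component/version
pairs that could be saved as a properties file next to the index, and that
any difference at all means the index is invalid.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of JSword or Lucene.
final class IndexManifest {
    // Component name -> compatibility version, e.g. "lucene" -> "2.9.1".
    private final Map<String, String> components = new TreeMap<String, String>();

    IndexManifest put(String component, String version) {
        components.put(component, version);
        return this;
    }

    // Flatten to Properties so it can be stored alongside the index.
    Properties toProperties() {
        Properties props = new Properties();
        for (Map.Entry<String, String> e : components.entrySet()) {
            props.setProperty(e.getKey(), e.getValue());
        }
        return props;
    }

    // Rebuild a manifest from stored properties; null means "no version info".
    static IndexManifest fromProperties(Properties props) {
        IndexManifest manifest = new IndexManifest();
        if (props != null) {
            for (String key : props.stringPropertyNames()) {
                manifest.put(key, props.getProperty(key));
            }
        }
        return manifest;
    }

    // Valid only if version info exists and every component matches exactly.
    boolean isCompatibleWith(IndexManifest current) {
        return !components.isEmpty() && components.equals(current.components);
    }
}
```

A frontend could build the "current" manifest at startup, load the stored
one for each index, and flag the index when isCompatibleWith returns false.
An index with no stored manifest comes back empty and so is flagged too.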
> and I was reading
> an interesting article regarding Lucene 2.9 especially the "Term
> Vector based highligther" as follow
>
> "Term vector-based highlighter: a new term highlighter implementation based on
> term vectors (essentially a view of terms, offsets, and positions in a documents
> field). It supports features like N-Gram fields and phrase-unit
> highlighting with
> slops and yields good performance on large documents. The downside is that it
> requires a lot more disk space due to stored term vectors."
>
Currently, AFAICT, we don't store term vectors. Adding them would also
invalidate the index. I don't have an issue with storing them as a
user option. For most users, even with limited space, it is not an
issue because they only index a few Bibles.
> then, I was thinking is this new feature from Lucene 2.9 can be used
> to provide JSword search highlight features ?
>
Sounds good to me.
> The reason I ask this because I do not know much regarding Lucene 2.9,
> and because it seem easy enough (correct me, if I am wrong, the hard
> work has been provided by lucene itself) just add the word/term offset
> to the index then retrieve back during search, and apply the highlight
> to the output html/xml.
>
> The question are:
> - is my assumption correct ?
>
I don't know, but hope so.
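Just to illustrate the general idea of "record offsets, then wrap the
matches", here is a sketch in plain Java. It does not use the Lucene
highlighter or term vectors at all. The class OffsetHighlighter and its
methods are invented for this example, the tokenization (split at
non-letters, compare lower-cased) is only a rough stand-in for what a
letter-based analyzer does, and it ignores escaping of the output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not JSword or Lucene code.
final class OffsetHighlighter {
    // Returns {start, end} char offsets of each run of letters in the text.
    static List<int[]> letterTokens(String text) {
        List<int[]> spans = new ArrayList<int[]>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!Character.isLetter(cp)) {
                i += Character.charCount(cp);
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetter(text.codePointAt(i))) {
                i += Character.charCount(text.codePointAt(i));
            }
            spans.add(new int[] {start, i});
        }
        return spans;
    }

    // Wraps every token equal to the term (ignoring case) in <b>...</b>.
    static String highlight(String text, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (int[] span : letterTokens(text)) {
            String token = text.substring(span[0], span[1]);
            if (token.toLowerCase(Locale.ROOT).equals(wanted)) {
                out.append(text, last, span[0]);
                out.append("<b>").append(token).append("</b>");
                last = span[1];
            }
        }
        return out.append(text.substring(last)).toString();
    }
}
```

With stored offsets the tokenizing step would not need to be repeated at
search time, which is the part Lucene is supposed to do for us.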
> - is it can be used for languange other than english ?
>
All languages. But it may require a correct tokenizer/analyzer pair.
Each release of Lucene improves this. And that is what I'm trying to
bring into JSword, now.
> - does UTF-8 (or the text encoding used by crosswire module) allow
> offset/byte counting ?
>
The encoding of a CrossWire module is either cp1252 (a MS variant of
Latin-1) or UTF-8. Once JSword reads it into a String, it is UTF-16.
This is the "char" of Java. So one would not do byte counting but char
counting. The only gotcha that I am aware of is the problem of surrogate
pairs, where two chars need to be treated as one character. (That's about
my full knowledge of surrogate pairs :)
The Lucene folks are very conscious of the surrogate pair issue and the
offsets and lengths that they give account for them.
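A small sketch of the char vs. character difference, using only standard
java.lang.String methods (available since Java 5). The class name
SurrogateDemo is just for the example. The sample string contains U+10330,
which lies outside the Basic Multilingual Plane and so takes two chars.

```java
// Illustration only; the class and method names are made up.
final class SurrogateDemo {
    // One letter, one supplementary character (a surrogate pair), one letter.
    static final String SAMPLE = "a\uD800\uDF30b";

    // Number of UTF-16 chars, which is what String offsets are measured in.
    static int charCount(String s) {
        return s.length();
    }

    // Number of actual Unicode characters (code points).
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Char offset reached after stepping over n code points from the start.
    static int charOffsetAfter(String s, int n) {
        return s.offsetByCodePoints(0, n);
    }
}
```

Here SAMPLE has 4 chars but only 3 characters, and the char offset after
the first two characters is 3, not 2. So any code that applies highlight
offsets has to take care not to cut a pair in half.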
Hope this helps.
Many thanks for all your efforts!
In Him,
DM