[jsword-devel] Big search check-in

Sun Oct 17 02:41:19 MST 2004

I spent a few minutes thinking about the Lucene Indexing process.

One of the big inefficiencies is in constructing an OSIS DOM tree (as
you noted, DM). I pondered adding a getPlainText(Key) method to Book
(note 1).
But this rather destroys one of the main benefits of moving to Lucene
- that you can index several bits of data against some document (read:
verse, in our case). If we insist on pumping text only into Lucene then
we are saying we don't need that functionality that we thought would
be useful only a few weeks ago.
Fundamentally, OSIS should be a content-only description of everything
we need to know about the content. This seems like the ideal thing to
index really.

Secondly, we could grab larger chunks of data, so rather than attacking
the problem one verse at a time we could attack it one
chapter/book/Bible at a time. By doing this I think that we are likely
to save on the creation of a few OSIS preamble Elements, but lose on a
larger memory footprint and in having to split the data into
separate verse elements once it is retrieved. This sounds quite
complex to me.
I suppose the ultimate system would be to index the entire Bible in a
single hit and make use of a SAX streaming API to reduce memory
footprint and drive the indexing process via SAX. Post 1.0 probably!
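
Roughly what a SAX-driven pass might look like, as a sketch only. It
assumes verses appear as container elements of the form
<verse osisID="...">text</verse>, which is a simplification (milestone
style verses would need different handling). VerseHandler and
StreamingIndexSketch are made-up names, and the map stands in for
whatever would actually feed Lucene.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects verse text as the parser streams past it.
// No DOM tree is built at any point.
class VerseHandler extends DefaultHandler {
    final Map<String, String> verses = new LinkedHashMap<>();
    private String currentId; // osisID of the verse being read, or null
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumption: verses are container elements carrying an osisID attribute.
        if ("verse".equals(qName)) {
            currentId = atts.getValue("osisID");
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep character data that falls inside a verse.
        if (currentId != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("verse".equals(qName)) {
            // A real indexer would hand the verse to Lucene here
            // instead of storing it in a map.
            verses.put(currentId, text.toString().trim());
            currentId = null;
        }
    }
}

// Hypothetical driver: parses an XML string and returns osisID -> verse text.
class StreamingIndexSketch {
    static Map<String, String> parse(String xml) {
        try {
            VerseHandler handler = new VerseHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.verses;
        } catch (Exception e) {
            throw new RuntimeException("SAX parse failed", e);
        }
    }
}
```

Memory use here is bounded by the longest single verse rather than by
the size of the whole Bible, which is the point of streaming.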

So, in summary: I think that the current index system should be
optimized at a micro level and not a macro level.

Joe.

Note 1. getPlainText(Key) could be easily implemented by default in
AbstractBook by calling getPlainText() on the BookData from
Book.getData(Key), and then overridden in SwordBook to save building
the DOM tree.
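
A minimal sketch of what Note 1 describes. The types below are
simplified stand-ins, not the real JSword interfaces: Key is reduced to
a String, the raw module text is a map, and building the DOM tree is
simulated by a counter so the saving is visible.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for illustration only; not the real JSword types.
interface BookData {
    String getPlainText();
}

interface Book {
    BookData getData(String key);

    String getPlainText(String key);
}

abstract class AbstractBook implements Book {
    // Default: go through BookData, which means paying for the DOM build.
    @Override
    public String getPlainText(String key) {
        return getData(key).getPlainText();
    }
}

class SwordBook extends AbstractBook {
    private final Map<String, String> rawText = new HashMap<>();
    int domBuilds = 0; // counts how often the expensive path is taken

    SwordBook() {
        rawText.put("Gen.1.1", "In the beginning God created the heaven and the earth.");
    }

    @Override
    public BookData getData(String key) {
        domBuilds++; // stands in for constructing the OSIS DOM tree
        final String text = rawText.get(key);
        return () -> text;
    }

    // Override: read the raw text directly and skip the DOM entirely.
    @Override
    public String getPlainText(String key) {
        return rawText.get(key);
    }
}
```

With this shape the indexer can call getPlainText(key) on any Book, and
only the implementations that have a cheaper route need to override it.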