[jsword-devel] Big search check-in

Sat Oct 2 04:25:26 MST 2004

Joe,

I saw your posting on Sword asking about cached Lucene indexes.
I don't think that we will want to use such an index, because we may 
wish to index more or less than it covers (e.g. Strong's numbers).

That aside, I think that one of the problems with how we do indexing 
is how we get the text of a verse.
(Correct me if I am wrong, as I am doing this off the top of my head.)
Each verse is gotten from local disk via
    determine the block
    see if the block is in memory cache
    if not
    then
        open file
        seek to start of block
        read block
        close file
    endif
    return verse from cache
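
To make the steps above concrete, here is a minimal Java sketch. It is not JSword's actual code: the BlockCache class, its fields, and the assumption of fixed-size, uncompressed blocks with verses that never span a block are all made up for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; real modules may be compressed and
// use an index file to locate blocks and verses.
class BlockCache {
    private final String path;
    private final int blockSize;
    private final Map<Long, byte[]> cache = new HashMap<>();
    int diskReads = 0; // number of cache misses, kept for illustration

    BlockCache(String path, int blockSize) {
        this.path = path;
        this.blockSize = blockSize;
    }

    // Return the block from the memory cache, reading it from disk on a miss.
    byte[] getBlock(long blockIndex) throws IOException {
        byte[] block = cache.get(blockIndex);
        if (block == null) {
            // open file, seek to start of block, read block, close file
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                long start = blockIndex * blockSize;
                if (start >= file.length()) {
                    throw new IOException("block " + blockIndex + " is past end of file");
                }
                int length = (int) Math.min(blockSize, file.length() - start);
                block = new byte[length];
                file.seek(start);
                file.readFully(block);
            }
            cache.put(blockIndex, block);
            diskReads++;
        }
        return block;
    }

    // Assumes the verse lies entirely within one block.
    String getVerse(long offset, int length) throws IOException {
        long blockIndex = offset / blockSize;
        int within = (int) (offset % blockSize);
        return new String(getBlock(blockIndex), within, length, StandardCharsets.UTF_8);
    }
}
```

Note that every cache miss pays for a full open/seek/read/close cycle, which is part of why reading verse by verse is expensive.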

Once the verse is gotten it is parsed with a recursive descent parser.

The parsing transforms the passage into an OSIS DOM.

The OSIS DOM is examined for the text and just the text is returned.

There are several expenses here that are not necessary if all we care 
about is the text in the verse. And in this case we could use a shared 
cached index, kept on the server. (I think that we want to do more than 
just text indexing, as the request for cross-referencing Strong's 
numbers would require more.)

1) We get a verse at a time. It would be better to get a range at a time. 
The technical roadblock is that the module may not contain the verse 
markers. The most recent OSIS modules actually have the verse markers. I 
don't remember seeing them in any other modules. If we could synthesize 
the absent verse markers in each markup then we can get around this. I 
am not sure what the boundaries of the range should be (testament, book, 
chapter, # of verses). It does not need to match the "blocking" of the 
module, but it may be prudent to take it into consideration.

2) It would be better if the getting of the verses was a streaming 
co-process that did not consume large amounts of memory. It seems that 
the caching is only necessary for modules that are zipped. I don't know 
if they need to stay packed on local disk. Maybe space is tight on an 
old laptop?

3) The parsing cares about everything: verse, notes, Strong's numbers, 
.... If we had parsers that would ignore everything but verse text, 
that would be better. Perhaps a parsing mode.

4) Building a DOM is unnecessary if it is going to be thrown away.

5) The building of the index is serial. Can it be done in parallel 
threads? Can we do the OT and the NT at the same time? Or is it just too 
much to ask for an old laptop? We should have at least two threads 
(producer that gets the text to be indexed and consumer that indexes the 
text).
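
As a rough sketch of the two-thread idea in point 5, the following uses a bounded BlockingQueue from java.util.concurrent. The IndexPipeline class and the END sentinel are hypothetical names, and lower-casing a string merely stands in for handing the text to the real indexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; not JSword's indexing code.
class IndexPipeline {
    // Sentinel marking the end of input; compared by identity, not equals().
    private static final String END = new String("END");

    static List<String> run(List<String> verses) throws InterruptedException {
        // The bounded queue keeps the producer from running far ahead,
        // which limits how much text is held in memory at once.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(64);
        List<String> indexed = new ArrayList<>();

        // Producer: gets the text to be indexed.
        Thread producer = new Thread(() -> {
            try {
                for (String verse : verses) {
                    queue.put(verse);
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: indexes the text (here, a trivial stand-in).
        Thread consumer = new Thread(() -> {
            try {
                for (String verse = queue.take(); verse != END; verse = queue.take()) {
                    indexed.add(verse.toLowerCase());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return indexed;
    }
}
```

Whether this helps on an old laptop depends on whether reading or indexing is the bottleneck; with one slow disk, the gain may be small.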

Another thought on the interface wrt indexing. We have talked about 
indexing in the background on a low priority thread immediately after 
downloading. Can we set a state on the BookMetaData, isIndexed(), which 
would return true if the indexing is complete? The search 
functionalities would be disabled if the selected Bible is not indexed 
and they would listen to the BookMetaData for IndexEvents (or a bound 
property event) to dynamically change state when the index is done.
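
A minimal sketch of the bound property variant, using java.beans.PropertyChangeSupport. The IndexStatus class is a hypothetical stand-in for whatever BookMetaData would actually expose, and the property name "indexed" is an assumption.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

// Hypothetical stand-in for the indexing state on BookMetaData.
class IndexStatus {
    private final PropertyChangeSupport support = new PropertyChangeSupport(this);
    private boolean indexed;

    boolean isIndexed() {
        return indexed;
    }

    // Called by the indexing thread when the index is complete.
    void setIndexed(boolean newValue) {
        boolean oldValue = indexed;
        indexed = newValue;
        // No event is fired when the old and new values are equal.
        support.firePropertyChange("indexed", oldValue, newValue);
    }

    void addPropertyChangeListener(PropertyChangeListener listener) {
        support.addPropertyChangeListener(listener);
    }
}
```

A search panel could then register a listener that enables its controls when the new value is true. In a Swing UI that update would have to be moved onto the event dispatch thread, since the event is fired on the indexing thread.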

If indexing is slow, we would need to know if the index is completed. 
For example, the user (or OS) could shut down the application while 
indexing. Then that index would need to be rebuilt or the indexing would 
need to pick up where it left off. We may want to do this for post 1.0.

Finally, could the program zip up the index and upload it to the 
CrossWire server if it is not there yet for that version of the 
module? That way everyone gets to share. We should probably name the 
index with a version number so that if we change the indexing scheme we 
can notice that the index is obsolete and needs to be rebuilt.
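
One possible way to carry the scheme version in the index name is sketched below. The IndexName class, the "-idx<N>.zip" naming pattern, and the CURRENT_SCHEME value are all invented for illustration; nothing like this has been agreed.

```java
// Hypothetical naming scheme for shared index archives.
class IndexName {
    // Bumped whenever the indexing scheme changes (value is made up).
    static final int CURRENT_SCHEME = 2;

    // e.g. fileName("KJV", "1.1", 2) gives "KJV-1.1-idx2.zip"
    static String fileName(String module, String moduleVersion, int scheme) {
        return module + "-" + moduleVersion + "-idx" + scheme + ".zip";
    }

    // An index is obsolete if its scheme number is older than the current
    // one, or if the name cannot be parsed at all.
    static boolean isObsolete(String fileName) {
        int start = fileName.lastIndexOf("-idx");
        int end = fileName.lastIndexOf(".zip");
        if (start < 0 || end < start) {
            return true;
        }
        try {
            int scheme = Integer.parseInt(fileName.substring(start + 4, end));
            return scheme < CURRENT_SCHEME;
        } catch (NumberFormatException e) {
            return true;
        }
    }
}
```

Treating an unparseable name as obsolete errs on the side of rebuilding the index rather than trusting an unknown one.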

DM

Joe Walker wrote:

>Hi,
>
>My search changes have landed, and three bugs have been fixed, JS-5,6 and 7.
>I'm going to tinker with the search UI next as previously noted, and
>carry on with JS-1 to attempt to implement Thesaurus in Lucene.
>
>Joe.
>