[jsword-devel] Big search check-in
DM Smith
dmsmith555 at yahoo.com
Sat Oct 2 04:25:26 MST 2004
Joe,
I saw your posting on Sword asking about cached Lucene indexes.
I don't think that we will want to use that index, because we may wish to
index more or less than it does (e.g. Strong's numbers).
That aside, I think that one of the problems with how we do indexing
lies in how we get the text of a verse.
(Correct me if I am wrong, as I am doing this off the top of my head.)
Each verse is gotten from local disk via:
    determine the block
    see if the block is in memory cache
    if not
    then
        open file
        seek to start of block
        read block
        close file
    endif
    return verse from cache
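
As a rough illustration of the steps above, here is a minimal Java sketch. Everything in it is an assumption made for the example: the class name BlockCache, the fixed block size, and the single flat file are all hypothetical, and real zipped modules use variable-sized, compressed blocks located through an index file. It only shows the open/seek/read/close cost that is paid on a cache miss.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: fixed-size blocks in one file, with an LRU cache of raw blocks.
class BlockCache {
    private final String path;
    private final int blockSize;
    private final Map<Long, byte[]> cache;
    private int loads; // number of times we had to go to disk

    BlockCache(String path, int blockSize, final int maxBlocks) {
        this.path = path;
        this.blockSize = blockSize;
        // Access-ordered map that evicts the least recently used block.
        this.cache = new LinkedHashMap<Long, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > maxBlocks;
            }
        };
    }

    /** Returns length bytes starting at offset; the span must lie within one block. */
    byte[] read(long offset, int length) throws IOException {
        long block = offset / blockSize;      // determine the block
        byte[] data = cache.get(block);       // see if the block is in the memory cache
        if (data == null) {
            data = loadBlock(block);          // if not, read it from disk
            cache.put(block, data);
        }
        byte[] out = new byte[length];
        System.arraycopy(data, (int) (offset % blockSize), out, 0, length);
        return out;                           // return the verse bytes from the cache
    }

    int loads() {
        return loads;
    }

    private byte[] loadBlock(long block) throws IOException {
        loads++;
        // try-with-resources opens the file here and closes it on exit.
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long start = block * blockSize;
            file.seek(start);                 // seek to start of block
            int len = (int) Math.min(blockSize, file.length() - start);
            byte[] data = new byte[len];
            file.readFully(data);             // read block
            return data;
        }
    }
}
```

Two reads that fall in the same block cost one file open; a read in a different block costs another.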
Once the verse is gotten it is parsed with a recursive descent parser.
The parsing transforms the passage into an OSIS DOM.
The OSIS DOM is examined for the text and just the text is returned.
There are several expenses here that are not necessary if all we care
about is the text in the verse. And in this case we could use a shared
cached index, kept on the server. (I think that we want to do more than
just text indexing, since the request for cross-referencing Strong's
numbers would require more.)
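
To make the text-only case concrete, here is a small Java sketch that pulls just the text out of an OSIS-like XML fragment with SAX, so no DOM is built. It is only a sketch under assumptions: the class name TextOnlyHandler is hypothetical, the input is assumed to be well-formed XML without namespace prefixes, and only <note> content is skipped. It is not how the existing parsers work.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collect character data, ignoring anything inside <note>.
class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int noteDepth; // > 0 while we are inside a <note> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Track nesting so that elements inside a note are skipped too.
        if (noteDepth > 0 || qName.equals("note")) {
            noteDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (noteDepth > 0) {
            noteDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (noteDepth == 0) {
            text.append(ch, start, length);
        }
    }

    /** Parses one well-formed fragment and returns only its verse text. */
    static String extract(String osisFragment) throws Exception {
        TextOnlyHandler handler = new TextOnlyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(osisFragment)), handler);
        return handler.text.toString();
    }
}
```

Markup such as <w> is passed over while its text is kept, and the whole note, including nested elements, is dropped.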
1) We get one verse at a time. It would be better to get a range at a time.
The technical roadblock is that the module may not contain the verse
markers. The most recent OSIS modules actually have the verse markers. I
don't remember seeing them in any other modules. If we could synthesize
the absent verse markers in each markup then we can get around this. I
am not sure what the boundaries of the range should be (testament, book,
chapter, # of verses). It does not need to match the "blocking" of the
module, but it may be prudent to take it into consideration.
2) It would be better if the getting of the verses was a streaming
co-process that did not consume large amounts of memory. It seems that
the caching is only necessary for modules that are zipped. I don't know
if they need to stay packed on local disk. Maybe space is tight on an
old laptop?
3) The parsing cares about everything: verses, notes, Strong's numbers,
and so on. If we had parsers that would ignore everything but verse
text, that would be better. Perhaps a parsing mode.
4) Building a DOM is unnecessary if it is going to be thrown away.
5) The building of the index is serial. Can it be done in parallel
threads? Can we do the OT and the NT at the same time? Or is it just too
much to ask for an old laptop? We should have at least two threads
(producer that gets the text to be indexed and consumer that indexes the
text).
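
The two-thread arrangement in point 5 could be sketched in Java roughly as follows. This is an illustration under assumptions, not a proposal for the actual code: IndexPipeline is a hypothetical name, each verse is represented as a {key, text} pair, and the "indexing" step is a stand-in that lowercases the text where the real consumer would hand it to Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: one thread fetches verse text, another indexes it.
class IndexPipeline {
    // Sentinel object telling the consumer that no more verses will arrive.
    private static final String[] DONE = new String[0];

    /** Each input element is a {key, text} pair; returns the "indexed" entries. */
    static List<String> run(final List<String[]> verses) throws InterruptedException {
        // A bounded queue keeps memory use flat if the producer is the faster side.
        final BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(64);
        final List<String> indexed = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (String[] verse : verses) {
                    queue.put(verse);          // blocks while the queue is full
                }
                queue.put(DONE);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (String[] v = queue.take(); v != DONE; v = queue.take()) {
                    // Stand-in for the real indexing call.
                    indexed.add(v[0] + "=" + v[1].toLowerCase(Locale.ROOT));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join(); // after join(), the consumer's writes are visible here
        return indexed;
    }
}
```

Doing the OT and NT at the same time would mean more than one producer, and then a single sentinel is no longer enough to signal the end. The sketch also does not handle a producer that is interrupted before it queues the sentinel.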
Another thought on the interface wrt indexing. We have talked about
indexing in the background on a low priority thread immediately after
downloading. Could we add a state to BookMetaData, isIndexed(), that
returns true once indexing is complete? The search functionality would
be disabled if the selected Bible is not indexed, and would listen to
the BookMetaData for IndexEvents (or a bound property event) to change
state dynamically when the index is done.
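
The bound-property variant might look something like the Java sketch below, using the standard java.beans support classes. The class name IndexStatus and the property name "indexed" are assumptions made for the example; in practice the state would live on BookMetaData itself.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

// Hypothetical sketch of an "indexed" bound property.
class IndexStatus {
    private final PropertyChangeSupport changes = new PropertyChangeSupport(this);
    private volatile boolean indexed;

    boolean isIndexed() {
        return indexed;
    }

    void setIndexed(boolean newValue) {
        boolean oldValue = indexed;
        indexed = newValue;
        // No event is fired when the old and new values are equal.
        changes.firePropertyChange("indexed", oldValue, newValue);
    }

    void addPropertyChangeListener(PropertyChangeListener listener) {
        changes.addPropertyChangeListener(listener);
    }

    void removePropertyChangeListener(PropertyChangeListener listener) {
        changes.removePropertyChangeListener(listener);
    }
}
```

A search control would register a listener and enable itself when the new value is true. Listeners run on whichever thread calls setIndexed(), so with a background indexing thread the UI update would need to be handed to the event thread, for example via SwingUtilities.invokeLater.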
If indexing is slow, we would need to know whether the index was
completed. For example, the user (or OS) could shut down the application
while indexing. Then that index would need to be rebuilt, or the indexing
would need to pick up where it left off. We may want to do this post-1.0.
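
One simple way to detect an interrupted build, sketched in Java below, is to write a marker file as the very last step and treat any index directory without it as incomplete. The class name IndexMarker and the file name index.complete are made up for the example. This only covers the "rebuild" option, not picking up where the indexing left off.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: a marker file written only after indexing finishes.
class IndexMarker {
    private static final String MARKER = "index.complete"; // assumed file name

    /** Call this as the last step of a successful indexing run. */
    static void markComplete(Path indexDir) throws IOException {
        Files.write(indexDir.resolve(MARKER), new byte[0]);
    }

    /** False if indexing never started or was cut short before the last step. */
    static boolean isComplete(Path indexDir) {
        return Files.exists(indexDir.resolve(MARKER));
    }
}
```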
Finally, could the program zip up the index and upload it to the
CrossWire server if it were not there yet for that version of the
module? That way everyone gets to share. We should probably name the
index with a version number so that if we change the indexing scheme we
can notice that the index is obsolete and needs to be rebuilt.
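
The naming idea could be as simple as the Java sketch below. The name layout, the class IndexName, and the SCHEME_VERSION constant are all assumptions for illustration; the point is only that both the module version and the indexing-scheme version are part of the name, so a mismatch in either shows the index is stale.

```java
// Hypothetical sketch: encode module version and index scheme version in the name.
class IndexName {
    // Bumped whenever the indexing scheme changes (assumed value).
    static final int SCHEME_VERSION = 2;

    static String fileName(String module, String moduleVersion) {
        return module + "-" + moduleVersion + "-idx" + SCHEME_VERSION + ".zip";
    }

    /** True if an existing index file matches the current module and scheme. */
    static boolean isCurrent(String existingName, String module, String moduleVersion) {
        return existingName.equals(fileName(module, moduleVersion));
    }
}
```

With these assumed values, fileName("KJV", "1.1") gives "KJV-1.1-idx2.zip", and an older "KJV-1.1-idx1.zip" is reported as not current.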
DM
Joe Walker wrote:
>Hi,
>
>My search changes have landed, and three bugs have been fixed, JS-5,6 and 7.
>I'm going to tinker with the search UI next as previously noted, and
>carry on with JS-1 to attempt to implement Thesaurus in Lucene.
>
>Joe.
>