[jsword-devel] Big search check-in

Joe Walker joseph.walker at gmail.com
Sun Oct 3 13:28:37 MST 2004


I was imagining that we would keep the Strongs number index separate.
That way it can be used with any version and not just with versions
that have been marked up.
However, I think that you are right about sharing indexes. Even if they
are totally compatible, there is a very good chance that we will have
slightly differing requirements, or differing upgrade times.

I think the list of optimizations is probably dead-on. The top
priority for me is to make indexes downloadable, because we would need
to do some stunning optimizations to make indexing keep up with
downloading on broadband, and even keeping up with dial-up would be
a challenge. We probably need to run a profiler across an index run
to know for sure where the best gains are to be had.
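
As a rough illustration of the two-thread split from DM's list (a
producer that gets the text and a consumer that indexes it), something
like the following might do. This is only a sketch: IndexPipeline, run()
and the END sentinel are made-up names, the verse list stands in for
reading from a module, and lower-casing stands in for adding text to the
index. No JSword or Lucene calls are shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of a two-thread indexing pipeline; not JSword code. */
class IndexPipeline {
    /** Sentinel telling the consumer that no more text will arrive. */
    private static final String END = new String("<end>");

    /** Hands verses from a producer thread to a consumer thread via a bounded queue. */
    static List<String> run(List<String> verses) throws InterruptedException {
        // Bounded queue: the producer blocks when it gets too far ahead,
        // which keeps memory use small.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(64);
        List<String> indexed = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (String verse : verses) {
                    queue.put(verse); // assumption: stands in for reading the module
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                // Identity comparison is intentional: END is a unique object.
                for (String text = queue.take(); text != END; text = queue.take()) {
                    indexed.add(text.toLowerCase()); // assumption: stands in for indexing
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join(); // join makes the consumer's writes visible to the caller
        return indexed;
    }
}
```

Whether this buys anything depends on where the time actually goes,
which is what the profiler run would tell us.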

I like the isIndexed() idea. The IndexManager can now index a book at
any time, so maybe the Book manager could have a way of indicating index
state and kicking off an index generation?
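
To make the idea concrete, here is a minimal sketch of an index state
exposed as a bound property, so that search UI could listen and enable
itself when indexing finishes. The class name IndexState and the
property name "indexed" are assumptions for illustration only; this is
not the existing BookMetaData or IndexManager API.

```java
import java.beans.PropertyChangeListener;
import java.beans.PropertyChangeSupport;

/** Hypothetical sketch of a bound "indexed" property; not the JSword API. */
class IndexState {
    /** Assumed name of the property fired when the index state changes. */
    static final String PROP_INDEXED = "indexed";

    private final PropertyChangeSupport listeners = new PropertyChangeSupport(this);
    private volatile boolean indexed;

    /** Returns true once indexing has completed. */
    boolean isIndexed() {
        return indexed;
    }

    /** Updates the state and notifies listeners if the value changed. */
    void setIndexed(boolean newValue) {
        boolean oldValue = indexed;
        indexed = newValue;
        // PropertyChangeSupport fires no event when old and new are equal.
        listeners.firePropertyChange(PROP_INDEXED, oldValue, newValue);
    }

    void addPropertyChangeListener(PropertyChangeListener listener) {
        listeners.addPropertyChangeListener(listener);
    }

    void removePropertyChangeListener(PropertyChangeListener listener) {
        listeners.removePropertyChangeListener(listener);
    }
}
```

A search panel would register a listener and enable its controls when
the new value is true; the indexing thread would call setIndexed(true)
when it is done.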

I'm not keen on the idea of uploading an index if one does not exist
on the server. There are all sorts of security issues and checking
could be as costly as indexing ourselves. We could easily have a cron
job to keep indexes up to date and regenerate out-of-date indexes as
required. (We would need to do the initial generation run to avoid
killing Crosswire.)

Thanks,

Joe.


On Sat, 02 Oct 2004 07:25:26 -0400, DM Smith <dmsmith555 at yahoo.com> wrote:
> Joe,
> 
> I saw your posting on Sword asking about cached Lucene indexes.
> I don't think that we will want to use the index because we may wish to
> index more or less. (e.g. Strongs numbers)
>
> That aside, I think that one of the problems with how we do indexing
> deals with how we get the text of a verse.
> (Correct me if I am wrong, as I am doing this off the top of my head).
> Each verse is gotten from local disk via
>     determine the block
>     see if the block is in memory cache
>     if not
>     then
>         open file
>         seek to start of block
>         read block
>         close file
>     endif
>     return verse from cache.
> 
> Once the verse is gotten it is parsed with a recursive descent parser.
> 
> The parsing transforms the passage into an OSIS DOM.
> 
> The OSIS DOM is examined for the text and just the text is returned.
> 
> There are several expenses here that are not necessary if all we care
> about is the text in the verse. And in this case we could use a shared
> cached index, kept on the server. (I think that we want to do more than
> just text indexing, as the request for cross-referencing Strongs would
> require more.)
> 
> 1) We get a verse at a time. It would be better to get a range at a time.
> The technical roadblock is that the module may not contain the verse
> markers. The most recent OSIS modules actually have the verse markers. I
> don't remember seeing them in any other modules. If we could synthesize
> the absent verse markers in each markup then we can get around this. I
> am not sure what the boundaries of the range should be (testament, book,
> chapter, # of verses). It does not need to match the "blocking" of the
> module, but it may be prudent to take it into consideration.
> 
> 2) It would be better if the getting of the verses was a streaming
> co-process that did not consume large amounts of memory. It seems that
> the caching is only necessary for modules that are zipped. I don't know
> if they need to stay packed on local disk. Maybe space is tight on an
> old laptop?
> 
> 3) The parsing cares about everything: verses, notes, Strongs, ... If we
> had parsers that would ignore everything but verse text, that would be
> better. Perhaps a parsing mode.
> 
> 4) Building a DOM is unnecessary if it is going to be thrown away.
> 
> 5) The building of the index is serial. Can it be done in parallel
> threads? Can we do the OT and the NT at the same time? Or is it just too
> much to ask for an old laptop? We should have at least two threads
> (producer that gets the text to be indexed and consumer that indexes the
> text).
> 
> Another thought on the interface wrt indexing. We have talked about
> indexing in the background on a low priority thread immediately after
> downloading. Can we add a state to BookMetaData, isIndexed(), which
> would return true once the indexing is complete? The search
> functionalities would be disabled if the selected Bible is not indexed
> and they would listen to the BookMetaData for IndexEvents (or a bound
> property event) to dynamically change state when the index is done.
> 
> If indexing is slow, we would need to know if the index is completed.
> For example, the user (or OS) could shut down the application while
> indexing. Then that index would need to be rebuilt or the indexing would
> need to pick up where it left off. We may want to do this for post 1.0.
> 
> Finally, could the program zip up the index and upload it to the
> crosswire server if it were not there yet for that version of the
> module? That way everyone gets to share. We should probably name the
> index with a version number so that if we change the indexing scheme we
> can notice that the index is obsolete and needs to be rebuilt.
> 
> DM
> 
> 
> 
> Joe Walker wrote:
> 
> >Hi,
> >
> >My search changes have landed, and three bugs have been fixed: JS-5, JS-6 and JS-7.
> >I'm going to tinker with the search UI next as previously noted, and
> >carry on with JS-1 to attempt to implement Thesaurus in Lucene.
> >
> >Joe.
> >
> 
> _______________________________________________
> jsword-devel mailing list
> jsword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/jsword-devel
>

