[jsword-devel] Search Highlight

Tonny Kohar tonny.kohar at gmail.com
Sun Apr 26 20:58:16 MST 2009


On Sat, Apr 25, 2009 at 6:29 PM, DM Smith <dmsmith at crosswire.org> wrote:
> Now you see why I haven't tackled it;) Maybe why others have not either.
> Yes, the way search works in JSword is that the Lucene index is searched and
> from each Lucene Document (i.e. a verse) the reference is retrieved and is
> converted into a JSword Key. The result is a list of hits. The list is then
> used to retrieve raw text from the module. This then is assembled into an
> xml OSIS document (possibly by transforming it from ThML, GBF or PlainText,
> and treating TEI as if it were OSIS.) This is then transformed with xsl into
> xhtml and displayed in a "browser".
> However, Lucene search highlighting requires
> the content to be returned (rather than the key field) so that the
> analyzer (token stream) can be intercepted to insert some tag/marking.
> I've been lurking on lucene-dev for a while and it is my impression that in
> order to highlight the text the analysis is redone on the search result.
> Given that we index the module in its format but transform it stepwise into
> a display format, it is not the original format that needs to be analyzed
> but rather the intermediate OSIS or final HTML.
> If I understand Lucene highlighting correctly, Lucene uses start and end offsets
> of tokens to do the markup. I don't think JSword stores this info in the
> module. It certainly wouldn't be useful. Right now our analyzers are fed
> plain text, they don't parse xml. And to store notes, headings, .... into
> fields, we get these as plain text, too.

Yes, I guess that is the problem :)
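
To make the offset issue concrete, here is a minimal sketch in plain Java.
It does not call Lucene at all; the names Span, findMatches and highlight
are hypothetical and only illustrate the idea of marking up text from token
start/end offsets. The offsets are only valid for the exact string that was
tokenized, which is why plain-text offsets cannot be applied directly to the
OSIS or HTML produced later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class OffsetHighlightSketch {

    /** Character span of one matched token in the analyzed text: [start, end). */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Naive stand-in for an analyzer: splits on non-alphanumeric characters
     * and records the offsets of tokens equal to the search term (ignoring case).
     */
    static List<Span> findMatches(String text, String term) {
        List<Span> spans = new ArrayList<>();
        String wanted = term.toLowerCase(Locale.ROOT);
        int i = 0;
        while (i < text.length()) {
            // Skip separators between tokens.
            while (i < text.length() && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start && text.substring(start, i).toLowerCase(Locale.ROOT).equals(wanted)) {
                spans.add(new Span(start, i));
            }
        }
        return spans;
    }

    /** Wraps every span in open/close markers; spans must be sorted and non-overlapping. */
    static String highlight(String text, List<Span> spans, String open, String close) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Span s : spans) {
            out.append(text, last, s.start);
            out.append(open).append(text, s.start, s.end).append(close);
            last = s.end;
        }
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String verse = "In the beginning God created the heaven and the earth.";
        // prints: In <b>the</b> beginning God created <b>the</b> heaven and <b>the</b> earth.
        System.out.println(highlight(verse, findMatches(verse, "the"), "<b>", "</b>"));
    }
}
```

If the same spans were applied to the verse after it had been wrapped in
markup, every offset would be shifted, so the analysis would have to be
redone on that text (or the offsets mapped across the transformation).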

I will look at how to implement Lucene highlighting in JSword
cleanly, but I am not sure whether I will succeed.
If I am able to implement it, I will submit the patch to JSword;
otherwise, it means I was not able to tackle the problem :)

> I don't have a problem with changing the API when it is appropriate. Now
> that Alkitab, FireBible and BibleDesktop are the declared users of it, I
> think we need to get agreement among ourselves how to make such
> changes. While anyone is free to use JSword for their own uses, my feeling
> is that, unless there is declared usage of JSword that is visible to the
> people on this list, they are left out of the discussion.

Yes, this is important because there are some projects which depend on
JSword. I will try as much as possible not to change the API; if a change
is needed, I will try to keep it as minimal as possible.

Note: do you have an Ant build script that builds things locally and
simply (only the application)? Currently the build script looks
in a lot of places, including the web section.

> This is the method in o.c.j.index.search.Searcher:
> Key search(SearchRequest request) throws BookException;
> Feel free to deprecate this and add something like:
> SearchResult getSearchResult(SearchRequest request) throws BookException.
> Or change it to
> SearchResult search(SearchRequest request) throws BookException;
> making SearchResult extend Key;
> A similar change would need to be done in o.c.j.index.query.Query.
> This would have been a better way, in the first place.

Yes, I will try to follow your idea here.
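
As a rough sketch of how I read the first option (deprecate search and add
getSearchResult, with SearchResult extending Key): the Key, SearchRequest
and BookException types below are tiny stand-ins, not the real JSword
classes, and what SearchResult should carry besides the key (here just the
original request) is purely my assumption. InMemorySearcher is hypothetical
and only exists so the sketch runs.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

public class SearchResultSketch {

    /** Stand-in for JSword's Key; the real interface is much larger. */
    interface Key {
        String getName();
    }

    /** Stand-in for JSword's BookException. */
    static class BookException extends Exception {
        BookException(String msg) {
            super(msg);
        }
    }

    /** Stand-in for SearchRequest; holds only the search term. */
    static final class SearchRequest {
        final String term;

        SearchRequest(String term) {
            this.term = term;
        }
    }

    /** Assumption: a Key that also remembers the request, for later highlighting. */
    interface SearchResult extends Key {
        SearchRequest getRequest();
    }

    interface Searcher {
        /** Old entry point, kept so existing callers still compile. */
        @Deprecated
        Key search(SearchRequest request) throws BookException;

        /** New entry point returning the richer result. */
        SearchResult getSearchResult(SearchRequest request) throws BookException;
    }

    /** Hypothetical searcher over a map of reference -> verse text. */
    static final class InMemorySearcher implements Searcher {
        private final Map<String, String> verses = new LinkedHashMap<>();

        InMemorySearcher add(String ref, String text) {
            verses.put(ref, text);
            return this;
        }

        @Override
        @Deprecated
        public Key search(SearchRequest request) throws BookException {
            // A SearchResult is a Key, so the old method can simply delegate.
            return getSearchResult(request);
        }

        @Override
        public SearchResult getSearchResult(final SearchRequest request) throws BookException {
            if (request == null || request.term == null || request.term.isEmpty()) {
                throw new BookException("empty search request");
            }
            String wanted = request.term.toLowerCase(Locale.ROOT);
            StringJoiner refs = new StringJoiner(" ");
            for (Map.Entry<String, String> e : verses.entrySet()) {
                if (e.getValue().toLowerCase(Locale.ROOT).contains(wanted)) {
                    refs.add(e.getKey());
                }
            }
            final String name = refs.toString();
            return new SearchResult() {
                @Override
                public String getName() {
                    return name;
                }

                @Override
                public SearchRequest getRequest() {
                    return request;
                }
            };
        }
    }

    public static void main(String[] args) throws BookException {
        Searcher searcher = new InMemorySearcher()
                .add("Gen.1.1", "In the beginning God created the heaven and the earth.")
                .add("John.1.1", "In the beginning was the Word");
        SearchResult result = searcher.getSearchResult(new SearchRequest("created"));
        // prints: Gen.1.1 (query: created)
        System.out.println(result.getName() + " (query: " + result.getRequest().term + ")");
    }
}
```

Because the result is still a Key, existing callers of search keep working
while new code can ask the result what was searched for.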

> My suggestion is to evaluate a Lucene solution, by writing a standalone
> Lucene highlighter against the index, using the JSword library to see what
> it takes and whether any changes need to be done to the index creation to
> get it to work (like storing tokens with positional info.)
> If we do need to change the index, we'll need to complete the versioning of
> the index so that we can notify the user that the index needs to be
> re-built.

Yes, it is a good idea to follow Lucene rather than reinvent the wheel.

Tonny Kohar
Alkitab Bible Study
