[jsword-devel] Search Highlight

DM Smith dmsmith at crosswire.org
Sat Apr 25 04:29:47 MST 2009


On Apr 25, 2009, at 1:06 AM, Tonny Kohar wrote:

> Hi,
>
> On Fri, Apr 24, 2009 at 6:32 PM, DM Smith <dmsmith at crosswire.org>  
> wrote:
>>
>> On Apr 24, 2009, at 2:42 AM, Tonny Kohar wrote:
>>
>>> Hi,
>>>
>>> Is there any plan to implement lucene search highlight for the  
>>> JSword
>>> engine/API ?
>>
>> Yes. See:
>> http://crosswire.org/bugs/browse/BD-27
>>
>> I have not bothered to work on it or to look at it. I welcome  
>> someone to
>> pick this up.
>>
>
> I just look at the package org.crosswire.jsword.index (and its
> subpackage lucene, query, search), the implementation is returning Key
> for the search query.

Now you see why I haven't tackled it;) Maybe why others have not either.

Yes, the way search works in JSword is that the Lucene index is
searched, and from each Lucene Document (i.e. a verse) the reference is
retrieved and converted into a JSword Key. The result is a list of
hits. The list is then used to retrieve raw text from the module. This
is then assembled into an OSIS XML document (possibly by transforming
it from ThML, GBF or PlainText, and treating TEI as if it were OSIS).
This is then transformed with XSL into XHTML and displayed in a
"browser".


> However the lucene search highlight require
> return of content (rather then the key field) for intercepting the
> analyzer (token stream) to insert some tag/marking.

I've been lurking on lucene-dev for a while and it is my impression  
that in order to highlight the text the analysis is redone on the  
search result. Given that we index the module in its format but  
transform it stepwise into a display format, it is not the original  
format that needs to be analyzed but rather the intermediate OSIS or  
final HTML.

If I understand Lucene highlighting correctly, Lucene uses the start
and end offsets of tokens to do the markup. I don't think JSword stores
this info in the module. It certainly wouldn't be useful. Right now our
analyzers are fed plain text; they don't parse XML. And to store
notes, headings, and so on in fields, we get these as plain text, too.
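
To make the offset idea concrete, here is a minimal sketch in plain
Java. It is not Lucene code: the Token class, the letter-based
tokenizer and the <b> markup are all hypothetical stand-ins. It only
illustrates that markup is inserted at the offsets recorded during
analysis, so the text that is analyzed has to be the text that is
displayed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an analyzer token: the term plus where it sits in the text.
final class Token {
    final String term;
    final int start; // offset of the first character
    final int end;   // offset just past the last character

    Token(String term, int start, int end) {
        this.term = term;
        this.start = start;
        this.end = end;
    }
}

final class OffsetHighlightSketch {

    // Treats each run of letters as a token, lower-cases it, and records its offsets.
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetter(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetter(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    // Wraps every token whose term is in queryTerms, using offsets into the original text.
    static String highlight(String text, Set<String> queryTerms) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokenize(text)) {
            if (queryTerms.contains(t.term)) {
                out.append(text, last, t.start);
                out.append("<b>").append(text, t.start, t.end).append("</b>");
                last = t.end;
            }
        }
        out.append(text.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("god", "heaven"));
        // prints: In the beginning <b>God</b> created the <b>heaven</b>.
        System.out.println(highlight("In the beginning God created the heaven.", terms));
    }
}
```

If the same offsets were applied to the OSIS or HTML form of the verse
instead of the plain text that was analyzed, the tags would land in the
wrong places.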

>
>
> So I quess the only way to implement search highlight for JSword
> without API changes

I don't have a problem with changing the API when it is appropriate.  
Now that Alkitab, FireBible and BibleDesktop are the declared users of  
it, I think we need to agree among ourselves on how to make
such changes. While anyone is free to use JSword for their own uses,  
my feeling is that, unless there is declared usage of JSword that is  
visible to the people on this list, they are left out of the discussion.

This is the method in o.c.j.index.search.Searcher:
Key search(SearchRequest request) throws BookException;

Feel free to deprecate this and add something like:
SearchResult getSearchResult(SearchRequest request) throws BookException;

Or change it to
SearchResult search(SearchRequest request) throws BookException;
making SearchResult extend Key.

A similar change would need to be done in o.c.j.index.query.Query.

This would have been a better way, in the first place.
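
As a rough sketch of the second option: the types below are simplified
stand-ins for Key, SearchRequest, BookException and Searcher (the real
JSword types have many more members), and getRequest(), getQueryText()
and FixedSearcher are invented here purely for illustration. The point
is that a SearchResult that is still a Key lets existing callers keep
working while giving a highlighter somewhere to find the query.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins; the real JSword types are much richer.
interface Key {
    int getCardinality();
    String getName();
}

class BookException extends Exception {
    BookException(String message) {
        super(message);
    }
}

final class SearchRequest {
    private final String queryText;

    SearchRequest(String queryText) {
        this.queryText = queryText;
    }

    // Hypothetical accessor for the raw query string.
    String getQueryText() {
        return queryText;
    }
}

// A SearchResult is still a Key, but also remembers the request that produced it.
// getRequest() is a hypothetical member, not part of JSword.
interface SearchResult extends Key {
    SearchRequest getRequest();
}

interface Searcher {
    SearchResult search(SearchRequest request) throws BookException;
}

// Toy searcher that always "finds" the same references, to show the calling pattern.
final class FixedSearcher implements Searcher {
    private final List<String> refs;

    FixedSearcher(List<String> refs) {
        this.refs = new ArrayList<>(refs);
    }

    @Override
    public SearchResult search(final SearchRequest request) throws BookException {
        if (request.getQueryText().isEmpty()) {
            throw new BookException("Empty query");
        }
        return new SearchResult() {
            @Override
            public int getCardinality() {
                return refs.size();
            }

            @Override
            public String getName() {
                return String.join("; ", refs);
            }

            @Override
            public SearchRequest getRequest() {
                return request;
            }
        };
    }
}
```

Code that only wants the hits can keep assigning the return value to a
Key; code that wants to highlight can ask the same object for the
request.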


> is parsing the query string itself and maybe using
> regex to insert the tagging, is this correct ?

The problem with converting the query into a regex is getting it right
or right enough. The difficulty is mapping Lucene search syntax onto
regex syntax. How do you do things like "Con* AND NOT Convert"? A regex
built from the query would probably ignore the AND NOT.
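
To illustrate, here is a deliberately naive conversion in plain Java.
It is a hypothetical sketch, not anything in JSword or Lucene: it drops
the boolean operators, turns each '*' into "\w*" and ORs the remaining
terms together. The negation is lost, so the excluded term gets
highlighted as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class NaiveQueryRegexSketch {

    // Deliberately naive: drops AND/OR/NOT, turns '*' into "\w*", and ORs the rest together.
    static Pattern toRegex(String query) {
        List<String> alternatives = new ArrayList<>();
        for (String word : query.trim().split("\\s+")) {
            if (word.equals("AND") || word.equals("OR") || word.equals("NOT")) {
                continue; // the operator, and with it the negation, is lost here
            }
            String[] parts = word.split("\\*", -1);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < parts.length; i++) {
                if (i > 0) {
                    sb.append("\\w*");
                }
                sb.append(Pattern.quote(parts[i]));
            }
            alternatives.add(sb.toString());
        }
        return Pattern.compile("\\b(?:" + String.join("|", alternatives) + ")\\b");
    }

    static String highlight(String text, String query) {
        return toRegex(query).matcher(text).replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        // prints: <b>Consider</b>: <b>Convert</b> not.
        System.out.println(highlight("Consider: Convert not.", "Con* AND NOT Convert"));
    }
}
```

Getting this right would mean reimplementing the semantics of Lucene's
query parser, including grouping, phrases and fuzzy terms.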

As I noted above, I think Lucene has to re-analyze the query and the
text all over again. I'd rather let Lucene do it than do it some
alternate way. As Lucene core changes, its highlighter changes to
keep up. If we had our own way, then changes to Lucene might break it.

The other thing is that languages like Thai and Chinese need custom  
parsing to determine word boundaries. No sense in duplicating the
effort.


> Or do you have any
> other idea ?

My suggestion is to evaluate a Lucene solution, by writing a  
standalone Lucene highlighter against the index, using the JSword  
library to see what it takes and whether any changes need to be done  
to the index creation to get it to work (like storing tokens with  
positional info.)

If we do need to change the index, we'll need to complete the  
versioning of the index so that we can notify the user that the index  
needs to be re-built.
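
A minimal sketch of what such a check might look like, in plain Java.
Everything here is an assumption made for illustration: the
"index.version" property name, the version numbers, and the idea of
keeping the version in a Properties file next to the index are not
taken from the JSword code.

```java
import java.util.Properties;

final class IndexVersionSketch {

    // Hypothetical: bumped whenever the index layout changes,
    // e.g. if token positions and offsets start being stored.
    static final int CURRENT_INDEX_VERSION = 2;

    // Hypothetical property name for the version recorded at index creation.
    static final String VERSION_KEY = "index.version";

    // Returns true when the index should be rebuilt before it is searched.
    static boolean needsRebuild(Properties indexMetadata) {
        String stored = indexMetadata.getProperty(VERSION_KEY);
        if (stored == null) {
            return true; // index was built before versioning existed
        }
        try {
            return Integer.parseInt(stored.trim()) < CURRENT_INDEX_VERSION;
        } catch (NumberFormatException e) {
            return true; // unreadable version: safest to rebuild
        }
    }
}
```

The application would call something like needsRebuild() before
searching and prompt the user to re-index when it returns true.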

>
>
> If parsing the query string itself, do you have any idea of library
> that able to do that ?

We parse the query string using the interface:  
o.c.j.index.query.QueryBuilder.

The concrete implementation of this is
o.c.j.index.lucene.LuceneQueryBuilder. In there you can see that we
split the query for ranges and for blurring, but other than that we
allow Lucene to do its own parsing.

That said, I'll accept pretty much anything you decide to do.

Working together to advance His Word,
	DM