[jsword-devel] Search Highlight
DM Smith
dmsmith at crosswire.org
Sat Apr 25 04:29:47 MST 2009
On Apr 25, 2009, at 1:06 AM, Tonny Kohar wrote:
> Hi,
>
> On Fri, Apr 24, 2009 at 6:32 PM, DM Smith <dmsmith at crosswire.org>
> wrote:
>>
>> On Apr 24, 2009, at 2:42 AM, Tonny Kohar wrote:
>>
>>> Hi,
>>>
>>> Is there any plan to implement lucene search highlight for the
>>> JSword
>>> engine/API ?
>>
>> Yes. See:
>> http://crosswire.org/bugs/browse/BD-27
>>
>> I have not bothered to work on it or to look at it. I welcome
>> someone to
>> pick this up.
>>
>
> I just look at the package org.crosswire.jsword.index (and its
> subpackage lucene, query, search), the implementation is returning Key
> for the search query.
Now you see why I haven't tackled it ;) and maybe why others haven't either.
Yes, the way search works in JSword is that the Lucene index is
searched and from each Lucene Document (i.e. a verse) the reference is
retrieved and is converted into a JSword Key. The result is a list of
hits. The list is then used to retrieve raw text from the module. This
then is assembled into an xml OSIS document (possibly by transforming
it from ThML, GBF or PlainText, and treating TEI as if it were OSIS.)
This is then transformed with xsl into xhtml and displayed in a
"browser".
> However, the Lucene search highlighter requires the
> return of content (rather than the key field) for intercepting the
> analyzer (token stream) to insert some tag/marking.
I've been lurking on lucene-dev for a while and it is my impression
that in order to highlight the text the analysis is redone on the
search result. Given that we index the module in its format but
transform it stepwise into a display format, it is not the original
format that needs to be analyzed but rather the intermediate OSIS or
final HTML.
If I understand Lucene highlighting correctly, Lucene uses the start and
end offsets of tokens to do the markup. I don't think JSword stores this
info in the module. It certainly wouldn't be useful. Right now our
analyzers are fed plain text; they don't parse xml. And to store
notes, headings, .... into fields, we get these as plain text, too.
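
For illustration, here is a minimal sketch in plain Java of what offset-based markup amounts to. It is not Lucene's highlighter: the class OffsetHighlighter, the Span type and the <b> tag are all made up for this example. It assumes the offsets refer to the very same string that is being marked up, which is exactly what does not hold when the indexed text and the displayed text differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: not part of Lucene or JSword.
class OffsetHighlighter {

    // Start (inclusive) and end (exclusive) character offsets of one matched token.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Wraps each span of the text in <b>...</b>. Spans are assumed not to overlap.
    static String highlight(String text, List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingInt((Span s) -> s.start));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Span s : sorted) {
            out.append(text, pos, s.start);           // untouched text before the match
            out.append("<b>").append(text, s.start, s.end).append("</b>");
            pos = s.end;
        }
        out.append(text.substring(pos));              // remainder after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String verse = "In the beginning God created the heaven and the earth.";
        List<Span> hits = new ArrayList<>();
        hits.add(new Span(17, 20));                   // offsets of "God" in this string
        System.out.println(highlight(verse, hits));
    }
}
```

If the offsets were computed against plain text but applied to OSIS or HTML, they would point at the wrong characters, which is why the analysis would have to be redone on the text that is actually displayed.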
>
>
> So I guess the only way to implement search highlight for JSword
> without API changes
I don't have a problem with changing the API when it is appropriate.
Now that Alkitab, FireBible and BibleDesktop are the declared users of
it, I think we need to get agreement among ourselves on how to make
such changes. While anyone is free to use JSword for their own uses,
my feeling is that, unless there is declared usage of JSword that is
visible to the people on this list, they are left out of the discussion.
This is the method in o.c.j.index.search.Searcher:
    Key search(SearchRequest request) throws BookException;
Feel free to deprecate this and add something like:
    SearchResult getSearchResult(SearchRequest request) throws BookException;
Or change it to:
    SearchResult search(SearchRequest request) throws BookException;
making SearchResult extend Key.
A similar change would need to be made in o.c.j.index.query.Query.
This would have been a better way in the first place.
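
A rough sketch of the second option, with SearchResult extending Key. Everything here is a stand-in: Key, SearchRequest and BookException are reduced to stubs so the example compiles on its own, and getRequest, ApiSketch and fixedSearcher are hypothetical names, not existing JSword API.

```java
// Stubs standing in for the real JSword types; only the shape matters here.
interface Key {
    String getName();
}

class SearchRequest {
    final String query;

    SearchRequest(String query) {
        this.query = query;
    }
}

class BookException extends Exception {
    BookException(String msg) {
        super(msg);
    }
}

// Hypothetical: a result that is still a Key but can carry extra search data.
interface SearchResult extends Key {
    SearchRequest getRequest();
}

interface Searcher {
    // Source code that assigns the return value to a Key still compiles.
    SearchResult search(SearchRequest request) throws BookException;
}

class ApiSketch {
    // A toy Searcher that always "finds" the given reference.
    static Searcher fixedSearcher(final String reference) {
        return new Searcher() {
            public SearchResult search(final SearchRequest request) throws BookException {
                if (request.query.isEmpty()) {
                    throw new BookException("empty query");
                }
                return new SearchResult() {
                    public String getName() { return reference; }
                    public SearchRequest getRequest() { return request; }
                };
            }
        };
    }

    public static void main(String[] args) throws BookException {
        // An existing caller that only knows about Key keeps working.
        Key key = fixedSearcher("Gen 1:1").search(new SearchRequest("beginning"));
        System.out.println(key.getName());
    }
}
```

One point to weigh when choosing between the two options: changing the return type in place keeps existing source compiling, but already compiled callers would need to be rebuilt, because the return type is part of the method descriptor in the class file. Adding a new method and deprecating the old one avoids that.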
> is parsing the query string itself and maybe using
> regex to insert the tagging, is this correct ?
The problem with converting the query into a regex is getting it right
or right enough. The difficulty is mapping Lucene search syntax onto
regex syntax. How do you do things like "Con* AND NOT Convert"? A regex
translation would probably ignore the AND NOT.
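
A small sketch of the pitfall described above, using only java.util.regex. NaiveRegexHighlight is a hypothetical helper, not anything in JSword or Lucene. It translates the single wildcard term Con* into a regex and has no way to express the AND NOT clause, so when applied to arbitrary text it also marks Convert. It makes no attempt to reproduce how Lucene's analyzers normalize words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a naive query-to-regex translation.
class NaiveRegexHighlight {

    // Translates one term, optionally ending in '*', into a whole-word regex.
    static Pattern termToPattern(String term) {
        boolean wildcard = term.endsWith("*");
        String prefix = wildcard ? term.substring(0, term.length() - 1) : term;
        String tail = wildcard ? "\\w*" : "";
        return Pattern.compile("\\b" + Pattern.quote(prefix) + tail + "\\b");
    }

    // Wraps every match of the translated term in <b>...</b>.
    static String highlight(String text, String term) {
        Matcher m = termToPattern(term).matcher(text);
        return m.replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        // The query was "Con* AND NOT Convert", but only "Con*" survives the translation.
        System.out.println(highlight("Consider and Convert", "Con*"));
        // prints: <b>Consider</b> and <b>Convert</b>
    }
}
```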
As I noted above, I think Lucene has to re-analyze the query and the
text all over again. I'd rather have Lucene do it than do it some
alternate way. As Lucene core changes, its highlighter changes to
keep up. If we had our own way, then changes to Lucene might break it.
The other thing is that languages like Thai and Chinese need custom
parsing to determine word boundaries. No sense in duplicating the
effort.
> Or do you have any
> other idea ?
My suggestion is to evaluate a Lucene solution, by writing a
standalone Lucene highlighter against the index, using the JSword
library to see what it takes and whether any changes need to be done
to the index creation to get it to work (like storing tokens with
positional info.)
If we do need to change the index, we'll need to complete the
versioning of the index so that we can notify the user that the index
needs to be re-built.
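
As a purely hypothetical sketch of what such a version check could look like (JSword's actual index versioning may be designed quite differently), the idea is to record a version number alongside the index and compare it with what the running code expects. The property name index.version, the class IndexVersionCheck and the value of EXPECTED_VERSION are assumptions made up for this example.

```java
import java.util.Properties;

// Hypothetical sketch: not JSword's actual index versioning.
class IndexVersionCheck {

    // Version the running code expects; bumped whenever the index layout changes.
    static final int EXPECTED_VERSION = 2;

    // Returns true when the stored version is missing, unreadable,
    // or different from the expected one.
    static boolean needsRebuild(Properties indexMetadata) {
        String stored = indexMetadata.getProperty("index.version");
        if (stored == null) {
            return true;
        }
        try {
            return Integer.parseInt(stored.trim()) != EXPECTED_VERSION;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Properties meta = new Properties();
        meta.setProperty("index.version", "1");   // an index built by older code
        if (needsRebuild(meta)) {
            System.out.println("The search index is out of date and needs to be re-built.");
        }
    }
}
```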
>
>
> If parsing the query string itself, do you have any idea of a library
> that is able to do that?
We parse the query string using the interface
o.c.j.index.query.QueryBuilder.
The concrete implementation of this is
o.c.j.index.lucene.LuceneQueryBuilder. In there you can see that we
split the query for ranges and for blurring, but other than that we
allow Lucene to do its own parsing.
That said, I'll accept pretty much anything you decide to do.
Working together to advance His Word,
DM