[jsword-devel] Search Highlight

Mon Apr 27 02:42:33 MST 2009

Hi,

On Sat, Apr 25, 2009 at 6:29 PM, DM Smith <dmsmith at crosswire.org> wrote:
> As I noted above, I think Lucene has to re-analyze the query and the text
> all over again. I'd rather Lucene does it rather than doing it an alternate
> way. As Lucene core changes, its highlighter changes to keep up. If we had
> our own way, then changes to Lucene might break it.
> The other thing is that languages like Thai and Chinese need custom parsing
> to determine word boundaries. No sense in reduplicating the effort.
>
> Or do you have any
> other idea ?
>
> My suggestion is to evaluate a Lucene solution, by writing a standalone
> Lucene highlighter against the index, using the JSword library to see what
> it takes and whether any changes need to be done to the index creation to
> get it to work (like storing tokens with positional info.)
> If we do need to change the index, we'll need to complete the versioning of
> the index so that we can notify the user that the index needs to be
> re-built.

My initial finding is that no API change seems to be needed. All it
takes is a new package, e.g. o.c.jsword.index.highlight, and inside
that package a static/factory/builder Highlight class that accepts
raw text, OSIS XML, or HTML output.

Here is some simple code that uses the Lucene Highlighter (it wraps/tags
the matching output as <b>some text</b>):

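// Imports use the Lucene 2.x package names; the Highlighter lives in the
// contrib highlighter jar. The method below goes inside a test class.
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public void testHighlight() {
    String field = "runtime";
    // Sample query only; searchString was not defined in the first draft.
    String searchString = "light";

    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser(field, analyzer);
    parser.setAllowLeadingWildcard(true);

    // The text to highlight (Genesis 1:1-5, plain text).
    String text = "In the beginning God created the heaven and the earth. "
        + "And the earth was without form, and void; "
        + "and darkness was upon the face of the deep. "
        + "And the Spirit of God moved upon the face of the waters. "
        + "And God said, Let there be light: and there was light. "
        + "And God saw the light, that it was good: "
        + "and God divided the light from the darkness. "
        + "And God called the light Day, and the darkness he called Night. "
        + "And the evening and the morning were the first day";

    try {
        // Score fragments of the text against the parsed query.
        Query q = parser.parse(searchString);
        Highlighter highlighter = new Highlighter(new QueryScorer(q));

        // Re-analyze the text so the highlighter can locate matching tokens.
        TokenStream tokenstream = analyzer.tokenStream(field, new StringReader(text));

        // Take the 2 best fragments, joined with "...".
        String summary = highlighter.getBestFragments(tokenstream, text, 2, "...");
        System.out.println("summary : " + summary);
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}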

So the steps would be:
- leave the index/query code as it is; it returns a list of keys
- based on the keys, retrieve either the raw text or the OSIS XML
- pass that raw text/OSIS XML into the static highlight method
- pass the output of the highlight method into the XSLT
- display as usual
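
To make the XSLT step a bit more concrete, here is a rough sketch using
only the JDK's javax.xml.transform. It is not JSword's actual stylesheet:
the class name HighlightToHtml, the <verse> input element and the inline
stylesheet are all made up for illustration. The only point is that the
<b> tags added by the highlighter need a template so they survive the
transformation.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class HighlightToHtml {
    // Hypothetical stylesheet: maps <verse> to <p> and passes <b> through.
    private static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"verse\"><p><xsl:apply-templates/></p></xsl:template>"
      + "<xsl:template match=\"b\"><b><xsl:apply-templates/></b></xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms already-highlighted XML into an HTML fragment.
    public static String toHtml(String highlightedXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(highlightedXml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Input as it might look after the highlight step (assumed shape).
        System.out.println(toHtml("<verse>Let there be <b>light</b></verse>"));
    }
}
```

In JSword itself the existing stylesheets would presumably just get one
extra template for whatever tag the highlighter emits.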

If you are OK with this approach, then what is needed to complete it is:
- maybe use the existing JSword analyzer (LuceneAnalyzer); it currently
accepts books, which should be no problem I guess
- the text to be highlighted: the example above uses plain text, so it
needs to be replaced with, I don't know, whatever you think is a good
source (raw text, OSIS XML or HTML output). I prefer either raw text
or OSIS XML. This input source would then be tagged by the Lucene
highlighter, and the XSLT can do the work of transforming it into
e.g. HTML output
- the tokenStream then needs to be replaced with something that is able
to tokenize raw text or XML
- multiple languages (Chinese, Thai, etc.): I am not familiar with
this; do you have any idea what is involved with those languages?
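
On the raw text vs XML point, the concern is that a highlighter which
sees the markup as text could tag a match inside an attribute and break
the XML. Below is a small sketch of the idea of highlighting only the
text between tags. Note that it uses a plain regex in place of the
Lucene highlighter, just to show the tag-skipping part. The class name
MarkupAwareHighlight and the sample markup are hypothetical, and it
assumes well-formed XML with no '>' inside attribute values, comments
or CDATA.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkupAwareHighlight {
    // Group 1: a whole tag. Group 2: a run of text between tags.
    private static final Pattern TAG_OR_TEXT =
        Pattern.compile("(<[^>]*>)|([^<]+)");

    // Wraps whole-word matches of the terms in <b>..</b>, text runs only.
    public static String highlight(String xml, List<String> terms) {
        if (terms.isEmpty()) {
            return xml;
        }
        // Build an alternation of the literal (quoted) terms.
        StringBuilder alt = new StringBuilder();
        for (String term : terms) {
            if (alt.length() > 0) {
                alt.append('|');
            }
            alt.append(Pattern.quote(term));
        }
        Pattern words = Pattern.compile("\\b(" + alt + ")\\b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

        StringBuilder out = new StringBuilder();
        Matcher m = TAG_OR_TEXT.matcher(xml);
        while (m.find()) {
            if (m.group(1) != null) {
                out.append(m.group(1)); // markup is copied untouched
            } else {
                out.append(words.matcher(m.group(2)).replaceAll("<b>$1</b>"));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative markup only, not real OSIS content.
        String xml = "<verse>Let there be <seg type=\"x-light\">light</seg>.</verse>";
        System.out.println(highlight(xml, Arrays.asList("light")));
    }
}
```

The "light" inside the type attribute is left alone and only the text
node is tagged. A real version would still have to go through the
analyzer so that stemming and the language-specific word boundaries
match what the index does.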

Cheers
Tonny Kohar
--
Alkitab Bible Study
http://www.kiyut.com/products/alkitab/index.html