<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Apr 25, 2009, at 1:06 AM, Tonny Kohar wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Hi,<br><br>On Fri, Apr 24, 2009 at 6:32 PM, DM Smith &lt;<a href="mailto:dmsmith@crosswire.org">dmsmith@crosswire.org</a>&gt; wrote:<br><blockquote type="cite"><br></blockquote><blockquote type="cite">On Apr 24, 2009, at 2:42 AM, Tonny Kohar wrote:<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite"><blockquote type="cite">Hi,<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite"><br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">Is there any plan to implement lucene search highlight for the JSword<br></blockquote></blockquote><blockquote type="cite"><blockquote type="cite">engine/API ?<br></blockquote></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Yes. See:<br></blockquote><blockquote type="cite"><a href="http://crosswire.org/bugs/browse/BD-27">http://crosswire.org/bugs/browse/BD-27</a><br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">I have not bothered to work on it or to look at it. I welcome someone to<br></blockquote><blockquote type="cite">pick this up.<br></blockquote><blockquote type="cite"><br></blockquote><br>I just look at the package org.crosswire.jsword.index (and its<br>subpackage lucene, query, search), the implementation is returning Key<br>for the search query.</div></blockquote><div><br></div>Now you see why I haven't tackled it ;) Maybe that is why others have not either.<br><div><br></div>Yes, the way search works in JSword is that the Lucene index is searched, and from each Lucene Document (i.e. a verse) the reference is retrieved and converted into a JSword Key. The result is a list of hits. The list is then used to retrieve raw text from the module. 
This is then assembled into an XML OSIS document (possibly by transforming it from ThML, GBF or PlainText, and treating TEI as if it were OSIS). That document is then transformed with XSL into XHTML and displayed in a "browser".</div><div><br></div><div><br><blockquote type="cite"><div> However the lucene search highlight require<br>return of content (rather than the key field) for intercepting the<br>analyzer (token stream) to insert some tag/marking.</div></blockquote><div><br></div>I've been lurking on lucene-dev for a while, and it is my impression that in order to highlight the text the analysis is redone on the search result. Given that we index the module in its own format but transform it stepwise into a display format, it is not the original format that needs to be analyzed but rather the intermediate OSIS or final HTML.</div><div><br></div><div>If I understand Lucene highlighting correctly, Lucene uses the start and end offsets of tokens to do the markup. I don't think JSword stores this info in the module. It certainly wouldn't be useful. Right now our analyzers are fed plain text; they don't parse XML. And to store notes, headings, ... in fields, we get these as plain text, too.</div><div><br><blockquote type="cite"><div><br><br>So I guess the only way to implement search highlight for JSword<br>without API changes</div></blockquote><div><br></div>I don't have a problem with changing the API when it is appropriate. Now that Alkitab, FireBible and BibleDesktop are the declared users of it, I think we need to get agreement among ourselves on how to make such changes. While anyone is free to use JSword for their own uses, my feeling is that, unless there is declared usage of JSword that is visible to the people on this list, they are left out of the discussion. 
<br><div><br></div><div>This is the method in o.c.j.index.search.Searcher:</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; ">Key search(SearchRequest request) <span style="color: rgb(127, 0, 85); ">throws</span> BookException;</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; "><br></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; ">Feel free to deprecate this and add something like:</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; ">SearchResult getSearchResult(SearchRequest request) throws BookException.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; "><br></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; ">Or change it to</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; ">SearchResult search(SearchRequest request) throws BookException;</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; ">making SearchResult extend Key;</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; "><br></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; ">A similar change would need to be done in o.c.j.index.query.Query.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: 
normal normal normal 11px/normal Monaco; "><br></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; ">This would have been a better way in the first place.</div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; "><br></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; "><br></div><blockquote type="cite"><div> is parsing the query string itself and maybe using<br>regex to insert the tagging, is this correct ? </div></blockquote><div><br></div>The problem with converting the query into a regex is getting it right, or right enough. The difficulty is mapping Lucene search syntax onto regex syntax. How do you handle things like "Con* AND NOT Convert"? A regex approach would probably ignore the AND NOT.<br><div><br></div>As I noted above, I think Lucene has to re-analyze the query and the text all over again. I'd rather let Lucene do it than do it some alternate way. As the Lucene core changes, its highlighter changes to keep up. If we had our own way, then changes to Lucene might break it.</div><div><br></div><div>The other thing is that languages like Thai and Chinese need custom parsing to determine word boundaries. 
No sense in duplicating that effort.</div><div><br><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; font: normal normal normal 11px/normal Monaco; "><br></div><blockquote type="cite"><div>Or do you have any<br>other idea ?</div></blockquote><div><br></div>My suggestion is to evaluate a Lucene solution by writing a standalone Lucene highlighter against the index, using the JSword library, to see what it takes and whether any changes need to be made to the index creation to get it to work (like storing tokens with positional info).</div><div><br></div><div>If we do need to change the index, we'll need to complete the versioning of the index so that we can notify the user that the index needs to be rebuilt.</div><div><br><blockquote type="cite"><div><br><br>If parsing the query string itself, do you have any idea of library<br>that able to do that ?</div></blockquote><div><br></div>We parse the query string using the interface o.c.j.index.query.QueryBuilder.</div><div><br></div><div>The concrete implementation of this is o.c.j.index.lucene.LuceneQueryBuilder. In there you can see that we split the query for ranges and for blurring, but other than that we allow Lucene to do its own parsing.</div><div><br></div><div>That said, I'll accept pretty much anything you decide to do.</div><div><br></div><div>Working together to advance His Word,</div><div><span class="Apple-tab-span" style="white-space:pre">        </span>DM<br></div></body></html>