<div dir="ltr">Hi Martin<div><br></div><div>My immediate requirement is for 1 addition (heading stems) to the index. I could add a new method in the IndexPolicyAdapter if you want. The others were suggested by DM. As I've said quite a few times now, I'd be happy to remove these, or also put these in the policy adapter. They are deliberately additions, to ensure backwards compatibility of the index (i.e. I deliberately ensured that I would not break AB!).</div>
<div><br></div><div>As for the difference in requirements, it's not that far off. As you'll know from some emails in 2012, STEP will be available on mobile devices (as well as Desktops), so Sijo's code will come in useful for this (i.e. detecting that new indexes are required, downloading them, etc.)</div>
<div><br></div><div>Chris</div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 21 April 2014 12:11, Martin Denham <span dir="ltr"><<a href="mailto:mjdenham@gmail.com" target="_blank">mjdenham@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>I don't want to be seen as a 'stick-in-the-mud' regarding index improvements so could I emphasize that STEP and AB requirements are very different and I suppose most desktop apps like BD are probably somewhere in the middle:</div>
<div><br></div><div>AB</div><div>Indexes all over the world</div><div>Low powered devices</div><div>Need small indexes</div><div>Need to have fast index generation</div><div>Need low memory and storage requirements</div>
<div>
I have no direct access to devices </div><div>Users normally have no technical experience</div><div>Users very happy with current search functionality</div><div>Need backward compatibility</div><div>Download speed very slow in general 2G/3G and costs money</div>
<div>Frequent connection problems depending on country, provider</div><div><br></div><div>STEP</div><div>Indexes at single centralized location</div><div>High powered server</div><div>Index size not a factor</div><div>Index generation only occurs once</div>
<div>Lots of RAM and disk space available</div><div>Experienced dev-op (Chris)</div><div>Regeneration simple</div><div>Pressing for enhanced functionality</div><div>No need for backward compatibility</div><div>Instant access to server</div>
<div><br></div><div>I realise Sijo is implementing upgrade functionality, but even then it will not be a simple upgrade for AB given the architecture, whereas STEP would not even need to use the new upgrade code.</div><div>
<br></div><div><b>Questions</b></div><div>I haven't followed all of the preceding discussion, partly because the finer details of Lucene have beaten me, so could I ask for clarification of some of the changes. There seem to be 3 changes being discussed:</div>
<div><br></div><div><i>New code to support index upgrades</i> (Sijo)</div><div>I understand most of this. It looks useful. I am hoping to submit a simple change/suggestion for the download index method compatible with this. Index upgrades should have a deprecation period when old indexes work but new indexes are available for download or generation.</div>
<div><br></div><div><i>Changes to the generated indexes to support different search methods</i></div><div>I got a bit lost in the detail here. Is this to allow enhanced STEP-specific functionality, or is it a required change for basic JSword searches? If it is for STEP, could this be handled via IndexPolicyAdapter?</div>
<div><br></div><div><i>Upgrade of Lucene</i></div><div>I realise STEP and AB have different leanings on this because of different architectures. Which version of Lucene is it currently planned to move to? Various versions were discussed, some of which have a modified API and incompatible indexes and some of which don't. If the target version of Lucene is incompatible then DM's suggestion will hopefully work, but will it be possible to isolate the API differences sufficiently to use the plugin architecture?</div>
<div><br></div><div>Regards</div><span class="HOEnZb"><font color="#888888"><div>Martin</div><br></font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">On 21 April 2014 04:28, Sijo Cherian <span dir="ltr"><<a href="mailto:sijo.cherian@gmail.com" target="_blank">sijo.cherian@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Thanks DM for explaining this far. The plugin configuration is a nice way to customize the index.<br>
</div>As we extend our index fields, we should make it easier for the API user to see all the index fields present, and the analyzer used for each.<br>
<br>I am working on getting lucene upgrade functionality done.<br></div><div class="gmail_extra"><div><div><br><br><div class="gmail_quote">On Sun, Apr 20, 2014 at 3:22 PM, DM Smith <span dir="ltr"><<a href="mailto:dmsmith@crosswire.org" target="_blank">dmsmith@crosswire.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word">He is risen!<div><br></div><div>I haven't pulled the push request, atm. I think we need a bit more discussion. We are close.</div>
<div><br></div><div>Indexing/searching is specified via interfaces and implemented via plugins: IndexManager.plugin, QueryBuilder.plugin, QueryDecorator.plugin and Searcher.plugin. The AnalyzerFactory.properties that Sijo mentioned is also a critical part. There may be a few others.</div>
<div><br></div><div>There is no problem with AndBible having a different Index implementation (i.e. the current one) if we create the new one with a different name. AndBible will need to have a jar with the old implementation. JSword will provide the new implementation.</div>
<div><br></div><div>This plugin mechanism was provided to be able to swap out one implementation for another during development, but can serve this purpose well.<span><font color="#888888"><br><div><br></div>
</font></span><div><span><font color="#888888">DM</font></span><div><div><br><div><div><br></div><div>On Apr 20, 2014, at 3:39 AM, Chris Burrell <<a href="mailto:chris@burrell.me.uk" target="_blank">chris@burrell.me.uk</a>> wrote:</div>
<br><blockquote type="cite"><p dir="ltr">Hi Sijo</p><p dir="ltr">That wouldn't do what I want. I need the non stemmed body content and a separate stemmed heading field.</p><p dir="ltr">Even if I did want the stemmed body, I would want it in addition to the non stemmed body. </p>
<p dir="ltr">As I said, happy to remove the other ones. They were put in at DM s suggestion. </p><p dir="ltr">Chris<br>
</p>
<div class="gmail_quote">On 20 Apr 2014 03:09, "Sijo Cherian" <<a href="mailto:sijo.cherian@gmail.com" target="_blank">sijo.cherian@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr"><div>Chris,<br></div><div>Since we already have a language-based Analyzer configuration, you could provide a custom jsword/src/main/resources/AnalyzerFactory.properties in STEP and add custom config for English like this:<br>
<br>en.Analyzer=org.crosswire.jsword.index.lucene.analysis.ConfigurableSnowballAnalyzer<br><br>This will stem the "content" field, both during indexing & query. Can you override prop files in your classpath, easily?<br>
<br>Regarding your requirement to stem the heading: since the current impl for "heading" uses the default analyzer, you will have to change the prop "Default.Analyzer" to snowball, but that will have a bigger impact: it uses snowball for all other fields.<br>
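<div>To illustrate why changing "Default.Analyzer" has the wider impact, here is a minimal sketch of a per-language lookup with a default fallback. It assumes the properties file is resolved as "language key first, then Default.Analyzer"; that lookup order, the class name AnalyzerLookupSketch, and the org.example.HypotheticalSimpleAnalyzer value are illustrative assumptions, not taken from JSword. Only the en.Analyzer line comes from this thread.</div>

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class AnalyzerLookupSketch {
    // Hypothetical override file contents; only the en.Analyzer line is from the thread.
    static final String OVERRIDE =
        "Default.Analyzer=org.example.HypotheticalSimpleAnalyzer\n"
      + "en.Analyzer=org.crosswire.jsword.index.lucene.analysis.ConfigurableSnowballAnalyzer\n";

    // Parse properties text, as would happen for a file found on the classpath.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Assumed lookup order: language-specific key first, then Default.Analyzer.
    static String analyzerFor(Properties props, String lang) {
        String specific = props.getProperty(lang + ".Analyzer");
        return specific != null ? specific : props.getProperty("Default.Analyzer");
    }

    public static void main(String[] args) {
        Properties props = load(OVERRIDE);
        System.out.println("en -> " + analyzerFor(props, "en"));
        System.out.println("fr -> " + analyzerFor(props, "fr"));
    }
}
```

<div>Under that assumption, an en.Analyzer entry only affects English, while anything resolved through Default.Analyzer changes together.</div>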
<br></div><br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sat, Apr 19, 2014 at 4:14 AM, Chris Burrell <span dir="ltr"><<a href="mailto:chris@burrell.me.uk" target="_blank">chris@burrell.me.uk</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p dir="ltr">I don't mind configuration so long as these indexes are stored separately per app.</p><p dir="ltr">STEP relies on stemming, and in the places where it uses it we can't ask the user, nor would it make sense to. So things would break and be quite hard to debug.<span><font color="#888888"><br>
Chris</font></span></p><div>
<div class="gmail_quote">On 19 Apr 2014 06:13, "Sijo Cherian" <<a href="mailto:sijo.cherian@gmail.com" target="_blank">sijo.cherian@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr"><div><div><br></div>Great discussion. isProgress.<br><br></div>I am still pondering all the benefits of double indexing the entire content.<br><br><div><div class="gmail_extra">For specialized users who don't want stemming to factor into their searching: can we provide an API for them to specify params like noStemming, noLowercase etc. at the time of indexing on a per-book basis, and persist that metadata in a property file? We would then use exactly those properties during query analysis. These users probably won't want auto-reindexing on a major JSword upgrade. </div>
<div class="gmail_extra"><br></div><div class="gmail_extra">Easter is almost here!<br></div><div class="gmail_extra">-sijo<br></div><div class="gmail_extra"><div class="gmail_quote">On Thu, Apr 17, 2014 at 8:40 PM, DM Smith <span dir="ltr"><<a href="mailto:dmsmith@crosswire.org" target="_blank">dmsmith@crosswire.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="word-wrap:break-word"><br><div><div><div>On Apr 17, 2014, at 12:09 PM, Chris Burrell <<a href="mailto:chris@burrell.me.uk" target="_blank">chris@burrell.me.uk</a>> wrote:</div>
<br></div><div><blockquote type="cite"><div dir="ltr">Hello<div><br></div><div>STEP uses stemming to improve search results, in some queries (whether on Sword modules or otherwise).</div></div></blockquote><div><br>
</div></div>Stemming is very useful. On occasion there is a need for a non-stemmed search, especially for theological purposes. But for general-purpose searching it should be the default.</div><div><br></div><div>I've sometimes thought it'd be good to double index: stemmed and full word.</div>
<div><div><br><blockquote type="cite"><div dir="ltr"><div><br></div><div>There are currently 2 limitations in JSword, both of which could easily be fixed. Please let me know if you have concerns around me implementing both.</div>
<div><br></div><div>a- the frontend can't extend/control the use of indexes. I'm suggesting we add a registerFieldIndexer(fieldIndexer) with a simple interface: indexField(doc, osis). This would allow frontends to specify its own indexing. This would allow a frontend to index new things, or enable term vectors / store fields, etc. </div>
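<div>A minimal sketch of what option a- might look like. Only the names registerFieldIndexer and indexField(doc, osis) come from this message; the Doc and IndexBuilder classes are hypothetical stand-ins (for a Lucene document and for whatever JSword class would own the registry), the OSIS input is simplified to a String, and the "headingStem" field with its lowercasing is a placeholder rather than real stemming.</div>

```java
import java.util.Arrays;

public class FieldIndexerSketch {
    /** Hypothetical stand-in for a Lucene document: records "field=value" lines in order. */
    static final class Doc {
        private final StringBuilder fields = new StringBuilder();

        void add(String field, String value) {
            fields.append(field).append('=').append(value).append('\n');
        }

        String dump() {
            return fields.toString();
        }
    }

    /** The callback proposed in the thread: indexField(doc, osis). */
    interface FieldIndexer {
        void indexField(Doc doc, String osis);
    }

    /** Hypothetical stand-in for the index builder that would own the registry. */
    static final class IndexBuilder {
        private FieldIndexer[] indexers = new FieldIndexer[0];

        void registerFieldIndexer(FieldIndexer indexer) {
            FieldIndexer[] grown = Arrays.copyOf(indexers, indexers.length + 1);
            grown[indexers.length] = indexer;
            indexers = grown;
        }

        Doc build(String osis) {
            Doc doc = new Doc();
            doc.add("content", osis); // core field, always written
            for (FieldIndexer indexer : indexers) {
                indexer.indexField(doc, osis); // frontend-specific additions
            }
            return doc;
        }
    }

    public static void main(String[] args) {
        IndexBuilder builder = new IndexBuilder();
        // Hypothetical frontend hook; lowercasing is only a placeholder for stemming.
        builder.registerFieldIndexer((doc, osis) -> doc.add("headingStem", osis.toLowerCase()));
        System.out.print(builder.build("Blessed Are The Meek").dump());
    }
}
```

<div>The core fields stay untouched, so a frontend that registers nothing would get the same index as today.</div>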
</div></blockquote><div><br></div></div>I'd really rather that we didn't go down this route. I don't mind plugin architecture as a way to experiment with different techniques, but I'd really rather that we all benefit from the changes.</div>
<div><br></div><div><div><blockquote type="cite"><div dir="ltr">
<div><br></div><div>b- Extend the LuceneIndex to have a stemmed version of the heading. We could replace the existing index, but that would mean all frontends will require re-indexing.</div></div></blockquote><div><br></div>
</div>I think the same manner that we index the main verse text should be applied to all text: intro, heading and verse text.</div><div><br></div><div><div><blockquote type="cite"><div dir="ltr"><div><br></div><div>
c- Had JSword been configured to 'STORE' the content of some fields, I would have used that for headings. For example, if the headings is stored in the index, STEP would not need to do an osis extract and XML transform to display to the user. It could come straight from the index. Two possibilities here: change the existing index field configuration, or duplicate into a different field.</div>
<div><br></div></div></blockquote><div><br></div></div>I think we should make store an option, possibly the standard.</div><div><br></div><div>Right now the way we do the index prevents us from using Lucene to highlight the search hit. If STORE is what would enable that, then I'd be in favor of making STORE standard. I wonder if our stripping the text to not include OSIS before indexing will frustrate this change.</div>
<div><br></div><div>It still should be an option for the sake of devices that are disk limited.</div><div><div><br><blockquote type="cite"><div dir="ltr">d- the other side of c- is that ideally multiple headings should be stored in multiple entries to the same field, rather than a concatenation of the field (doesn't much matter if it's only ANALYZED)</div>
</blockquote><div><br></div></div>Some verses have headings in the middle of the verse. Don't make the mistake of assuming an order of heading. Or that heading contains only pre-verse material or all pre-verse material.</div>
<div><div><br><blockquote type="cite"><div dir="ltr">
<div><br></div><div><b>I only need one of a- or b- to be able to progress. Happy to do either. I don't need c- because I've worked around, but it would have been nice to have some control over that. </b></div><div>
<br></div><div>pros & cons:</div><div>a- more extensible in the future, other frontends don't benefit from enhancements</div><div>b- solves an immediate problem, but impacts all frontends (i.e. space used in index).</div>
<div><br></div><div>The only other bit in my mind is whether we need to ensure index-cross-application compatibility. I suspect some of this will tie in with the good work that Sijo has done on index management.</div></div>
</blockquote><div><br></div></div>The index management will be more critical with such a change. I've talked about having a manifest which defines the characteristics of the index. If we share an index created by two different systems, it will be important to "know" what an index supports.</div>
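<div>As a rough sketch of the manifest idea, assuming it is persisted as a plain properties file next to the index: every key name below (created.by, heading.stemmed, heading.stored) and the class IndexManifestSketch are hypothetical, chosen only to show how an application could check what a shared index supports before querying it.</div>

```java
import java.util.Properties;

public class IndexManifestSketch {
    // All keys are hypothetical; the thread only says a manifest should
    // define the characteristics of an index.
    static Properties describe(String createdBy, boolean stemmedHeadings, boolean storedHeadings) {
        Properties manifest = new Properties();
        manifest.setProperty("created.by", createdBy);
        manifest.setProperty("heading.stemmed", Boolean.toString(stemmedHeadings));
        manifest.setProperty("heading.stored", Boolean.toString(storedHeadings));
        return manifest;
    }

    // An absent key is treated as "not supported", so an index written
    // before a feature existed is never assumed to have it.
    static boolean supports(Properties manifest, String feature) {
        return Boolean.parseBoolean(manifest.getProperty(feature, "false"));
    }

    public static void main(String[] args) {
        Properties fromStep = describe("STEP", true, true);
        Properties fromAndBible = describe("AndBible", false, false);
        System.out.println(supports(fromStep, "heading.stemmed"));
        System.out.println(supports(fromAndBible, "heading.stemmed"));
        System.out.println(supports(fromAndBible, "termvectors"));
    }
}
```

<div>A frontend that needs stemmed headings could then decide to regenerate or download an index instead of failing in a way that is hard to debug.</div>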
<div><br></div><div>One of the changes that is being worked on is the update to a more recent version of Lucene. This affects how stemming is done. The way we are doing it now is deprecated and dropped.</div><div><div>
<br><blockquote type="cite"><div dir="ltr"><div>
<br></div><div>Let me know what your preferences are.</div></div></blockquote><div><br></div></div>Progress not perfection. Shared, configurable changes.</div><div><br><blockquote type="cite"><div dir="ltr"><div>Chris</div>
<div><br></div></div>
_______________________________________________<br>jsword-devel mailing list<br><a href="mailto:jsword-devel@crosswire.org" target="_blank">jsword-devel@crosswire.org</a><br><a href="http://www.crosswire.org/mailman/listinfo/jsword-devel" target="_blank">http://www.crosswire.org/mailman/listinfo/jsword-devel</a><br>
</blockquote></div><br></div><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>Regards,<br>Sijo
</div></div></div>
<br></blockquote></div>
</div><br>
<br></blockquote></div><br><br clear="all"><br>-- <br>Regards,<br>Sijo
</div>
</blockquote></div>
</blockquote></div><br></div></div></div></div></div></blockquote></div><br><br clear="all"><br></div></div><span><font color="#888888">-- <br>Regards,<br>Sijo
</font></span></div>
<br></blockquote></div><br></div>
</div></div><br>
<br></blockquote></div><br></div>