[jsword-devel] Extending Lucene Indexes and stemming in particular

Chris Burrell chris at burrell.me.uk
Sat Apr 19 01:14:49 MST 2014


I don't mind configuration so long as these indexes are stored separately
per app.

STEP relies on stemming, and in the places where it uses it we can't ask the
user, nor would it make sense to. So things would break and be quite hard to
debug.
Chris
On 19 Apr 2014 06:13, "Sijo Cherian" <sijo.cherian at gmail.com> wrote:

>
> Great discussion. This is progress.
>
> I am still pondering all the benefits of double indexing the entire
> content.
>
> For specialized users who don't want stemming to factor into their searching:
> Can we provide an API for them to specify params like noStemming, noLowercase,
> etc. at indexing time on a per-book basis, and persist that metadata in a
> property file? The exact same properties would then be used during query
> analysis. These users probably won't want auto-reindexing on a major JSword
> upgrade.
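
To illustrate the per-book property file idea above, here is a minimal sketch using only java.util.Properties. The class name IndexOptions, the property keys, and the choice of defaults are assumptions made for illustration; none of them are part of JSword.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical per-book analysis options; not part of the JSword API. */
final class IndexOptions {
    final boolean stemming;
    final boolean lowercase;

    IndexOptions(boolean stemming, boolean lowercase) {
        this.stemming = stemming;
        this.lowercase = lowercase;
    }

    /** Write the options next to the index at indexing time. */
    void save(Path file) throws IOException {
        Properties p = new Properties();
        p.setProperty("stemming", Boolean.toString(stemming));
        p.setProperty("lowercase", Boolean.toString(lowercase));
        try (Writer w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            p.store(w, "index analysis options");
        }
    }

    /** Read the options back so query analysis matches how the book was indexed. */
    static IndexOptions load(Path file) throws IOException {
        Properties p = new Properties();
        try (Reader r = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            p.load(r);
        }
        // Assumed defaults: stemming and lowercasing on when a key is missing.
        return new IndexOptions(
                Boolean.parseBoolean(p.getProperty("stemming", "true")),
                Boolean.parseBoolean(p.getProperty("lowercase", "true")));
    }
}
```

The point of loading the same file at query time is that the query analyzer can be built from exactly the settings used when the book was indexed.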
>
> Easter is almost here!
> -sijo
> On Thu, Apr 17, 2014 at 8:40 PM, DM Smith <dmsmith at crosswire.org> wrote:
>
>>
>> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>>
>> Hello
>>
>> STEP uses stemming to improve search results, in some queries (whether on
>> Sword modules or otherwise).
>>
>>
>> Stemming is very useful. On occasion, there is a need for a non-stemmed
>> search. Especially for theological purposes. But for general purpose
>> searching it should be the default.
>>
>> I've sometimes thought it'd be good to double index: stemmed and full
>> word.
>>
>>
>> There are currently 2 limitations in JSword, both of which could easily
>> be fixed. Please let me know if you have concerns around me implementing
>> both.
>>
>> a- the frontend can't extend/control the use of indexes. I'm suggesting
>> we add a registerFieldIndexer(fieldIndexer) with a simple interface:
>> indexField(doc, osis). This would allow frontends to specify their own
>> indexing. This would allow a frontend to index new things, or enable term
>> vectors / store fields, etc.
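
The registerFieldIndexer(fieldIndexer) and indexField(doc, osis) names come from the proposal in a-. Everything else in the sketch below is an assumption: a plain Map stands in for the Lucene document so the example stays self-contained, and FieldIndexerRegistry and buildDocument are hypothetical names rather than existing JSword classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical callback a frontend would implement to add its own fields. */
interface FieldIndexer {
    // 'doc' is a stand-in for a Lucene document: field name -> list of values.
    void indexField(Map<String, List<String>> doc, String osis);
}

/** Hypothetical registry the index builder would consult for each entry. */
final class FieldIndexerRegistry {
    private final List<FieldIndexer> indexers = new ArrayList<>();

    void registerFieldIndexer(FieldIndexer indexer) {
        indexers.add(indexer);
    }

    /** Run every registered indexer over one entry's OSIS text. */
    Map<String, List<String>> buildDocument(String osis) {
        Map<String, List<String>> doc = new HashMap<>();
        for (FieldIndexer indexer : indexers) {
            indexer.indexField(doc, osis);
        }
        return doc;
    }
}
```

A frontend could then register a lambda such as `(doc, osis) -> doc.computeIfAbsent("myField", k -> new ArrayList<>()).add(osis)`, where "myField" is again only a placeholder.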
>>
>>
>> I'd really rather that we didn't go down this route. I don't mind plugin
>> architecture as a way to experiment with different techniques, but I'd
>> really rather that we all benefit from the changes.
>>
>>
>> b- Extend the LuceneIndex to have a stemmed version of the heading. We
>> could replace the existing index, but that would mean all frontends will
>> require re-indexing.
>>
>>
>> I think the same manner that we index the main verse text should be
>> applied to all text: intro, heading and verse text.
>>
>>
>> c- Had JSword been configured to 'STORE' the content of some fields, I
>> would have used that for headings. For example, if the headings are stored
>> in the index, STEP would not need to do an osis extract and XML transform
>> to display to the user. It could come straight from the index. Two
>> possibilities here: change the existing index field configuration, or
>> duplicate into a different field.
>>
>>
>> I think we should make store an option, possibly the standard.
>>
>> Right now the way we do the index prevents us from using Lucene to
>> highlight the search hit. If that is STORE, then I'd be in favor of making
>> STORE standard. I wonder if our stripping the text to not include OSIS
>> before indexing will frustrate this change.
>>
>> It still should be an option for the sake of devices that are disk
>> limited.
>>
>> d- the other side of c- is that ideally multiple headings should be
>> stored in multiple entries to the same field, rather than a concatenation
>> of the field (doesn't much matter if it's only ANALYZED)
>>
>>
>> Some verses have headings in the middle of the verse. Don't make the
>> mistake of assuming an order of heading. Or that heading contains only
>> pre-verse material or all pre-verse material.
>>
>>
>> *I only need one of a- or b- to be able to progress. Happy to do either.
>> I don't need c- because I've worked around, but it would have been nice to
>> have some control over that. *
>>
>> pros & cons:
>> a- more extensible in the future, other frontends don't benefit from
>> enhancements
>> b- solves an immediate problem, but impacts all frontends (i.e. space
>> used in index).
>>
>> The only other bit in my mind is whether we need to ensure
>> index-cross-application compatibility. I suspect some of this will tie in
>> with the good work that Sijo has done on index management.
>>
>>
>> The index management will be more critical with such a change. I've
>> talked about having a manifest which defines the characteristics of the
>> index. If we share an index created by two different systems, it will be
>> important to "know" what an index supports.
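
One way to picture such a manifest is a small record of who built the index and which features it carries, checked before a frontend relies on it. This is only a sketch under assumptions: IndexManifest, its fields, and the feature names used in the usage note are hypothetical and do not describe an existing JSword format.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical manifest describing what an index supports; not a JSword class. */
final class IndexManifest {
    private final String createdBy;
    private final Set<String> features;

    IndexManifest(String createdBy, Set<String> features) {
        this.createdBy = createdBy;
        this.features = new TreeSet<>(features); // defensive copy
    }

    String createdBy() {
        return createdBy;
    }

    /** True when the index offers every feature the calling frontend needs. */
    boolean supports(Set<String> required) {
        return features.containsAll(required);
    }
}
```

A frontend that needs, say, a "stemmed-heading" feature would call supports(...) on the shared index's manifest and re-index only when the check fails.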
>>
>> One of the changes that is being worked on is the update to a more recent
>> version of Lucene. This affects how stemming is done. The way we are doing
>> it now is deprecated and dropped.
>>
>>
>> Let me know what your preferences are.
>>
>>
>> Progress not perfection. Shared, configurable changes.
>>
>> Chris
>>
>> _______________________________________________
>> jsword-devel mailing list
>> jsword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>
>
>
> --
> Regards,
> Sijo
>