[jsword-devel] Extending Lucene Indexes and stemming in particular

Sijo Cherian sijo.cherian at gmail.com
Fri Apr 18 22:12:11 MST 2014


Great discussion. isProgress.

I am still pondering all the benefits of double indexing the entire content.

For specialized users who don't want stemming to factor into their searches:
can we provide an API that lets them specify parameters such as noStemming,
noLowercase, etc. at indexing time on a per-book basis, and persist that
metadata in a property file? The same properties would then be used during
query analysis. These users probably won't want auto-reindexing on a major
JSword upgrade.
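
A rough sketch of that idea, using only java.util.Properties. The class name
BookIndexOptions and the store/load methods are made up for illustration and
are not existing JSword API; the key names are just the parameter names
suggested above. Something along these lines could write the options when a
book is indexed and read them back before analyzing a query:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Properties;

/** Hypothetical per-book index settings; not part of JSword. */
final class BookIndexOptions {
    // Illustrative property keys, named after the parameters proposed above.
    static final String NO_STEMMING = "noStemming";
    static final String NO_LOWERCASE = "noLowercase";

    final boolean noStemming;
    final boolean noLowercase;

    BookIndexOptions(boolean noStemming, boolean noLowercase) {
        this.noStemming = noStemming;
        this.noLowercase = noLowercase;
    }

    /** Index time: persist the chosen options alongside the book's index. */
    void store(Writer out) {
        Properties props = new Properties();
        props.setProperty(NO_STEMMING, Boolean.toString(noStemming));
        props.setProperty(NO_LOWERCASE, Boolean.toString(noLowercase));
        try {
            props.store(out, "per-book index options");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Query time: read the options back; missing keys mean "stem and lowercase". */
    static BookIndexOptions load(Reader in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new BookIndexOptions(
            Boolean.parseBoolean(props.getProperty(NO_STEMMING, "false")),
            Boolean.parseBoolean(props.getProperty(NO_LOWERCASE, "false")));
    }
}
```

The query side would call load() on the book's property file and pick its
analyzer to match, so an index built without stemming is never queried with a
stemming analyzer. Defaulting missing keys to false keeps existing indexes,
which have no such file, on the current behaviour.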

Easter is almost here!
-sijo

On Thu, Apr 17, 2014 at 8:40 PM, DM Smith <dmsmith at crosswire.org> wrote:

>
> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>
> Hello
>
> STEP uses stemming to improve search results, in some queries (whether on
> Sword modules or otherwise).
>
>
> Stemming is very useful. On occasion, there is a need for a non-stemmed
> search. Especially for theological purposes. But for general purpose
> searching it should be the default.
>
> I've sometimes thought it'd be good to double index: stemmed and full
> word.
>
>
> There are currently 2 limitations in JSword, both of which could easily be
> fixed. Please let me know if you have concerns around me implementing both.
>
> a- the frontend can't extend/control the use of indexes. I'm suggesting we
> add a registerFieldIndexer(fieldIndexer) with a simple interface:
> indexField(doc, osis). This would allow frontends to specify their own
> indexing. This would allow a frontend to index new things, or enable term
> vectors / store fields, etc.
>
>
> I'd really rather that we didn't go down this route. I don't mind plugin
> architecture as a way to experiment with different techniques, but I'd
> really rather that we all benefit from the changes.
>
>
> b- Extend the LuceneIndex to have a stemmed version of the heading. We
> could replace the existing index, but that would mean all frontends will
> require re-indexing.
>
>
> I think the same manner that we index the main verse text should be
> applied to all text: intro, heading and verse text.
>
>
> c- Had JSword been configured to 'STORE' the content of some fields, I
> would have used that for headings. For example, if the headings are stored
> in the index, STEP would not need to do an osis extract and XML transform
> to display to the user. It could come straight from the index. Two
> possibilities here: change the existing index field configuration, or
> duplicate into a different field.
>
>
> I think we should make store an option, possibly the standard.
>
> Right now the way we do the index prevents us from using Lucene to
> highlight the search hit. If that is STORE, then I'd be in favor of making
> STORE standard. I wonder if our stripping the text to not include OSIS
> before indexing will frustrate this change.
>
> It still should be an option for the sake of devices that are disk limited.
>
> d- the other side of c- is that ideally multiple headings should be stored
> in multiple entries to the same field, rather than a concatenation of the
> field (doesn't much matter if it's only ANALYZED)
>
>
> Some verses have headings in the middle of the verse. Don't make the
> mistake of assuming an order of heading. Or that heading contains only
> pre-verse material or all pre-verse material.
>
>
> *I only need one of a- or b- to be able to progress. Happy to do either. I
> don't need c- because I've worked around, but it would have been nice to
> have some control over that. *
>
> pros & cons:
> a- more extensible in the future, other frontends don't benefit from
> enhancements
> b- solves an immediate problem, but impacts all frontends (i.e. space used
> in index).
>
> The only other bit in my mind is whether we need to ensure
> index-cross-application compatibility. I suspect some of this will tie in
> with the good work that Sijo has done on index management.
>
>
> The index management will be more critical with such a change. I've talked
> about having a manifest which defines the characteristics of the index. If
> we share an index created by two different systems, it will be important to
> "know" what an index supports.
>
> One of the changes that is being worked on is the update to a more recent
> version of Lucene. This affects how stemming is done. The way we are doing
> it now is deprecated and dropped.
>
>
> Let me know what your preferences are.
>
>
> Progress not perfection. Shared, configurable changes.
>
> Chris
>
> _______________________________________________
> jsword-devel mailing list
> jsword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/jsword-devel
>


-- 
Regards,
Sijo