[jsword-devel] Extending Lucene Indexes and stemming in particular

Chris Burrell chris at burrell.me.uk
Thu Apr 17 09:09:05 MST 2014


Hello

STEP uses stemming to improve search results for some queries (whether on
Sword modules or otherwise).

There are currently 2 limitations in JSword, both of which could easily be
fixed. Please let me know if you have any concerns about me implementing
them.

a- the frontend can't extend/control the use of indexes. I'm suggesting we
add a registerFieldIndexer(fieldIndexer) with a simple interface:
indexField(doc, osis). This would allow frontends to specify their own
indexing: a frontend could index new things, or enable term vectors /
store fields, etc.

b- Extend the LuceneIndex to have a stemmed version of the heading. We
could replace the existing index, but that would mean all frontends would
require re-indexing.

c- Had JSword been configured to 'STORE' the content of some fields, I
would have used that for headings. For example, if the headings were stored
in the index, STEP would not need to do an OSIS extract and XML transform
to display them to the user. They could come straight from the index. Two
possibilities here: change the existing index field configuration, or
duplicate the content into a different field.

d- the other side of c- is that ideally multiple headings should be stored
as multiple entries in the same field, rather than concatenated into a
single value (this doesn't much matter if the field is only ANALYZED)
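
To illustrate the point in d-, here is a small sketch in plain Java. It does
not use the Lucene API: plain lists stand in for the values of one field, the
class and method names are made up, the toy tokenizer is an assumption, and
the two headings are only sample data. It shows that both layouts produce the
same tokens for simple term matching, while only the multi-valued layout keeps
each heading recoverable as a stored value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; lists stand in for the values of one index field.
final class HeadingFieldSketch {

    // Current style: all headings joined into a single field value.
    static List<String> concatenated(List<String> headings) {
        return Collections.singletonList(String.join(" ", headings));
    }

    // Suggested style: one field value per heading.
    static List<String> multiValued(List<String> headings) {
        return new ArrayList<>(headings);
    }

    // Toy analyzer (an assumption): lower-case and split on whitespace.
    static List<String> tokens(List<String> fieldValues) {
        List<String> out = new ArrayList<>();
        for (String value : fieldValues) {
            for (String token : value.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    out.add(token);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> headings = Arrays.asList("The Beginning", "The Seventh Day");

        // Stored values differ: the heading boundary is lost when concatenated.
        System.out.println(concatenated(headings)); // [The Beginning The Seventh Day]
        System.out.println(multiValued(headings));  // [The Beginning, The Seventh Day]

        // Tokens are identical, so term matching on an ANALYZED-only field is unaffected.
        System.out.println(tokens(concatenated(headings)).equals(tokens(multiValued(headings)))); // true
    }
}
```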

*I only need one of a- or b- to be able to progress. Happy to do either. I
don't need c- because I've worked around it, but it would have been nice to
have some control over that.*

pros & cons:
a- more extensible in the future, but other frontends don't benefit from
the enhancements
b- solves an immediate problem, but impacts all frontends (i.e. extra space
used in the index).
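
For a-, a rough sketch of what the extension point might look like. Only the
names registerFieldIndexer and indexField(doc, osis) come from the proposal
above; everything else is an assumption. In particular, SketchDocument is a
stand-in for the Lucene document so the example runs without any dependency,
SketchIndexBuilder is not an existing JSword class, and the "hasTitle" field
is a made-up example of something a frontend might want to index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a Lucene document (hypothetical): field name -> list of values.
final class SketchDocument {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    void add(String field, String value) {
        fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    List<String> values(String field) {
        List<String> found = fields.get(field);
        return found == null ? Collections.<String>emptyList() : found;
    }
}

// The proposed callback: a frontend adds its own fields for each indexed entry.
interface FieldIndexer {
    void indexField(SketchDocument doc, String osis);
}

// Hypothetical index builder that runs every registered indexer per entry.
final class SketchIndexBuilder {
    private final List<FieldIndexer> indexers = new ArrayList<>();

    void registerFieldIndexer(FieldIndexer indexer) {
        indexers.add(indexer);
    }

    SketchDocument buildDocument(String key, String osis) {
        SketchDocument doc = new SketchDocument();
        doc.add("key", key); // built-in field, always indexed
        for (FieldIndexer indexer : indexers) {
            indexer.indexField(doc, osis); // frontend-specific fields
        }
        return doc;
    }
}

final class FieldIndexerDemo {
    public static void main(String[] args) {
        SketchIndexBuilder builder = new SketchIndexBuilder();
        // A frontend registers its own indexing; "hasTitle" is a made-up field.
        builder.registerFieldIndexer(
                (doc, osis) -> doc.add("hasTitle", String.valueOf(osis.contains("<title"))));

        SketchDocument doc = builder.buildDocument(
                "Gen.1.1", "<title>Sample heading</title><verse>Sample text</verse>");
        System.out.println(doc.values("hasTitle"));
    }
}
```

With this shape, other frontends are unaffected unless they register an
indexer themselves, which is the trade-off noted for a- above.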

The only other thing on my mind is whether we need to ensure
cross-application index compatibility. I suspect some of this will tie in
with the good work that Sijo has done on index management.

Let me know what your preferences are.
Chris