[jsword-devel] Extending Lucene Indexes and stemming in particular

Sijo Cherian sijo.cherian at gmail.com
Sat Apr 19 19:09:43 MST 2014


Chris,
Since we already have a language-based Analyzer configuration, you could
provide a custom jsword/src/main/resources/AnalyzerFactory.properties in
STEP and add a custom config for English like this:

en.Analyzer=org.crosswire.jsword.index.lucene.analysis.ConfigurableSnowballAnalyzer

This will stem the "content" field, both during indexing and at query time.
Can you easily override properties files on your classpath?

Regarding your requirement to stem the heading: since the current
implementation for "heading" uses the default analyzer, you will have to
change the property "Default.Analyzer" to snowball. That will have a bigger
impact, though: snowball would then be used for all other fields too.
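
To make the trade-off concrete, here is a minimal sketch in plain Java of
the lookup as I described it above. It is a simplified model, not the actual
JSword code: the class and method names are made up for illustration, and the
value used for "Default.Analyzer" is a placeholder. The assumption is that
"content" uses the language-specific entry when one exists, and every other
field (including "heading") uses "Default.Analyzer".

```java
import java.util.Properties;

// Hypothetical illustration only; not part of JSword.
final class AnalyzerLookupSketch {

    // Returns the analyzer class name that would be chosen for a field.
    // Assumption: only "content" consults the <lang>.Analyzer key;
    // all other fields fall back to Default.Analyzer.
    static String analyzerFor(Properties props, String lang, String field) {
        String fallback = props.getProperty("Default.Analyzer");
        if ("content".equals(field)) {
            return props.getProperty(lang + ".Analyzer", fallback);
        }
        return fallback;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder value, not the real default shipped with JSword.
        props.setProperty("Default.Analyzer", "example.PlaceholderDefaultAnalyzer");
        props.setProperty("en.Analyzer",
            "org.crosswire.jsword.index.lucene.analysis.ConfigurableSnowballAnalyzer");

        System.out.println("content -> " + analyzerFor(props, "en", "content"));
        System.out.println("heading -> " + analyzerFor(props, "en", "heading"));
    }
}
```

In this model, adding en.Analyzer only changes "content" for English books,
while changing Default.Analyzer changes "heading" and every language that has
no entry of its own.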




On Sat, Apr 19, 2014 at 4:14 AM, Chris Burrell <chris at burrell.me.uk> wrote:

> I don't mind configuration so long as these indexes are stored separately
> per app.
>
> STEP relies on stemming, and in the places where it uses it we can't ask
> the user, nor would it make sense to. So things would break and be quite
> hard to debug.
> Chris
> On 19 Apr 2014 06:13, "Sijo Cherian" <sijo.cherian at gmail.com> wrote:
>
>>
>> Great discussion. isProgress.
>>
>> I am still pondering all the benefits of double indexing the entire
>> content.
>>
>> For specialized users who don't want stemming to factor into their
>> searching: can we provide an API for them to specify params like
>> noStemming, noLowercase, etc. at indexing time on a per-book basis, and
>> persist that metadata in a property file? We would then use the exact same
>> properties during query analysis. These users probably won't want
>> auto-reindexing on a major jsword upgrade.
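
A rough sketch of that idea in plain Java, assuming the options are written
next to the index as a properties file when the book is indexed and read back
before the query is analyzed. The class name and the file format are
hypothetical; only the option names noStemming and noLowercase come from the
suggestion above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration only; not part of JSword.
final class IndexOptionsSketch {
    final boolean noStemming;
    final boolean noLowercase;

    IndexOptionsSketch(boolean noStemming, boolean noLowercase) {
        this.noStemming = noStemming;
        this.noLowercase = noLowercase;
    }

    // Index time: serialize the options in properties format.
    String store() {
        Properties p = new Properties();
        p.setProperty("noStemming", Boolean.toString(noStemming));
        p.setProperty("noLowercase", Boolean.toString(noLowercase));
        StringWriter out = new StringWriter();
        try {
            p.store(out, "per-book index options (hypothetical format)");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // Query time: read the options back; missing keys default to false.
    static IndexOptionsSketch load(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new IndexOptionsSketch(
            Boolean.parseBoolean(p.getProperty("noStemming", "false")),
            Boolean.parseBoolean(p.getProperty("noLowercase", "false")));
    }

    public static void main(String[] args) {
        String saved = new IndexOptionsSketch(true, false).store();
        IndexOptionsSketch atQueryTime = load(saved);
        System.out.println("noStemming=" + atQueryTime.noStemming
            + ", noLowercase=" + atQueryTime.noLowercase);
    }
}
```

The point of the round trip is that query analysis never guesses: it uses
whatever was recorded when the index was built.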
>>
>> Easter is almost here!
>> -sijo
>> On Thu, Apr 17, 2014 at 8:40 PM, DM Smith <dmsmith at crosswire.org> wrote:
>>
>>>
>>> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>>>
>>> Hello
>>>
>>> STEP uses stemming to improve search results, in some queries (whether
>>> on Sword modules or otherwise).
>>>
>>>
>>> Stemming is very useful. On occasion, there is a need for a non-stemmed
>>> search. Especially for theological purposes. But for general purpose
>>> searching it should be the default.
>>>
>>> I've sometimes thought it'd be good to double index: stemmed and full
>>> word.
>>>
>>>
>>> There are currently 2 limitations in JSword, both of which could easily
>>> be fixed. Please let me know if you have concerns around me implementing
>>> both.
>>>
>>> a- the frontend can't extend/control the use of indexes. I'm suggesting
>>> we add a registerFieldIndexer(fieldIndexer) with a simple interface:
>>> indexField(doc, osis). This would allow frontends to specify their own
>>> indexing. This would allow a frontend to index new things, or enable term
>>> vectors / store fields, etc.
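
For reference, proposal a- could look roughly like the sketch below. Only the
names registerFieldIndexer and indexField(doc, osis) come from Chris's
message. Everything else is an assumption made to keep the example
self-contained: the document is modelled as a map of field name to values
(a stand-in for a Lucene document) and the OSIS input is a plain string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of JSword.
final class IndexerRegistrySketch {

    // The "simple interface" from the proposal, with stand-in parameter types.
    interface FieldIndexer {
        void indexField(Map<String, List<String>> doc, String osis);
    }

    private final List<FieldIndexer> indexers = new ArrayList<>();

    // Frontends register their own indexers before indexing starts.
    void registerFieldIndexer(FieldIndexer fieldIndexer) {
        indexers.add(fieldIndexer);
    }

    // Called once per entry: every registered indexer may add fields.
    Map<String, List<String>> buildDocument(String osis) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (FieldIndexer indexer : indexers) {
            indexer.indexField(doc, osis);
        }
        return doc;
    }

    public static void main(String[] args) {
        IndexerRegistrySketch registry = new IndexerRegistrySketch();
        // Example frontend-specific field: the length of the raw text.
        registry.registerFieldIndexer((doc, osis) ->
            doc.computeIfAbsent("length", k -> new ArrayList<>())
               .add(Integer.toString(osis.length())));
        System.out.println(registry.buildDocument("In the beginning"));
    }
}
```

Because each field maps to a list of values, the same shape would also allow
several headings to be stored as separate entries in one field, as in d-.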
>>>
>>>
>>> I'd really rather that we didn't go down this route. I don't mind plugin
>>> architecture as a way to experiment with different techniques, but I'd
>>> really rather that we all benefit from the changes.
>>>
>>>
>>> b- Extend the LuceneIndex to have a stemmed version of the heading. We
>>> could replace the existing index, but that would mean all frontends will
>>> require re-indexing.
>>>
>>>
>>> I think the same manner that we index the main verse text should be
>>> applied to all text: intro, heading and verse text.
>>>
>>>
>>> c- Had JSword been configured to 'STORE' the content of some fields, I
>>> would have used that for headings. For example, if the headings are stored
>>> in the index, STEP would not need to do an osis extract and XML transform
>>> to display to the user. It could come straight from the index. Two
>>> possibilities here: change the existing index field configuration, or
>>> duplicate into a different field.
>>>
>>>
>>> I think we should make store an option, possibly the standard.
>>>
>>> Right now the way we do the index prevents us from using Lucene to
>>> highlight the search hit. If STORE is what enables that, then I'd be in
>>> favor of making STORE standard. I wonder if our stripping the text to not
>>> include OSIS before indexing will frustrate this change.
>>>
>>> It still should be an option for the sake of devices that are disk
>>> limited.
>>>
>>> d- the other side of c- is that ideally multiple headings should be
>>> stored in multiple entries to the same field, rather than a concatenation
>>> of the field (doesn't much matter if it's only ANALYZED)
>>>
>>>
>>> Some verses have headings in the middle of the verse. Don't make the
>>> mistake of assuming an order of heading. Or that heading contains only
>>> pre-verse material or all pre-verse material.
>>>
>>>
>>> *I only need one of a- or b- to be able to progress. Happy to do either.
>>> I don't need c- because I've worked around, but it would have been nice to
>>> have some control over that. *
>>>
>>> pros & cons:
>>> a- more extensible in the future, other frontends don't benefit from
>>> enhancements
>>> b- solves an immediate problem, but impacts all frontends (i.e. space
>>> used in index).
>>>
>>> The only other bit in my mind is whether we need to ensure
>>> index-cross-application compatibility. I suspect some of this will tie in
>>> with the good work that Sijo has done on index management.
>>>
>>>
>>> The index management will be more critical with such a change. I've
>>> talked about having a manifest which defines the characteristics of the
>>> index. If we share an index created by two different systems, it will be
>>> important to "know" what an index supports.
>>>
>>> One of the changes that is being worked on is the update to a more
>>> recent version of Lucene. This affects how stemming is done. The way we are
>>> doing it now is deprecated and dropped.
>>>
>>>
>>> Let me know what your preferences are.
>>>
>>>
>>> Progress not perfection. Shared, configurable changes.
>>>
>>> Chris
>>>
>>> _______________________________________________
>>> jsword-devel mailing list
>>> jsword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>>
>>>
>>>
>>
>>
>> --
>> Regards,
>> Sijo
>>


-- 
Regards,
Sijo