[jsword-devel] Extending Lucene Indexes and stemming in particular

Fri Apr 18 13:50:34 MST 2014

Sorry, are you saying that the changes will break all previously generated
indexes?  This will be a problem.

Martin

On 18 April 2014 21:05, Chris Burrell <chris at burrell.me.uk> wrote:

> Hi DM
>
> *1- Stemming*
> Yes, I was expecting the stem to give a hit, but we found it matched more
> words than we were wanting. I can't think of an example off the top of my
> head. The other thing we found is that you can't share the same field
> because stems aren't always prefixes.
>
> (For example, there are two PorterStemmers available in Lucene 3 / JSword
> classpath at the moment - one of them, can't remember which, gives a stem
> for genealogy to be genealogi - the other gives the stem as genealog).
>
> So for highlighting, you definitely would need to use Lucene, and I'm not
> entirely sure how well it would cope.
>
> In STEP we use it for various things, most of which are related to finding a
> 'topic' or for identifying 'meanings' of words, rather than for actual word
> searches. When a user picks a word, they want that word. But we allow
> searching for 'love' as a topic, using Naves, or as a word, looking through
> a lexicon for all entries matching the stem.
>
> *2- Segregating apps*
> I think for this, we would want to allow a frontend to register its name
> (prefix and name?). This would allow us to create indexes such as esv-bd,
> esv-ab, esv-step, etc. It would also allow for application specific sidecar
> configurations.  The logic would then go app-specific, jsword-specific,
> sword-specific.
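
The naming scheme and lookup order described above could be sketched roughly as follows. This is only an illustration: `AppIndexNaming`, `indexName` and `resolve` are hypothetical names, not JSword API, and the plain maps stand in for whatever sidecar configuration a frontend would actually read.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper; none of these names exist in JSword. */
final class AppIndexNaming {
    private AppIndexNaming() { }

    /** Builds an app-specific index name such as "esv-step" from book initials and a frontend prefix. */
    static String indexName(String bookInitials, String appPrefix) {
        return bookInitials.toLowerCase(Locale.ROOT) + "-" + appPrefix.toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the first value found for key, searching the layers in the order given.
     * Callers are assumed to pass the layers as: app-specific, jsword-specific, sword-specific.
     */
    static Optional<String> resolve(String key, List<Map<String, String>> layers) {
        for (Map<String, String> layer : layers) {
            String value = layer.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

With this shape, a setting in the app-specific layer shadows the same key in the JSword and Sword layers, and a key absent from all three resolves to an empty result.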
>
> Chris
>
>
>
> On 18 April 2014 20:47, DM Smith <dmsmith at crosswire.org> wrote:
>
>>
>> On Apr 18, 2014, at 4:23 AM, Chris Burrell <chris at burrell.me.uk> wrote:
>>
>>
>>
>>
>> On 18 April 2014 01:40, DM Smith <dmsmith at crosswire.org> wrote:
>>
>>>
>>> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>>>
>>> Hello
>>>
>>> STEP uses stemming to improve search results, in some queries (whether
>>> on Sword modules or otherwise).
>>>
>>>
>>> Stemming is very useful. On occasion, there is a need for a non-stemmed
>>> search. Especially for theological purposes. But for general purpose
>>> searching it should be the default.
>>>
>> Are you suggesting we have 'heading' being the stemmed search and
>> fullHeading (or something like that) being the non-stemmed? I do think that
>> by default however, we should have the normal search. We experimented with
>> stemming in STEP by default and it can be quite confusing to look for a
>> particular word and hit others. Stemming doesn't always work the way you
>> expect.
>>
>>
>> I guess I'm confused by your previous comment. I thought you were
>> expecting the stem to give a hit.
>>
>> I think it is confusing because the 'hit' is not highlighted. If a stem
>> is highlighted then the user can quickly see and determine that it was
>> something they didn't want.
>>
>> Personally, I don't like stemming because I'm looking for a certain word
>> not heuristic variations of the word. Also, I don't like dropping stop
>> words (aka noise words) as many of them are theologically significant (e.g.
>> in Christ).
>>
>>
>>
>>
>>
>>> I've sometimes thought it'd be good to double index: stemmed and full
>>> word.
>>>
>> Double indexing is needed if you want both. The stem for genealogy
>> resolves to genealogi (because of the plurals), which is why my search
>> didn't get a hit. We can't use the same field.
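
A toy sketch of why the exact and stemmed terms need separate fields. It uses neither Lucene nor JSword: `DoubleIndexSketch` is hypothetical, the stemmer is passed in as a plain function, and the field names simply reuse the tentative 'heading' / 'fullHeading' split floated in this thread. Tokenising with `\W+` is a simplification that only suits ASCII text.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Toy in-memory index with one exact field and one stemmed field (not Lucene, not JSword). */
final class DoubleIndexSketch {
    static final String EXACT_FIELD = "fullHeading"; // tentative name from the thread
    static final String STEMMED_FIELD = "heading";   // tentative name from the thread

    private final Map<String, Set<String>> termsByField = new HashMap<>();
    private final UnaryOperator<String> stemmer;

    DoubleIndexSketch(UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
    }

    /** Indexes every word twice: once as written, once as stemmed. */
    void index(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            termsByField.computeIfAbsent(EXACT_FIELD, k -> new HashSet<>()).add(token);
            termsByField.computeIfAbsent(STEMMED_FIELD, k -> new HashSet<>()).add(stemmer.apply(token));
        }
    }

    /** Looks the word up unchanged in the exact field. */
    boolean matchesExact(String word) {
        return contains(EXACT_FIELD, word.toLowerCase(Locale.ROOT));
    }

    /** Stems the query word the same way as at index time, then looks it up in the stemmed field. */
    boolean matchesStemmed(String word) {
        return contains(STEMMED_FIELD, stemmer.apply(word.toLowerCase(Locale.ROOT)));
    }

    private boolean contains(String field, String term) {
        Set<String> terms = termsByField.get(field);
        return terms != null && terms.contains(term);
    }
}
```

Given a stub stemmer that maps both "genealogy" and "genealogies" to "genealogi" (the stem reported above), indexing a heading containing "genealogies" lets a stemmed search for "genealogy" hit, while an exact search for "genealogy" still misses, which is the behaviour a user picking a specific word would expect.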
>>
>>
>>>
>>> There are currently 2 limitations in JSword, both of which could easily
>>> be fixed. Please let me know if you have concerns around me implementing
>>> both.
>>>
>>> a- the frontend can't extend/control the use of indexes. I'm suggesting
>>> we add a registerFieldIndexer(fieldIndexer) with a simple interface:
>>> indexField(doc, osis). This would allow frontends to specify their own
>>> indexing. This would allow a frontend to index new things, or enable term
>>> vectors / store fields, etc.
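
One possible shape for proposal a-, as a sketch only. The method names `registerFieldIndexer` and `indexField` come from the message above; the `FieldIndexerRegistry` class, the `buildDocument` method and the use of a map as a stand-in for a Lucene document are assumptions made so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Callback a frontend would implement; the map is a stand-in for a Lucene Document. */
interface FieldIndexer {
    void indexField(Map<String, List<String>> doc, String osis);
}

/** Hypothetical registry that runs every registered indexer over each piece of OSIS. */
final class FieldIndexerRegistry {
    private final List<FieldIndexer> indexers = new ArrayList<>();

    /** Registers a frontend-supplied indexer; indexers run in registration order. */
    void registerFieldIndexer(FieldIndexer fieldIndexer) {
        indexers.add(Objects.requireNonNull(fieldIndexer));
    }

    /** Builds one document by letting each indexer add whatever fields it wants. */
    Map<String, List<String>> buildDocument(String osis) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (FieldIndexer indexer : indexers) {
            indexer.indexField(doc, osis);
        }
        return doc;
    }
}
```

A frontend would register a lambda or class that adds its own fields; the core indexing loop would not need to know what those fields are.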
>>>
>>>
>>> I'd really rather that we didn't go down this route. I don't mind plugin
>>> architecture as a way to experiment with different techniques, but I'd
>>> really rather that we all benefit from the changes.
>>>
>> Fine.
>>
>>
>>>
>>> b- Extend the LuceneIndex to have a stemmed version of the heading. We
>>> could replace the existing index, but that would mean all frontends will
>>> require re-indexing.
>>>
>>>
>>> I think the same manner that we index the main verse text should be
>>> applied to all text: intro, heading and verse text.
>>>
>> Happy to do the change for all three.
>>
>>
>> For Bible Desktop, we'll have to force re-indexing anyway. I'm finding
>> that the old indexes don't work with the current code. I'm looking forward
>> to using Sijo's code to fix this.
>>
>>
>>
>>>
>>> c- Had JSword been configured to 'STORE' the content of some fields, I
>>> would have used that for headings. For example, if the headings are stored
>>> in the index, STEP would not need to do an osis extract and XML transform
>>> to display to the user. It could come straight from the index. Two
>>> possibilities here: change the existing index field configuration, or
>>> duplicate into a different field.
>>>
>>>
>>> I think we should make store an option, possibly the standard.
>>>
>> What I don't want to happen is end up in a situation where the Index is
>> shared in different configurations by different apps. That would break the
>> frontend.
>>
>>
>> Yep. If we can agree on what and how, that'd be best.
>>
>> Even if you can ask, 'do you support', that's unnecessary complexity,
>> that means that a user will have to re-index each book he has to support
>> different front-ends. It also means that if a frontend forgets to ask
>> whether some fields are indexed in a particular way, then he's going to
>> have broken functionality in the frontend due to another frontend
>> overriding the defaults. At this stage, I'd rather have app-specific
>> indices.
>>
>>
>>
>>>
>>> Right now the way we do the index prevents us from using Lucene to
>>> highlight the search hit. If that is STORE, then I'd be in favor of making
>>> STORE standard. I wonder if our stripping the text to not include OSIS
>>> before indexing will frustrate this change.
>>>
>> Store is a requirement for highlighting (
>> http://lucene.472066.n3.nabble.com/Highlighting-for-non-stored-fields-td1773015.html and
>> http://wiki.apache.org/lucene-java/LuceneFAQ).
>>
>>
>>> It still should be an option for the sake of devices that are disk
>>> limited.
>>>
>>> d- the other side of c- is that ideally multiple headings should be
>>> stored in multiple entries to the same field, rather than a concatenation
>>> of the field (doesn't much matter if it's only ANALYZED)
>>>
>>>
>>> Some verses have headings in the middle of the verse. Don't make the
>>> mistake of assuming an order of heading. Or that heading contains only
>>> pre-verse material or all pre-verse material.
>>>
>> I'm not making that mistake... All I'm saying is that headings should be
>> stored in different entries in the same field.
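>> // Lucene 3.x style: one Field instance per heading, all under the same name
>> doc.add(new Field(fieldName, heading1, Field.Store.YES, Field.Index.ANALYZED));
>> doc.add(new Field(fieldName, heading2, Field.Store.YES, Field.Index.ANALYZED));
>> doc.add(new Field(fieldName, heading3, Field.Store.YES, Field.Index.ANALYZED));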
>>
>> This means that you could retrieve one of the headings you want, rather
>> than all. i.e. Psalm 3.1 Non-canon-heading Canon-heading could have 3
>> separate fields.
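
To illustrate the difference between several entries under one field name and a single concatenated value, here is a minimal stand-in. `MultiValuedDoc` is hypothetical and is not the Lucene class; Lucene's own `Document` offers a comparable `getValues(String)` accessor for stored fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for a document whose fields may hold several values under one name. */
final class MultiValuedDoc {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Adds one more entry under fieldName instead of appending to a single string. */
    void add(String fieldName, String value) {
        values.computeIfAbsent(fieldName, k -> new ArrayList<>()).add(value);
    }

    /** Returns every entry for fieldName in insertion order; empty if the field is absent. */
    List<String> getValues(String fieldName) {
        return Collections.unmodifiableList(values.getOrDefault(fieldName, Collections.emptyList()));
    }
}
```

Because each heading keeps its own entry, a caller can pick out the second heading of a verse without parsing a concatenated string. Insertion order here says nothing about where the heading sits relative to the verse text.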
>>
>>
>> This would be a good change.
>>
>>
>>
>>>
>>> *I only need one of a- or b- to be able to progress. Happy to do either.
>>> I don't need c- because I've worked around, but it would have been nice to
>>> have some control over that. *
>>>
>>> pros & cons:
>>> a- more extensible in the future, other frontends don't benefit from
>>> enhancements
>>> b- solves an immediate problem, but impacts all frontends (i.e. space
>>> used in index).
>>>
>>> The only other bit in my mind is whether we need to ensure
>>> index-cross-application compatibility. I suspect some of this will tie in
>>> with the good work that Sijo has done on index management.
>>>
>>>
>>> The index management will be more critical with such a change. I've
>>> talked about having a manifest which defines the characteristics of the
>>> index. If we share an index created by two different systems, it will be
>>> important to "know" what an index supports.
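
A rough sketch of what such a manifest might look like if it were kept as a simple properties file next to the index. Everything here is an assumption: the `IndexManifest` class, the key names (`created.by`, `heading.stored`, `heading.stemmed`) and the file format are invented for illustration and are not an agreed design.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical index manifest; the keys used here are invented for illustration. */
final class IndexManifest {
    private final Properties props = new Properties();

    private IndexManifest() { }

    /** Parses manifest text in java.util.Properties format. */
    static IndexManifest parse(String text) {
        IndexManifest manifest = new IndexManifest();
        try {
            manifest.props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return manifest;
    }

    /** True only if the manifest explicitly declares the feature as "true". */
    boolean supports(String feature) {
        return Boolean.parseBoolean(props.getProperty(feature, "false"));
    }

    /** Name of the application that built the index, or "unknown" if not recorded. */
    String createdBy() {
        return props.getProperty("created.by", "unknown");
    }
}
```

A frontend could then check `supports("heading.stored")` before relying on stored headings, and fall back to re-indexing when the answer is no. Treating an undeclared feature as unsupported is the conservative default chosen for this sketch.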
>>>
>> As described above, I'd like to avoid this. I don't think a frontend
>> should have to worry about other frontends 'corrupting' the index (i.e.
>> redefining fields, changing the store status, etc.). I'd rather have my
>> own index at that point.
>>
>>
>>
>>> One of the changes that is being worked on is the update to a more
>>> recent version of Lucene. This affects how stemming is done. The way we are
>>> doing it now is deprecated and dropped.
>>>
>>>
>>> Let me know what your preferences are.
>>>
>>>
>>> Progress not perfection. Shared, configurable changes.
>>>
>>> Chris
>>>
>>> _______________________________________________
>>> jsword-devel mailing list
>>> jsword-devel at crosswire.org
>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>>
>>>
>>>
>>
>>
>
>
>