[jsword-devel] Extending Lucene Indexes and stemming in particular

Fri Apr 18 14:10:52 MST 2014

Apologies, I hadn't read your comment on the pull request before I saw DM's
comment above "For Bible Desktop, we'll have to force re-indexing anyway.
I'm finding that the old indexes don't work with the current code. I'm
looking forward to using Sijo's code to fix this."

Martin

On 18 April 2014 21:55, Chris Burrell <chris at burrell.me.uk> wrote:

> Hi Martin
>
> No, that's not what I'm saying. I've updated the pull request a few
> minutes ago with a comment to the exact opposite.
>
> The pull request doesn't change the content of anything. It adds new
> stemmed fields as separate document fields, and changes the configuration
> of some fields to be stored as well as indexed/analyzed.
>
> Have tested on both old and new indexes locally and it works absolutely
> fine. But it would be worth you testing as well.
>
> What bit was confusing?
> Chris
>
>
>
> On 18 April 2014 21:50, Martin Denham <mjdenham at gmail.com> wrote:
>
>> Sorry, are you saying that the changes will break all previously
>> generated indexes?  This will be a problem.
>>
>> Martin
>>
>>
>> On 18 April 2014 21:05, Chris Burrell <chris at burrell.me.uk> wrote:
>>
>>> Hi DM
>>>
>>> *1- Stemming*
>>> Yes, I was expecting the stem to give a hit, but we found it matched
>>> more words than we were wanting. I can't think of an example off the top of
>>> my head. The other thing we found is that you can't share the same field
>>> because stems aren't always prefixes.
>>>
>>> (For example, there are two PorterStemmers available on the Lucene 3 /
>>> JSword classpath at the moment - one of them, can't remember which, gives
>>> the stem for genealogy as genealogi - the other gives the stem as genealog.)
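
A minimal sketch of why the stemmed and unstemmed terms can't share a field. The two stems are the ones quoted above, hard-coded rather than produced by a real Lucene stemmer, and the class and method names are hypothetical. A prefix query built from a stem only finds the unstemmed word when the stem happens to be a prefix of it.

```java
// Illustrative only: stems are hard-coded from the thread, not computed by Lucene.
final class StemPrefixSketch {
    /** True when a prefix query on {@code stem} (stem*) would match the unstemmed {@code word}. */
    static boolean prefixMatches(String stem, String word) {
        return word.startsWith(stem);
    }

    public static void main(String[] args) {
        String word = "genealogy";
        // "genealog" is a prefix of the word, so a prefix query could still find it.
        System.out.println(prefixMatches("genealog", word));
        // "genealogi" is not a prefix of "genealogy", so the unstemmed field gives no hit.
        System.out.println(prefixMatches("genealogi", word));
    }
}
```

With the second stemmer, only a field that was itself indexed with the same stemmer would contain the term genealogi.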
>>>
>>> So for highlighting, you definitely would need to use Lucene, and I'm
>>> not entirely sure how well it would cope.
>>>
>>> In STEP we use it for various things, most of which are related to finding
>>> a 'topic' or for identifying 'meanings' of words, rather than for actual
>>> word searches. When a user picks a word, they want that word. But we allow
>>> searching for 'love' as a topic, using Naves, or as a word, looking through
>>> a lexicon for all entries matching the stem.
>>>
>>> *2- Segregating apps*
>>> I think for this, we would want to allow a frontend to register its
>>> name (prefix and name?). This would allow us to create indexes such as
>>> esv-bd, esv-ab, esv-step, etc. It would also allow for application-specific
>>> sidecar configurations. The configuration lookup would then go app-specific,
>>> jsword-specific, sword-specific.
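
A rough sketch of the naming and lookup order described above, assuming a frontend registers a short prefix such as bd or step. The class and method names are hypothetical and nothing here is existing JSword API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: derives per-application index names and config lookup order.
final class AppIndexNaming {
    /** Builds an index name such as "esv-step" from a book's initials and an app prefix. */
    static String indexName(String bookInitials, String appPrefix) {
        return bookInitials.toLowerCase(Locale.ROOT) + "-" + appPrefix.toLowerCase(Locale.ROOT);
    }

    /** Order in which sidecar configuration would be consulted: app first, sword last. */
    static List<String> configLookupOrder(String appPrefix) {
        return Arrays.asList(appPrefix.toLowerCase(Locale.ROOT), "jsword", "sword");
    }

    public static void main(String[] args) {
        System.out.println(indexName("ESV", "step"));
        System.out.println(configLookupOrder("bd"));
    }
}
```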
>>>
>>> Chris
>>>
>>>
>>>
>>> On 18 April 2014 20:47, DM Smith <dmsmith at crosswire.org> wrote:
>>>
>>>>
>>>> On Apr 18, 2014, at 4:23 AM, Chris Burrell <chris at burrell.me.uk> wrote:
>>>>
>>>>
>>>>
>>>>
>>>> On 18 April 2014 01:40, DM Smith <dmsmith at crosswire.org> wrote:
>>>>
>>>>>
>>>>> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk>
>>>>> wrote:
>>>>>
>>>>> Hello
>>>>>
>>>>> STEP uses stemming to improve search results, in some queries (whether
>>>>> on Sword modules or otherwise).
>>>>>
>>>>>
>>>>> Stemming is very useful. On occasion, there is a need for a
>>>>> non-stemmed search. Especially for theological purposes. But for general
>>>>> purpose searching it should be the default.
>>>>>
>>>>> Are you suggesting we have 'heading' being the stemmed search and
>>>> fullHeading (or something like that) being the non-stemmed? I do think that
>>>> by default however, we should have the normal search. We experimented with
>>>> stemming in STEP by default and it can be quite confusing to look for a
>>>> particular word and hit others. Stemming doesn't always work the way you
>>>> expect.
>>>>
>>>>
>>>> I guess I'm confused by your previous comment. I thought you were
>>>> expecting the stem to give a hit.
>>>>
>>>> I think it is confusing because the 'hit' is not highlighted. If a stem
>>>> is highlighted then the user can quickly see and determine that it was
>>>> something they didn't want.
>>>>
>>>> Personally, I don't like stemming because I'm looking for a certain
>>>> word not heuristic variations of the word. Also, I don't like dropping stop
>>>> words (aka noise words) as many of them are theologically significant (e.g.
>>>> in Christ).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> I've sometimes thought it'd be good to double index: stemmed and full
>>>>> word.
>>>>>
>>>>> Double indexing is needed if you want both. The stem for genealogy
>>>> resolves to genealogi (because of the plurals) which is why my search
>>>> didn't get a hit. We can't use the same field.
>>>>
>>>>
>>>>>
>>>>> There are currently 2 limitations in JSword, both of which could
>>>>> easily be fixed. Please let me know if you have concerns around me
>>>>> implementing both.
>>>>>
>>>>> a- the frontend can't extend/control the use of indexes. I'm
>>>>> suggesting we add a registerFieldIndexer(fieldIndexer) with a simple
>>>>> interface: indexField(doc, osis). This would allow frontends to specify their
>>>>> own indexing. This would allow a frontend to index new things, or enable
>>>>> term vectors / store fields, etc.
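
One way proposal a- might look, as a sketch only. registerFieldIndexer and indexField are the names suggested above. Everything else is an assumption, including the Map used as a stand-in for a Lucene Document and the plain String used for the OSIS content.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical hook: a frontend adds its own fields for one key's OSIS content. */
interface FieldIndexer {
    void indexField(Map<String, List<String>> doc, String osis);
}

/** Hypothetical registry that the index builder would consult for each document. */
final class IndexerRegistry {
    private final List<FieldIndexer> indexers = new ArrayList<>();

    void registerFieldIndexer(FieldIndexer fieldIndexer) {
        indexers.add(fieldIndexer);
    }

    /** Builds one document by letting every registered indexer contribute fields. */
    Map<String, List<String>> buildDocument(String osis) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (FieldIndexer indexer : indexers) {
            indexer.indexField(doc, osis);
        }
        return doc;
    }
}
```

A frontend would register a lambda or class once at start-up. Each registered indexer then sees every document as it is built.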
>>>>>
>>>>>
>>>>> I'd really rather that we didn't go down this route. I don't mind
>>>>> plugin architecture as a way to experiment with different techniques, but
>>>>> I'd really rather that we all benefit from the changes.
>>>>>
>>>>> Fine.
>>>>
>>>>
>>>>>
>>>>> b- Extend the LuceneIndex to have a stemmed version of the heading. We
>>>>> could replace the existing index, but that would mean all frontends will
>>>>> require re-indexing.
>>>>>
>>>>>
>>>>> I think the same manner that we index the main verse text should be
>>>>> applied to all text: intro, heading and verse text.
>>>>>
>>>>> Happy to do the change for all three.
>>>>
>>>>
>>>> For Bible Desktop, we'll have to force re-indexing anyway. I'm finding
>>>> that the old indexes don't work with the current code. I'm looking forward
>>>> to using Sijo's code to fix this.
>>>>
>>>>
>>>>
>>>>>
>>>>> c- Had JSword been configured to 'STORE' the content of some fields, I
>>>>> would have used that for headings. For example, if the headings are stored
>>>>> in the index, STEP would not need to do an osis extract and XML transform
>>>>> to display to the user. It could come straight from the index. Two
>>>>> possibilities here: change the existing index field configuration, or
>>>>> duplicate into a different field.
>>>>>
>>>>>
>>>>> I think we should make store an option, possibly the standard.
>>>>>
>>>> What I don't want to happen is end up in a situation where the Index is
>>>> shared in different configurations by different apps. That would break the
>>>> frontend.
>>>>
>>>>
>>>> Yep. If we can agree on what and how, that'd be best.
>>>>
>>>> Even if you can ask, 'do you support', that's unnecessary complexity.
>>>> It means that a user will have to re-index each book he has to support
>>>> different front-ends. It also means that if a frontend forgets to ask
>>>> whether some fields are indexed in a particular way, then he's going to
>>>> have broken functionality in the frontend due to another frontend
>>>> overriding the defaults. At this stage, I'd rather have app-specific
>>>> indices.
>>>>
>>>>
>>>>
>>>>>
>>>>> Right now the way we do the index prevents us from using Lucene to
>>>>> highlight the search hit. If that is STORE, then I'd be in favor of making
>>>>> STORE standard. I wonder if our stripping the text to not include OSIS
>>>>> before indexing will frustrate this change.
>>>>>
>>>>> Store is a requirement for highlighting (
>>>> http://lucene.472066.n3.nabble.com/Highlighting-for-non-stored-fields-td1773015.html and
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ).
>>>>
>>>>
>>>> It still should be an option for the sake of devices that are disk
>>>>> limited.
>>>>>
>>>>> d- the other side of c- is that ideally multiple headings should be
>>>>> stored in multiple entries to the same field, rather than a concatenation
>>>>> of the field (doesn't much matter if it's only ANALYZED)
>>>>>
>>>>>
>>>>> Some verses have headings in the middle of the verse. Don't make the
>>>>> mistake of assuming an order of heading. Or that heading contains only
>>>>> pre-verse material or all pre-verse material.
>>>>>
>>>>> I'm not making that mistake... All I'm saying is that headings should
>>>> be stored in different entries in the same field.
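>>>> doc.add(new Field(fieldName, heading1, Field.Store.YES, Field.Index.ANALYZED));
>>>> doc.add(new Field(fieldName, heading2, Field.Store.YES, Field.Index.ANALYZED));
>>>> doc.add(new Field(fieldName, heading3, Field.Store.YES, Field.Index.ANALYZED));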
>>>>
>>>> This means that you could retrieve one of the headings you want, rather
>>>> than all. i.e. Psalm 3.1 Non-canon-heading Canon-heading could have 3
>>>> separate fields.
>>>>
>>>>
>>>> This would be a good change.
>>>>
>>>>
>>>>
>>>>>
>>>>> *I only need one of a- or b- to be able to progress. Happy to do
>>>>> either. I don't need c- because I've worked around, but it would have been
>>>>> nice to have some control over that. *
>>>>>
>>>>> pros & cons:
>>>>> a- more extensible in the future, other frontends don't benefit from
>>>>> enhancements
>>>>> b- solves an immediate problem, but impacts all frontends (i.e. space
>>>>> used in index).
>>>>>
>>>>> The only other bit in my mind is whether we need to ensure
>>>>> index-cross-application compatibility. I suspect some of this will tie in
>>>>> with the good work that Sijo has done on index management.
>>>>>
>>>>>
>>>>> The index management will be more critical with such a change. I've
>>>>> talked about having a manifest which defines the characteristics of the
>>>>> index. If we share an index created by two different systems, it will be
>>>>> important to "know" what an index supports.
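
A sketch of what such a manifest check could look like, assuming a simple key=value file kept next to the index. The key names (heading.stored, heading.stemmed) and the class are hypothetical. No such manifest format is defined in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical manifest reader: answers "what does this index support?" per field.
final class IndexManifestSketch {
    private final Properties props = new Properties();

    /** Parses manifest text in java.util.Properties format (one key=value per line). */
    static IndexManifestSketch parse(String manifestText) {
        IndexManifestSketch manifest = new IndexManifestSketch();
        try {
            manifest.props.load(new StringReader(manifestText));
        } catch (IOException e) {
            throw new IllegalStateException("Unreadable manifest", e);
        }
        return manifest;
    }

    /** Fields that are not mentioned are treated as not stored. */
    boolean isStored(String field) {
        return Boolean.parseBoolean(props.getProperty(field + ".stored", "false"));
    }

    /** Fields that are not mentioned are treated as not stemmed. */
    boolean isStemmed(String field) {
        return Boolean.parseBoolean(props.getProperty(field + ".stemmed", "false"));
    }
}
```

A frontend would check isStored("heading") before relying on the index for display, and fall back to the OSIS extract otherwise.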
>>>>>
>>>>> As described above, I'd like to avoid this. I don't think a frontend
>>>> should have to worry about other frontends 'corrupting' the index (i.e.
>>>> redefining fields, changing the store status, etc.). I'd rather have my own
>>>> index at that point.
>>>>
>>>>
>>>>
>>>>> One of the changes that is being worked on is the update to a more
>>>>> recent version of Lucene. This affects how stemming is done. The way we are
>>>>> doing it now is deprecated and dropped.
>>>>>
>>>>>
>>>>> Let me know what your preferences are.
>>>>>
>>>>>
>>>>> Progress not perfection. Shared, configurable changes.
>>>>>
>>>>> Chris
>>>>>
>>>>> _______________________________________________
>>>>> jsword-devel mailing list
>>>>> jsword-devel at crosswire.org
>>>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>