[jsword-devel] Extending Lucene Indexes and stemming in particular

Chris Burrell chris at burrell.me.uk
Fri Apr 18 06:17:22 MST 2014


Just one more thing: in terms of the search highlighting, I'm not quite sure
what we are hoping for. What I don't think Lucene will be able to do is
highlight an English/German word based on the fact that it was tagged and
searched by Strong's number.

So the only advantage of highlighting is for the ~ search (approximate,
such as Melkizedk~). For all others, it's pretty easy for a frontend to
guess. With stemming I guess this becomes a bit more important, especially
if the stem isn't a prefix of the whole word (which happens, as in the
example above).
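
To make the prefix point concrete, here is a minimal Java sketch of the kind of guess a frontend might make. The class and method names are hypothetical and it uses no Lucene or JSword API. It marks every word that starts with the search term. That works for plain word and prefix searches, but once the term is a stem such as genealogi (the stem of genealogy quoted further down this thread), the original word is no longer matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of JSword or Lucene.
final class PrefixHighlightDemo {

    // Wraps every word that starts with `term` (case-insensitive) in [ ].
    static String highlight(String text, String term) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\w*",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        return m.replaceAll("[$0]");
    }

    public static void main(String[] args) {
        String text = "one genealogy, two genealogies";
        // A plain prefix marks both words.
        System.out.println(highlight(text, "genealog"));
        // The stem "genealogi" is not a prefix of "genealogy", so that word is missed.
        System.out.println(highlight(text, "genealogi"));
    }
}
```

With a stem as the term, only "genealogies" is marked, which is why a frontend would need help from the index once stemming is involved.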

Chris



On 18 April 2014 09:23, Chris Burrell <chris at burrell.me.uk> wrote:

>
>
>
> On 18 April 2014 01:40, DM Smith <dmsmith at crosswire.org> wrote:
>
>>
>> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>>
>> Hello
>>
>> STEP uses stemming to improve search results, in some queries (whether on
>> Sword modules or otherwise).
>>
>>
>> Stemming is very useful. On occasion, there is a need for a non-stemmed
>> search. Especially for theological purposes. But for general purpose
>> searching it should be the default.
>>
>> Are you suggesting we have 'heading' being the stemmed search and
> fullHeading (or something like that) being the non-stemmed? I do think that
> by default however, we should have the normal search. We experimented with
> stemming in STEP by default and it can be quite confusing to look for a
> particular word and hit others. Stemming doesn't always work the way you
> expect.
>
>
>
>> I've sometimes thought it'd be good to double index: stemmed and full
>> word.
>>
>> Double indexing is needed if you want both. The stem for genealogy
> resolves to genealogi (because of the plurals), which is why my search
> got no hit. We can't use the same field.
>
>
>>
>> There are currently 2 limitations in JSword, both of which could easily
>> be fixed. Please let me know if you have concerns around me implementing
>> both.
>>
>> a- the frontend can't extend/control the use of indexes. I'm suggesting
>> we add a registerFieldIndexer(fieldIndexer) with a simple interface:
>> indexField(doc, osis). This would allow a frontend to specify its own
>> indexing: to index new things, enable term vectors, store fields, etc.
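
A rough sketch of what proposal a- might look like, in plain Java. Only the names registerFieldIndexer and indexField come from the proposal. The FieldIndexer interface name, the registry class, the use of a Map in place of a Lucene Document, and a String in place of the OSIS element are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical callback a frontend would implement (name assumed).
interface FieldIndexer {
    // `doc` stands in for a Lucene Document, `osis` for the verse's OSIS content.
    void indexField(Map<String, List<String>> doc, String osis);
}

// Hypothetical registry; in JSword this role would sit with the index builder.
final class FieldIndexerRegistry {
    private final List<FieldIndexer> indexers = new ArrayList<>();

    void registerFieldIndexer(FieldIndexer indexer) {
        indexers.add(indexer);
    }

    // Builds one document and lets every registered indexer add its fields.
    Map<String, List<String>> index(String osis) {
        Map<String, List<String>> doc = new HashMap<>();
        for (FieldIndexer indexer : indexers) {
            indexer.indexField(doc, osis);
        }
        return doc;
    }

    public static void main(String[] args) {
        FieldIndexerRegistry registry = new FieldIndexerRegistry();
        // A frontend-specific field; lower-casing is just a placeholder for real analysis.
        registry.registerFieldIndexer((doc, osis) ->
                doc.computeIfAbsent("lowercased", k -> new ArrayList<>())
                   .add(osis.toLowerCase(Locale.ROOT)));
        System.out.println(registry.index("In The Beginning"));
    }
}
```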
>>
>>
>> I'd really rather that we didn't go down this route. I don't mind plugin
>> architecture as a way to experiment with different techniques, but I'd
>> really rather that we all benefit from the changes.
>>
>> Fine.
>
>
>>
>> b- Extend the LuceneIndex to have a stemmed version of the heading. We
>> could replace the existing index, but that would mean all frontends will
>> require re-indexing.
>>
>>
>> I think the same manner that we index the main verse text should be
>> applied to all text: intro, heading and verse text.
>>
>> Happy to do the change for all three.
>
>
>>
>> c- Had JSword been configured to 'STORE' the content of some fields, I
>> would have used that for headings. For example, if the headings are stored
>> in the index, STEP would not need to do an osis extract and XML transform
>> to display to the user. It could come straight from the index. Two
>> possibilities here: change the existing index field configuration, or
>> duplicate into a different field.
>>
>>
>> I think we should make store an option, possibly the standard.
>>
> What I don't want is to end up in a situation where the index is
> shared in different configurations by different apps. That would break the
> frontend. Even if you can ask 'do you support X?', that's unnecessary
> complexity: it means a user will have to re-index each book to
> support different frontends. It also means that if a frontend forgets
> to ask whether some fields are indexed in a particular way, it is going
> to have broken functionality because another frontend
> overrode the defaults. At this stage, I'd rather have app-specific
> indices.
>
>
>
>>
>> Right now the way we do the index prevents us from using Lucene to
>> highlight the search hit. If that is STORE, then I'd be in favor of making
>> STORE standard. I wonder if our stripping the text to not include OSIS
>> before indexing will frustrate this change.
>>
>> Store is a requirement for highlighting (
> http://lucene.472066.n3.nabble.com/Highlighting-for-non-stored-fields-td1773015.html and
> http://wiki.apache.org/lucene-java/LuceneFAQ).
>
>
> It still should be an option for the sake of devices that are disk limited.
>>
>> d- the other side of c- is that ideally multiple headings should be
>> stored in multiple entries to the same field, rather than a concatenation
>> of the field (doesn't much matter if it's only ANALYZED)
>>
>>
>> Some verses have headings in the middle of the verse. Don't make the
>> mistake of assuming an order of heading. Or that heading contains only
>> pre-verse material or all pre-verse material.
>>
>> I'm not making that mistake... All I'm saying is that headings should be
> stored in different entries in the same field.
> doc.add(fieldName, heading1);
> doc.add(fieldName, heading2);
> doc.add(fieldName, heading3);
>
> This means that you could retrieve one of the headings you want, rather
> than all. i.e. Psalm 3.1 Non-canon-heading Canon-heading could have 3
> separate fields.
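
The difference between the two layouts can be modelled without Lucene. In this sketch a document is just a map from a field name to its list of stored values, an assumption standing in for a Lucene Document that holds several entries under one field name. The class name, the field name "heading" and the placeholder heading strings are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a "document" maps a field name to its stored values.
final class HeadingFieldDemo {

    // One entry per heading, all under the same field name.
    static Map<String, List<String>> separateEntries(String field, List<String> headings) {
        Map<String, List<String>> doc = new HashMap<>();
        for (String heading : headings) {
            doc.computeIfAbsent(field, k -> new ArrayList<>()).add(heading);
        }
        return doc;
    }

    // All headings joined into a single entry; the boundaries are lost.
    static Map<String, List<String>> concatenated(String field, List<String> headings) {
        Map<String, List<String>> doc = new HashMap<>();
        List<String> single = new ArrayList<>();
        single.add(String.join(" ", headings));
        doc.put(field, single);
        return doc;
    }

    public static void main(String[] args) {
        List<String> headings = Arrays.asList("heading1", "heading2", "heading3");
        // Separate entries: any single heading can be read back on its own.
        System.out.println(separateEntries("heading", headings).get("heading").get(1));
        // Concatenated: only the joined string is available.
        System.out.println(concatenated("heading", headings).get("heading").get(0));
    }
}
```

For a field that is only analyzed, the two layouts search much the same. The difference shows up once the values are stored and read back.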
>
>
>>
>> *I only need one of a- or b- to be able to progress. Happy to do either.
>> I don't need c- because I've worked around, but it would have been nice to
>> have some control over that. *
>>
>> pros & cons:
>> a- more extensible in the future, other frontends don't benefit from
>> enhancements
>> b- solves an immediate problem, but impacts all frontends (i.e. space
>> used in index).
>>
>> The only other bit in my mind is whether we need to ensure
>> index-cross-application compatibility. I suspect some of this will tie in
>> with the good work that Sijo has done on index management.
>>
>>
>> The index management will be more critical with such a change. I've
>> talked about having a manifest which defines the characteristics of the
>> index. If we share an index created by two different systems, it will be
>> important to "know" what an index supports.
>>
>> As described above, I'd like to avoid this. I don't think a frontend
> should have to worry about other frontends 'corrupting' the index (i.e.
> redefining fields, changing the store status, etc.). I'd rather have my own
> index at that point.
>
>
>
>> One of the changes that is being worked on is the update to a more recent
>> version of Lucene. This affects how stemming is done. The way we are doing
>> it now is deprecated and dropped.
>>
>>
>> Let me know what your preferences are.
>>
>>
>> Progress not perfection. Shared, configurable changes.
>>
>> Chris
>>
>> _______________________________________________
>> jsword-devel mailing list
>> jsword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>
>>
>>
>