[jsword-devel] Extending Lucene Indexes and stemming in particular

Chris Burrell chris at burrell.me.uk
Sat Apr 19 01:58:52 MST 2014


Hi Martin

What's your suggestion? Other frontends are being hampered by the lack of a
Lucene upgrade. Automatic download of indexes upon upgrade doesn't sound
like a disaster to me. It sounds annoying but necessary.

Lucene doesn't guarantee backwards compatibility past the previous major
version, and even for that previous major version DM is suggesting that
the old indexes don't work well.
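
To make that concrete, here is a rough sketch of the check a frontend would
need before opening an existing index. The class and method names are made up
for illustration and are not JSword or Lucene API. It only encodes the rule
above (a major version reads its own and the previous major version's
indexes) plus DM's point that the index must be searched with the same
normalization it was built with.

```java
// Hypothetical helper, not part of JSword or Lucene.
final class IndexCompat {
    private IndexCompat() { }

    /** True if the runtime's Lucene major version is expected to open the index at all. */
    static boolean canRead(int indexMajor, int runtimeMajor) {
        return indexMajor == runtimeMajor || indexMajor == runtimeMajor - 1;
    }

    /**
     * True if the index should be rebuilt (or downloaded again): either it
     * cannot be read, or it was built with a different normalization
     * (analyzer, stemmer) than the one now used for searching.
     */
    static boolean shouldRebuild(int indexMajor, int runtimeMajor, boolean sameNormalization) {
        return !canRead(indexMajor, runtimeMajor) || !sameNormalization;
    }

    public static void main(String[] args) {
        // A 3.x index under a 4.x runtime: readable, but only trustworthy
        // if the analysis chain has not changed.
        System.out.println(canRead(3, 4));
        System.out.println(shouldRebuild(3, 4, false));
    }
}
```

Where the index version and normalization details would be recorded (a
manifest next to the index, say) is left open here.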

The only two other options I can think of are:
- never upgrade
- keep a legacy branch for AB

Both of those sound worse.
Chris



On 19 April 2014 09:51, Martin Denham <mjdenham at gmail.com> wrote:

> That would be a disaster - 80,000 mobile devices having to download at
> least a few indexes each, most over 3G!  Alternatively attempting to force
> low power devices to regenerate all their indexes just won't work.
>
> Backwards compatibility is essential for And Bible.
>
>
> On 19 April 2014 09:40, Chris Burrell <chris at burrell.me.uk> wrote:
>
>> If that's the case, why don't we jump to 4.7.2? From the sounds of it,
>> we'll all need to rebuild indexes, and for AB they're going to need to
>> download new indexes.
>>
>> Chris
>>
>>
>>
>> On 19 April 2014 03:34, DM Smith <dmsmith at crosswire.org> wrote:
>>
>>> Currently we're on Lucene 3.0.3. The next logical step is to get to 3.6,
>>> even if it is just a stepping stone to 4. The way Lucene does releases is
>>> that they deprecate things over several releases and then remove them in a
>>> later release. Jumping from 3.0.3 to 4.0 is nearly a rewrite of our code's
>>> use of Lucene. The same is nearly true of going to 3.6, but there is help
>>> to get there.
>>>
>>> Going to 3.6 will probably require re-indexing all modules. Going to 4.0
>>> will require it.
>>>
>>> While 4.0 can read 3.x indexes, it is much more complicated to prevent
>>> "invalid" indexes. Essentially an index has to be searched by exactly the
>>> same normalization method used to construct the index. Getting from where
>>> we are now to 3.6 or 4.0 will make that really, really hard. The upshot is
>>> I wouldn't trust later Lucene to return a proper search result against an
>>> earlier index.
>>>
>>> One of the major architectural changes after 3.0.3 is how the Filter,
>>> Analyzers and Tokenizers are written. They also no longer work on strings,
>>> but character buffers. The other major change is to StandardAnalyzer to
>>> follow UAX 29 (Yay!), so we should use it instead of SimpleAnalyzer (if we
>>> can keep it from stripping stop words).
>>>
>>> I think Sijo is working on getting us to 4.0.
>>>
>>> -- DM
>>>
>>> On Apr 18, 2014, at 6:30 PM, Chris Burrell <chris at burrell.me.uk> wrote:
>>>
>>> At some point, I'd like to upgrade to Lucene 4 by the way as there are
>>> some nice features around auto-completion, etc. that I'd like to use in
>>> STEP. STEP currently inherits the Lucene libs from JSword's classpath.
>>> Because of the API changes to Lucene I'm blocked on this. It also has some
>>> performance improvements but apparently causes a 10%+ slowdown when running
>>> against version 3 indexes.
>>>
>>> So not quite sure how we manage this?
>>>
>>> Chris
>>>
>>> Apologies, I hadn't read your comment on the pull request before I saw
>>> DM's comment above "For Bible Desktop, we'll have to force re-indexing
>>> anyway. I'm finding that the old indexes don't work with the current code.
>>> I'm looking forward to using Sijo's code to fix this."
>>>
>>> Martin
>>>
>>>
>>> On 18 April 2014 21:55, Chris Burrell <chris at burrell.me.uk> wrote:
>>>
>>>> Hi Martin
>>>>
>>>> No, that's not what I'm saying. I've updated the pull request a few
>>>> minutes ago with a comment to the exact opposite.
>>>>
>>>> The pull request doesn't change the content of anything. It adds new
>>>> stemmed fields as separate document fields, and changes the configuration
>>>> of some fields to be stored as well as indexed/analyzed.
>>>>
>>>> Have tested on both old and new indexes locally and it works absolutely
>>>> fine. But it would be worth you testing as well.
>>>>
>>>> What bit was confusing?
>>>> Chris
>>>>
>>>>
>>>>
>>>> On 18 April 2014 21:50, Martin Denham <mjdenham at gmail.com> wrote:
>>>>
>>>>> Sorry, are you saying that the changes will break all previously
>>>>> generated indexes?  This will be a problem.
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> On 18 April 2014 21:05, Chris Burrell <chris at burrell.me.uk> wrote:
>>>>>
>>>>>> Hi DM
>>>>>>
>>>>>> *1- Stemming*
>>>>>> Yes, I was expecting the stem to give a hit, but we found it matched
>>>>>> more words than we were wanting. I can't think of an example off the top of
>>>>>> my head. The other thing we found is that you can't share the same field
>>>>>> because stems aren't always prefixes.
>>>>>>
>>>>>> (For example, there are two PorterStemmers available in the Lucene 3 /
>>>>>> JSword classpath at the moment. One of them, I can't remember which,
>>>>>> stems genealogy to genealogi; the other stems it to genealog.)
>>>>>>
>>>>>> So for highlighting, you definitely would need to use Lucene, and I'm
>>>>>> not entirely sure how well it would cope
>>>>>>
>>>>>> In STEP we use it for various things, most of which are related to
>>>>>> finding a 'topic' or identifying 'meanings' of words, rather than for
>>>>>> actual word searches. When a user picks a word, they want that word. But we
>>>>>> allow searching for 'love' as a topic, using Naves, or as a word, looking
>>>>>> through a lexicon for all entries matching the stem.
>>>>>>
>>>>>> *2- Segregating apps*
>>>>>> I think for this, we would want to allow a frontend to register its
>>>>>> name (prefix and name?). This would allow us to create indexes such as
>>>>>> esv-bd, esv-ab, esv-step, etc. It would also allow for application specific
>>>>>> sidecar configurations.  The logic would then go app-specific,
>>>>>> jsword-specific, sword-specific.
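
Roughly what I have in mind, as a sketch only. Nothing here exists in JSword;
the class, the method names and the use of plain maps for the three
configuration levels are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of per-application index names and layered config lookup.
final class AppIndexNaming {
    private AppIndexNaming() { }

    /** Builds an index name such as "esv-bd" from a module name and a frontend prefix. */
    static String indexName(String module, String appPrefix) {
        return module.toLowerCase(Locale.ROOT) + "-" + appPrefix.toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the first value found for the key, checking app-specific,
     * then jsword-specific, then sword-specific settings; null if none has it.
     */
    static String lookup(String key,
                         Map<String, String> appSpecific,
                         Map<String, String> jswordSpecific,
                         Map<String, String> swordSpecific) {
        for (Map<String, String> level : Arrays.asList(appSpecific, jswordSpecific, swordSpecific)) {
            String value = level.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

So indexName("ESV", "step") gives esv-step, and a setting defined by the
frontend wins over the same key at the jsword or sword level.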
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 18 April 2014 20:47, DM Smith <dmsmith at crosswire.org> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Apr 18, 2014, at 4:23 AM, Chris Burrell <chris at burrell.me.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 18 April 2014 01:40, DM Smith <dmsmith at crosswire.org> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hello
>>>>>>>>
>>>>>>>> STEP uses stemming to improve search results, in some queries
>>>>>>>> (whether on Sword modules or otherwise).
>>>>>>>>
>>>>>>>>
>>>>>>>> Stemming is very useful. On occasion, there is a need for a
>>>>>>>> non-stemmed search. Especially for theological purposes. But for general
>>>>>>>> purpose searching it should be the default.
>>>>>>>>
>>>>>>>> Are you suggesting we have 'heading' being the stemmed search and
>>>>>>> fullHeading (or something like that) being the non-stemmed? I do think that
>>>>>>> by default however, we should have the normal search. We experimented with
>>>>>>> stemming in STEP by default and it can be quite confusing to look for a
>>>>>>> particular word and hit others. Stemming doesn't always work the way you
>>>>>>> expect.
>>>>>>>
>>>>>>>
>>>>>>> I guess I'm confused by your previous comment. I thought you were
>>>>>>> expecting the stem to give a hit.
>>>>>>>
>>>>>>> I think it is confusing because the 'hit' is not highlighted. If a
>>>>>>> stem is highlighted then the user can quickly see and determine that it was
>>>>>>> something they didn't want.
>>>>>>>
>>>>>>> Personally, I don't like stemming because I'm looking for a certain
>>>>>>> word not heuristic variations of the word. Also, I don't like dropping stop
>>>>>>> words (aka noise words) as many of them are theologically significant (e.g.
>>>>>>> in Christ).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> I've some times thought it'd be good to double index: stemmed and
>>>>>>>> full word.
>>>>>>>>
>>>>>>>> Double indexing is needed if you want both. The stem for genealogy
>>>>>>> resolves to genealogi (because of the plurals), which is why my search
>>>>>>> didn't get a hit. We can't use the same field.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> There are currently 2 limitations in JSword, both of which could
>>>>>>>> easily be fixed. Please let me know if you have concerns around me
>>>>>>>> implementing both.
>>>>>>>>
>>>>>>>> a- the frontend can't extend/control the use of indexes. I'm
>>>>>>>> suggesting we add a registerFieldIndexer(fieldIndexer) with a simple
>>>>>>>> interface: indexField(doc, osis). This would allow each frontend to
>>>>>>>> specify its own indexing, i.e. to index new things, or enable
>>>>>>>> term vectors / store fields, etc.
>>>>>>>>
>>>>>>>>
>>>>>>>> I'd really rather that we didn't go down this route. I don't mind
>>>>>>>> plugin architecture as a way to experiment with different techniques, but
>>>>>>>> I'd really rather that we all benefit from the changes.
>>>>>>>>
>>>>>>>> Fine.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> b- Extend the LuceneIndex to have a stemmed version of the heading.
>>>>>>>> We could replace the existing index, but that would mean all frontends will
>>>>>>>> require re-indexing.
>>>>>>>>
>>>>>>>>
>>>>>>>> I think the same manner that we index the main verse text should be
>>>>>>>> applied to all text: intro, heading and verse text.
>>>>>>>>
>>>>>>>> Happy to do the change for all three.
>>>>>>>
>>>>>>>
>>>>>>> For Bible Desktop, we'll have to force re-indexing anyway. I'm
>>>>>>> finding that the old indexes don't work with the current code. I'm looking
>>>>>>> forward to using Sijo's code to fix this.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> c- Had JSword been configured to 'STORE' the content of some
>>>>>>>> fields, I would have used that for headings. For example, if the headings
>>>>>>>> is stored in the index, STEP would not need to do an osis extract and XML
>>>>>>>> transform to display to the user. It could come straight from the index.
>>>>>>>> Two possibilities here: change the existing index field configuration, or
>>>>>>>> duplicate into a different field.
>>>>>>>>
>>>>>>>>
>>>>>>>> I think we should make store an option, possibly the standard.
>>>>>>>>
>>>>>>> What I don't want to happen is end up in a situation where the Index
>>>>>>> is shared in different configurations by different apps. That would break
>>>>>>> the frontend.
>>>>>>>
>>>>>>>
>>>>>>> Yep. If we can agree on what and how, that'd be best.
>>>>>>>
>>>>>>> Even if you can ask 'do you support X?', that's unnecessary
>>>>>>> complexity. It means that a user will have to re-index each book he has
>>>>>>> in order to support different frontends. It also means that if a frontend
>>>>>>> forgets to ask whether some fields are indexed in a particular way, it is
>>>>>>> going to have broken functionality because another frontend has
>>>>>>> overridden the defaults. At this stage, I'd rather have app-specific
>>>>>>> indices.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Right now the way we do the index prevents us from using Lucene to
>>>>>>>> highlight the search hit. If that is STORE, then I'd be in favor of making
>>>>>>>> STORE standard. I wonder if our stripping the text to not include OSIS
>>>>>>>> before indexing will frustrate this change.
>>>>>>>>
>>>>>>>> Store is a requirement for highlighting (
>>>>>>> http://lucene.472066.n3.nabble.com/Highlighting-for-non-stored-fields-td1773015.html and
>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ).
>>>>>>>
>>>>>>>
>>>>>>> It still should be an option for the sake of devices that are disk
>>>>>>>> limited.
>>>>>>>>
>>>>>>>> d- the other side of c- is that ideally multiple headings should be
>>>>>>>> stored in multiple entries to the same field, rather than a concatenation
>>>>>>>> of the field (doesn't much matter if it's only ANALYZED)
>>>>>>>>
>>>>>>>>
>>>>>>>> Some verses have headings in the middle of the verse. Don't make
>>>>>>>> the mistake of assuming an order of heading. Or that heading contains only
>>>>>>>> pre-verse material or all pre-verse material.
>>>>>>>>
>>>>>>>> I'm not making that mistake... All I'm saying is that headings
>>>>>>> should be stored in different entries in the same field.
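>>>>>>> // Lucene 3.x: one Field per heading, all added under the same field name
>>>>>>> doc.add(new Field(fieldName, heading1, Field.Store.YES, Field.Index.ANALYZED));
>>>>>>> doc.add(new Field(fieldName, heading2, Field.Store.YES, Field.Index.ANALYZED));
>>>>>>> doc.add(new Field(fieldName, heading3, Field.Store.YES, Field.Index.ANALYZED));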
>>>>>>>
>>>>>>> This means that you could retrieve one of the headings you want,
>>>>>>> rather than all. i.e. Psalm 3.1 Non-canon-heading Canon-heading could have
>>>>>>> 3 separate fields.
>>>>>>>
>>>>>>>
>>>>>>> This would be a good change.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> *I only need one of a- or b- to be able to progress. Happy to do
>>>>>>>> either. I don't need c- because I've worked around, but it would have been
>>>>>>>> nice to have some control over that. *
>>>>>>>>
>>>>>>>> pros & cons:
>>>>>>>> a- more extensible in the future, other frontends don't benefit
>>>>>>>> from enhancements
>>>>>>>> b- solves an immediate problem, but impacts all frontends (i.e.
>>>>>>>> space used in index).
>>>>>>>>
>>>>>>>> The only other bit in my mind is whether we need to ensure
>>>>>>>> index-cross-application compatibility. I suspect some of this will tie in
>>>>>>>> with the good work that Sijo has done on index management.
>>>>>>>>
>>>>>>>>
>>>>>>>> The index management will be more critical with such a change. I've
>>>>>>>> talked about having a manifest which defines the characteristics of the
>>>>>>>> index. If we share an index created by two different systems, it will be
>>>>>>>> important to "know" what an index supports.
>>>>>>>>
>>>>>>>> as described above, I'd like to avoid this. I don't think a
>>>>>>> frontend should have to worry about other frontends 'corrupting' the index
>>>>>>> (i.e. redefining fields, changing the store status, etc.). I'd rather have
>>>>>>> my own index at that point.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> One of the changes that is being worked on is the update to a more
>>>>>>>> recent version of Lucene. This affects how stemming is done. The way we are
>>>>>>>> doing it now is deprecated and dropped.
>>>>>>>>
>>>>>>>>
>>>>>>>> Let me know what your preferences are.
>>>>>>>>
>>>>>>>>
>>>>>>>> Progress not perfection. Shared, configurable changes.
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> jsword-devel mailing list
>>>>>>>> jsword-devel at crosswire.org
>>>>>>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>