[jsword-devel] Extending Lucene Indexes and stemming in particular

Chris Burrell chris at burrell.me.uk
Sun Apr 20 00:39:35 MST 2014


Hi Sijo

That wouldn't do what I want. I need the non stemmed body content and a
separate stemmed heading field.

Even if I did want the stemmed body, I would want it in addition to the non
stemmed body.
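
Roughly, the layout I'm after looks like the sketch below. It is plain Java
with no Lucene dependency, only to show the idea of one filter per field. The
class name, the "headingStemmed" field name and the toy suffix stripper are all
made up for illustration; a real version would use the Snowball stemmer, and in
Lucene the per-field hook would presumably be PerFieldAnalyzerWrapper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical illustration only: none of these names exist in JSword or Lucene. */
final class PerFieldAnalysisSketch {

    /** Toy suffix stripper standing in for a real stemmer such as Snowball. */
    static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    /** Per-field token filters; fields not listed here keep their tokens unchanged. */
    static final Map<String, UnaryOperator<String>> FILTERS = new LinkedHashMap<>();
    static {
        // Body text stays unstemmed.
        FILTERS.put("content", UnaryOperator.identity());
        // Assumed name for the separate stemmed heading field.
        FILTERS.put("headingStemmed", PerFieldAnalysisSketch::toyStem);
    }

    /** Lowercases, splits on non-alphanumerics, then applies the field's filter. */
    static List<String> analyze(String field, String text) {
        UnaryOperator<String> filter = FILTERS.getOrDefault(field, UnaryOperator.identity());
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(filter.apply(raw));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String heading = "Jesus Heals the Blind";
        System.out.println(analyze("content", heading));
        System.out.println(analyze("headingStemmed", heading));
    }
}
```

The same filter map would have to be used at query time, otherwise a stemmed
field gets searched with unstemmed terms.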

As I said, happy to remove the other ones. They were put in at DM's
suggestion.

Chris
 On 20 Apr 2014 03:09, "Sijo Cherian" <sijo.cherian at gmail.com> wrote:

> Chris,
> Since we already have a language based Analyzer configuration, if you can
> provide a custom jsword/src/main/resources/AnalyzerFactory.properties in
> STEP and add custom config for english like this:
>
>
> en.Analyzer=org.crosswire.jsword.index.lucene.analysis.ConfigurableSnowballAnalyzer
>
> This will stem the "content" field, both during indexing & query. Can you
> override prop files in your classpath, easily?
>
> Regarding your requirement to stem the heading: Since the current impl for
> "heading" uses the default analyzer, you will have to change prop
> "Default.Analyzer" to snowball, but that will have bigger impact - uses
> snowball for all other fields.
>
>
>
>
> On Sat, Apr 19, 2014 at 4:14 AM, Chris Burrell <chris at burrell.me.uk> wrote:
>
>> I don't mind configuration so long as these indexes are stored separately
>> per app.
>>
>> STEP relies on stemming, and in the places where it uses it we can't ask
>> the user, nor would it make sense to. So things would break and be quite
>> hard to debug.
>> Chris
>> On 19 Apr 2014 06:13, "Sijo Cherian" <sijo.cherian at gmail.com> wrote:
>>
>>>
>>> Great discussion. isProgress.
>>>
>>> I am still pondering all the benefits of double indexing the entire
>>> content.
>>>
>>> For specialized users who don't want stemming to factor into their
>>> searching: can we provide an API for them to specify params like noStemming,
>>> noLowercase etc. at indexing time on a per-book basis, and persist
>>> that metadata in a property file? Then use exactly those properties during
>>> query analysis.
>>> These users probably won't want auto-reindexing on a major JSword upgrade.
>>>
>>> Easter is almost here!
>>> -sijo
>>> On Thu, Apr 17, 2014 at 8:40 PM, DM Smith <dmsmith at crosswire.org> wrote:
>>>
>>>>
>>>> On Apr 17, 2014, at 12:09 PM, Chris Burrell <chris at burrell.me.uk>
>>>> wrote:
>>>>
>>>> Hello
>>>>
>>>> STEP uses stemming to improve search results, in some queries (whether
>>>> on Sword modules or otherwise).
>>>>
>>>>
>>>> Stemming is very useful. On occasion, there is a need for a non-stemmed
>>>> search. Especially for theological purposes. But for general purpose
>>>> searching it should be the default.
>>>>
>>>> I've sometimes thought it'd be good to double index: stemmed and full
>>>> word.
>>>>
>>>>
>>>> There are currently 2 limitations in JSword, both of which could easily
>>>> be fixed. Please let me know if you have concerns around me implementing
>>>> both.
>>>>
>>>> a- the frontend can't extend/control the use of indexes. I'm suggesting
>>>> we add a registerFieldIndexer(fieldIndexer) with a simple interface:
>>>> indexField(doc, osis). This would allow frontends to specify their own
>>>> indexing. This would allow a frontend to index new things, or enable term
>>>> vectors / store fields, etc.
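
For a-, the shape I have in mind is roughly the following. This is only a
sketch: the doc is modelled as a plain map and the osis as a String, standing
in for whatever Lucene and OSIS types JSword would really pass, and the class
name and the "frontendField" name are invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of proposal a-; these types do not exist in JSword. */
final class FieldIndexerSketch {

    /** The "simple interface" from the proposal; parameter types are stand-ins. */
    interface FieldIndexer {
        void indexField(Map<String, List<String>> doc, String osis);
    }

    private final List<FieldIndexer> indexers = new ArrayList<>();

    /** Lets a frontend plug in its own indexing step. */
    void registerFieldIndexer(FieldIndexer fieldIndexer) {
        indexers.add(fieldIndexer);
    }

    /** Adds the built-in field, then lets every registered indexer add its own. */
    Map<String, List<String>> buildDocument(String osis) {
        Map<String, List<String>> doc = new LinkedHashMap<>();
        doc.computeIfAbsent("content", k -> new ArrayList<>()).add(osis);
        for (FieldIndexer indexer : indexers) {
            indexer.indexField(doc, osis);
        }
        return doc;
    }

    public static void main(String[] args) {
        FieldIndexerSketch index = new FieldIndexerSketch();
        // A frontend-specific field; upper-casing is just a visible placeholder.
        index.registerFieldIndexer((doc, osis) ->
                doc.computeIfAbsent("frontendField", k -> new ArrayList<>())
                   .add(osis.toUpperCase(Locale.ROOT)));
        System.out.println(index.buildDocument("In the beginning"));
    }
}
```
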
>>>>
>>>>
>>>> I'd really rather that we didn't go down this route. I don't mind
>>>> plugin architecture as a way to experiment with different techniques, but
>>>> I'd really rather that we all benefit from the changes.
>>>>
>>>>
>>>> b- Extend the LuceneIndex to have a stemmed version of the heading. We
>>>> could replace the existing index, but that would mean all frontends will
>>>> require re-indexing.
>>>>
>>>>
>>>> I think the same manner that we index the main verse text should be
>>>> applied to all text: intro, heading and verse text.
>>>>
>>>>
>>>> c- Had JSword been configured to 'STORE' the content of some fields, I
>>>> would have used that for headings. For example, if the headings are stored
>>>> in the index, STEP would not need to do an osis extract and XML transform
>>>> to display to the user. It could come straight from the index. Two
>>>> possibilities here: change the existing index field configuration, or
>>>> duplicate into a different field.
>>>>
>>>>
>>>> I think we should make store an option, possibly the standard.
>>>>
>>>> Right now the way we do the index prevents us from using Lucene to
>>>> highlight the search hit. If that is STORE, then I'd be in favor of making
>>>> STORE standard. I wonder if our stripping the text to not include OSIS
>>>> before indexing will frustrate this change.
>>>>
>>>> It still should be an option for the sake of devices that are disk
>>>> limited.
>>>>
>>>> d- the other side of c- is that ideally multiple headings should be
>>>> stored in multiple entries to the same field, rather than a concatenation
>>>> of the field (doesn't much matter if it's only ANALYZED)
>>>>
>>>>
>>>> Some verses have headings in the middle of the verse. Don't make the
>>>> mistake of assuming an order of heading. Or that heading contains only
>>>> pre-verse material or all pre-verse material.
>>>>
>>>>
>>>> *I only need one of a- or b- to be able to progress. Happy to do
>>>> either. I don't need c- because I've worked around, but it would have been
>>>> nice to have some control over that. *
>>>>
>>>> pros & cons:
>>>> a- more extensible in the future, other frontends don't benefit from
>>>> enhancements
>>>> b- solves an immediate problem, but impacts all frontends (i.e. space
>>>> used in index).
>>>>
>>>> The only other bit in my mind is whether we need to ensure
>>>> index-cross-application compatibility. I suspect some of this will tie in
>>>> with the good work that Sijo has done on index management.
>>>>
>>>>
>>>> The index management will be more critical with such a change. I've
>>>> talked about having a manifest which defines the characteristics of the
>>>> index. If we share an index created by two different systems, it will be
>>>> important to "know" what an index supports.
>>>>
>>>> One of the changes that is being worked on is the update to a more
>>>> recent version of Lucene. This affects how stemming is done. The way we are
>>>> doing it now is deprecated and dropped.
>>>>
>>>>
>>>> Let me know what your preferences are.
>>>>
>>>>
>>>> Progress not perfection. Shared, configurable changes.
>>>>
>>>> Chris
>>>>
>>>> _______________________________________________
>>>> jsword-devel mailing list
>>>> jsword-devel at crosswire.org
>>>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Sijo
>>>
>>>
>>>
>>
>>
>
>
> --
> Regards,
> Sijo
>