[jsword-devel] Searching in German

Fri Feb 8 06:56:42 MST 2008

On Feb 8, 2008, at 8:02 AM, Jonathan Morgan wrote:

> On Feb 7, 2008 1:59 AM, Manfred Bergmann <bergmannmd at yahoo.de> wrote:
>>
>> Am 06.02.2008 um 15:34 schrieb DM Smith:
>>
>>> Manfred Bergmann wrote:
>>>> Hi DM.
>>>>
>>>> AFAIK, Lucene from version 2.1 can deal with leading wildcards.
>>>>
>>> Lucene 2.3 still throws an error: "Cannot parse '*ning': '*' or '?'
>>> not allowed as first character in WildcardQuery"
>>>
>>> Perhaps, there's something that needs to be done to enable it?
>>
>> Yes, there is a flag in the QueryParser class, set with
>> setAllowLeadingWildcard(boolean).
>> It has been available since v2.1.
>>
>>>
>>>
>>>> It would find "*schiff".
>>>> Wouldn't this be enough?
>>>>
>>> If a German speaker searches for "schiff", would they expect to
>>> find it in words like donaudampfschiff?
>>
>> They probably would expect to find it.
>> But sometimes you find search options like "exact search phrase".
>> Then it should find only "schiff".
>> Otherwise I would expect the search engine to add wildcards in front
>> and behind, so that any word containing this token is found.
>
> For what it's worth, I believe that exact search should be the
> default, for two main reasons:
> 1. Non-exact search has (in my opinion) greater potential to surprise
> the user than exact search.  For example, if I search for thirst in
> the Bible, then (depending on my version) I will get results such as
> thirst, thirsts, and thirsted (all of which I may want) but I will
> also get bloodthirsty.  If I search for need, then I will get need,
> needs, needed, needy, etc., but I will also get needlework.  I have
> had quite a few searches (which I can't remember offhand now) where
> non-exact search found me a few extra verses I wanted, but had a
> greater than 70% false positive rate, and as a user I don't expect
> behaviour like this to be the default.

I agree. Whole words will be the default. Sub-words might be  
reasonable to add as an "Advanced Search" capability.

However, German word construction differs from English, and compound
words are freely created. Note that searching *word* will find more
than just sub-words; it actually finds sub-strings.

The original suggestion was to add indexing of word parts. In lieu of  
that, prefixed wild-card search is sufficient.
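
To make the difference between whole-word, leading-wildcard (*schiff)
and sub-string (*schiff*) matching concrete, here is a minimal sketch.
It uses plain Java string operations rather than Lucene, and the class,
enum and method names are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustration only: plain string matching, not Lucene's query parser.
// All names here are hypothetical.
final class WildcardSketch {
    enum Mode { WHOLE_WORD, LEADING_WILDCARD, SUBSTRING }

    // Returns the entries of 'words' that match 'term' under the given mode.
    static List<String> search(List<String> words, String term, Mode mode) {
        String t = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String word : words) {
            String w = word.toLowerCase(Locale.ROOT);
            boolean match;
            switch (mode) {
                case WHOLE_WORD:
                    match = w.equals(t);      // term
                    break;
                case LEADING_WILDCARD:
                    match = w.endsWith(t);    // *term
                    break;
                default:
                    match = w.contains(t);    // *term*
                    break;
            }
            if (match) {
                hits.add(word);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "schiff", "donaudampfschiff", "schiffe", "bloodthirsty");
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + search(words, "schiff", mode));
        }
    }
}
```

With this toy matcher, *schiff picks up the compound donaudampfschiff,
while *thirst* would also pick up bloodthirsty, which is the sub-string
behavior described above.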

From following the Lucene issue, it seems it may have died over
licensing. Lucene will only include public domain or ASF-licensed code,
and the hyphenation files are licensed under many different open source
licenses. Those licenses are sufficient for our needs, though.

> What I actually really want is
> a way to search for the word need and all its derivatives without
> including every word with need in it (does stemming or something
> similar support this, and if so, does BD include it?)

With the 1.0.8 release, BD will have stemming in many different
languages and it will be applied based upon the language of the
Book/module. It will require deleting and re-indexing the module.
Stemming uses the Snowball code and it is not exact, but heuristic.
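
As a rough illustration of why stemming keeps "needs" and "needed"
together with "need" but leaves "needlework" out, here is a toy
suffix-stripping sketch. It is not the Snowball algorithm and not the
code that went into BD; the three rules, the minimum stem length and
all names are assumptions chosen only to show the idea.

```java
import java.util.Locale;

// Toy suffix stripper for illustration only. This is NOT Snowball;
// the rules and names below are hypothetical.
final class ToyStemmer {
    private static final String[] SUFFIXES = {"ed", "s", "y"};
    private static final int MIN_STEM_LENGTH = 3;

    // Strips the first matching suffix, provided at least
    // MIN_STEM_LENGTH characters remain; otherwise returns the
    // lower-cased word unchanged.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (w.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return w.substring(0, stemLength);
            }
        }
        return w;
    }

    // Two words match when they reduce to the same stem.
    static boolean sameStem(String a, String b) {
        return stem(a).equals(stem(b));
    }

    public static void main(String[] args) {
        String[] words = {"need", "needs", "needed", "needy", "needlework"};
        for (String word : words) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

Because matching is done on whole stems rather than sub-strings, a
search for "thirst" in this sketch matches "thirsts" and "thirsted"
but not "bloodthirsty". A real stemmer has far more rules and, being
heuristic, will still make mistakes.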

Many thanks to Sijo Cherian for providing the code!

You can get it today with the nightly build and give feedback.

> As a user, if I
> search for a thing and notice that it isn't coming up with exact
> search, then I can easily switch to non-exact search.  However, if
> non-exact search is the default then there is a greater chance of me
> being drowned with results that I don't want.  The aim of a search
> should probably be to get the minimum useful set of results.
>
> 2. Performance: I don't know what (if any) performance impact results
> from using a wildcard search of the form *term*, but if it
> significantly affects the performance of the search then it probably
> shouldn't be the default.

There is a noticeable performance impact, but it is not severe. You
can give it a try in the nightly build.

Giving the users the ability to choose their performance curve would  
be best, if we add it at all.

In Him,
	DM