[sword-devel] [sword-support] Locales
Troy A. Griffitts
scribe at crosswire.org
Sat Sep 13 14:31:40 MST 2008
Thanks for the comments, DM.
DM Smith wrote:
> Some observations I've made regarding Lucene (that may also apply to
> any other search engine):
>
> The index and the search request must be normalized in the same
> fashion. There are several aspects to normalization:
Right, maybe I didn't make it clear in my last email, but this is
exactly what I was trying to say is already in place. The Lucene index
creator in SWORD and also the non-index search mechanism in SWORD both
call the StripText method. Since the frontends also call this method,
everything is normalized in the same way. What we may not have is
enough normalization enforcement going on in our StripFilters, which
was my comment about adding better handling of, say, Czech with a
filter that does whatever we decide for Czech (no accents, or
whatever). If we want, we could also force encoding normalization
using ICU here. We know the encoding of old modules because of the
Encoding= entry in the .conf. So, as an example, we could have SWMgr
add a correct-encoding/NFC filter to each module based on the Encoding
entry, and then, for a language like German, something to fold
everything to ae, oe, ue, ss, or whatever we decide is best.
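
To make that concrete, here is a rough sketch of what such a language
strip filter could look like. This is illustrative only: the class name
is made up, it assumes SWORD's SWFilter::processText interface and
ICU's Transliterator, and a real filter would cache the transliterator
instead of recreating it on every call.

/* Hypothetical "czNormalize"-style strip filter (sketch, not engine code). */

#include <swfilter.h>            // SWORD: sword::SWFilter
#include <swbuf.h>               // SWORD: sword::SWBuf
#include <unicode/translit.h>    // ICU: icu::Transliterator
#include <unicode/unistr.h>      // ICU: icu::UnicodeString
#include <unicode/utrans.h>      // ICU: UTRANS_FORWARD
#include <string>

class CzechPlainFilter : public sword::SWFilter {
public:
    virtual char processText(sword::SWBuf &text,
                             const sword::SWKey * = 0,
                             const sword::SWModule * = 0) {
        UErrorCode status = U_ZERO_ERROR;
        // Decompose, drop combining marks, recompose: "kříž" becomes "kriz".
        icu::Transliterator *strip = icu::Transliterator::createInstance(
            "NFD; [:Nonspacing Mark:] Remove; NFC", UTRANS_FORWARD, status);
        if (U_FAILURE(status) || !strip)
            return -1;                   // leave the text untouched on error

        icu::UnicodeString ustr = icu::UnicodeString::fromUTF8(text.c_str());
        strip->transliterate(ustr);

        std::string utf8;
        ustr.toUTF8String(utf8);
        text = utf8.c_str();
        delete strip;
        return 0;
    }
};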
But the issue isn't having a common place to be sure both the search
string and the text are encoded and normalized the same way; that
concept is already in the engine. We just need to add further
normalization filters.
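
And to sketch the SWMgr side of the wiring (again illustrative only:
attachNormalizationFilters is not an existing SWMgr hook, UTF8NFC is
the engine's ICU-based NFC filter, CzechPlainFilter is the hypothetical
filter sketched above, and filter ownership/cleanup is omitted),
attaching filters per module based on the .conf entries could look
roughly like this:

#include <swmodule.h>            // SWORD: sword::SWModule
#include <utf8nfc.h>             // SWORD: sword::UTF8NFC (ICU NFC filter)
#include <cstring>               // strcmp
#include "czechplainfilter.h"    // hypothetical header for the sketch above

// Attach normalization strip filters to a module, driven by the values of
// its .conf Encoding= and Lang= entries (passed in by the caller).
void attachNormalizationFilters(sword::SWModule *module,
                                const char *encoding,
                                const char *lang) {
    if (encoding && !strcmp(encoding, "UTF-8")) {
        // Force composed form (NFC) so indexed text and search terms agree.
        module->AddStripFilter(new sword::UTF8NFC());
    }
    if (lang && !strcmp(lang, "cs")) {
        // Language-specific folding (accents stripped), as discussed above.
        module->AddStripFilter(new CzechPlainFilter());
    }
}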
It would also be nice to have SWMgr add the appropriate Analyzer for
the language, if clucene is available. Nice point.
Thanks for the comments.
-Troy.
> The same Lucene analyzer that is used to build the index needs to be
> used to prepare the user input for search. The responsibility of a
> Lucene analyzer is to do data normalization and tokenization for both
> indexing and search. However, Lucene does not normalize the input
> character encoding.
>
> While our new modules are NFC, there is nothing to say whether our
> older modules are NFC or not. When the index is built it is important
> to know what is stored and how it is stored. I.e. whether it is UTF-8
> or cp1252, and if UTF-8 whether it is NFC. And it is important to know
> whether the diacritics are removed or not. Until we have deterministic
> knowledge of this, we cannot normalize the search request to match the
> index. And if the two don't match, searches will give wrong results.
>
> The user's search request needs to be normalized in exactly the same
> fashion as the index. Generally the user will input decomposed UTF-8,
> that is, they will enter a letter and then the diacritics. When there
> is more than one diacritic, they can generally be in any order. If the
> user is cutting and pasting from Latin-1 and searching UTF-8 (or vice
> versa), that's a problem too. The other thing Peter pointed out is
> that some user input is language dependent, such as the ae, oe, ue for
> umlauted a, o and u.
>
> Lucene's StandardAnalyzer is appropriate for English, but not for
> other languages as it uses English stop words, English rules for
> acronyms, etc. And for Thai and other languages that don't use spaces
> to separate words a different "break iterator" is needed. Ultimately,
> each language needs its own analyzer.
>
> The generally recommended way to index diacritical text: normalize to
> a known encoding and normalization form (e.g. UTF-8, NFC) and store it
> in a field in multiple forms, e.g.:
>   - As is.
>   - Un-accented.
>   - Alternate language-dependent forms, e.g. stemmed, umlauts expanded,
>     compound words separated, ...
> The trick here is that these all have the same position increment.
>
> In Him,
> DM
>
>
> On Sep 13, 2008, at 10:20 AM, Troy A. Griffitts wrote:
>
>> Thanks Peter,
>>
>> Yeah, I believe our new modules are normalized with ICU to be standard
>> NFC (Normal Form Composed). Here's an interesting comment regarding
>> Arabic:
>>
>> http://unicode.org/faq/normalization.html#8
>>
>> Your suggestion about normalizing the search string and also the
>> indexed search text of the module is exactly what we do for Greek.
>> You can search with or without diacritics and transcription
>> annotations ([], (), etc.) and find results with or without such.
>>
>> In SWORD there is a concept of 'Strip Filters' which are used to
>> filter the text body before it is sent to the indexer. These
>> typically remove all the markup. Some modules have extra filters
>> added by placing an extra entry in their .conf file. An example of
>> this is the papyri transcription annotation mentioned above. You will
>> see the line:
>>
>> LocalStripFilter=PapyriPlain
>>
>> added to:
>>
>> hesychius.conf
>> phi_chr.conf
>> ddp.conf
>>
>> And the SWORD engine has an overloaded SWModule::StripText() method.
>>
>> Called with no parameters, it will return the stripped text of the
>> module. If you supply a const char *buffer, the method will run your
>> buffer through the same filters the module uses.
>>
>> So typically, before sending a user-supplied search term to the
>> search method, a programmer would call StripText on the search term,
>> e.g.:
>>
>> SWBuf userSearchTerm = searchEditBox.getText();
>> userSearchTerm = currentModule.StripText(userSearchTerm);
>> ListKey results = currentModule.search(userSearchTerm);
>>
>> If I'm not explaining clearly how this applies... if we decided to
>> add:
>>
>> LocalStripFilter=czNormalize
>>
>> to: czecep.conf
>>
>> (provided we had a simple filter which decided how to normalize Czech)
>>
>> Everything should be in place to make things work.
>>
>> Does this make sense?
>>
>> -Troy.
>>
>>
>>
>>
>>
>> Peter von Kaehne wrote:
>>> DM and I thought about this a while back wrt some problems we had
>>> with Farsi - essentially there are three scenarios for each
>>> diacritic sign: not there, integrated, or extra. Modules are usually
>>> a mixture of integrated and extra diacritics, or more or less purely
>>> one or the other.
>>>
>>> Search entries depend heavily on the keyboard available - a German
>>> searching on a German keyboard will use umlauts, a German searching
>>> on a British keyboard will use ae, ue or oe, and someone else
>>> searching a German text might well search simply for a, o or u.
>>>
>>> So the best way forward appeared at the time to be to normalise
>>> both the text and the search entry and accept the possibility of
>>> extraneous results - particularly around Latinate scripts.
>>>
>>> Alternatively - and I think there is a lot of mileage in this - we
>>> should/could demand that modules are designed cleanly in terms of
>>> diacritics (i.e. only sequential) and rectified wherever there is
>>> a problem. Subsequently only the search entries would need to be
>>> normalised, or, even better, they could be subject to user settings.
>>>
>>> Peter
>>>
>>>
>>>
>>>
>>> -------- Original Message --------
>>>> Date: Sat, 13 Sep 2008 08:43:08 +0100
>>>> From: "Troy A. Griffitts" <scribe at crosswire.org>
>>>> To: SWORD Support Volunteers <sword-support at crosswire.org>, refdoc at gmx.net,
>>>> SWORD Developers' Collaboration Forum <sword-devel at crosswire.org>
>>>> Subject: Re: [sword-support] Locales
>>>> I would guess that if we built lucene indexes for that Bible,
>>>> lucene would search ignoring accents?
>>>>
>>>> Or that module is not UTF-8?
>>>>
>>>> We have filters that we use on ancient Greek texts that allow
>>>> searching regardless of diacritics. We could add a set for any
>>>> language, but I'm not sure if this is the right location to place
>>>> responsibility. Maybe if it were an ICU filter that could work for
>>>> any language -- like if it's just a normalization problem. We could
>>>> use that one filter for all Bibles, like we do the filter for Greek.
>>>>
>>>> Not sure, just thinking out loud.
>>>>
>>>> -Troy.
>>>>
>>>>
>>>>
>>>>
>>>> Peter von Kaehne wrote:
>>>>> Thanks. This is a known problem which causes a lot of difficulties
>>>>> in all languages which rely on diacritics.
>>>>> There is a plan to improve the search facility.
>>>>>
>>>>> Peter
>>>>>
>>>>> -------- Original Message --------
>>>>>> Date: Fri, 12 Sep 2008 19:57:58 +0200 (CEST)
>>>>>> To: sword-bugs at crosswire.org
>>>>>> Subject: [sword-support] Locales
>>>>>> Peace and love to my brothers and sisters in Jesus Christ, our
>>>>>> Lord, from Jan, His weak servant.
>>>>>>
>>>>>> I am sorry to inform you about an error in the search engine of
>>>>>> The Bible Tool. While using Czech, the search does not correctly
>>>>>> interpret all the letters with diacritics, e.g.
>>>>>>
>>>>>> while typing the request:
>>>>>>
>>>>>> Nesl svůj kříž
>>>>>>
>>>>>>
>>>>>> http://www.crosswire.org/study/wordsearchresults.jsp?searchTerm=Nesl+sv%C5%AFj+k%C5%99%C3%AD%C5%BE
>>>>>> the result says that there is
>>>>>>> 0 result in the text of Czech Ekumenicky Cesky preklad<
>>>>>> even though the searched text was copied & pasted directly from it.
>>>>>>
>>>>>> I hope it needs only a minor repair, since the search gives good
>>>>>> results when looking for phrases without Czech-specific letters.
>>>>>>
>>>>>> A wish: the search default is "exact match", hence:
>>>>>>> Co jsem napsal, napsal< gives a result
>>>>>> but
>>>>>>> co jsem napsal, napsal< gives 0 results
>>>>>> As people use the search to help their poor memory, I wish to
>>>>>> really help them with less "censorious" matching criteria. These
>>>>>> can be useful in the "Advanced search".
>>>>>>
>>>>>> May God help your "Opus Dei".
>>>>>>
>>>>>>