[sword-devel] Normalization?

Troy A. Griffitts scribe at crosswire.org
Wed Aug 31 09:37:02 MST 2011



On 31/08/11 07:47, DM Smith wrote:
> Troy, users typically input decomposed text for a search request.
> The module is typically composed text. When creating a Lucene index,
> is the text decomposed and then stripped? (I don't remember seeing
> that in the code.)

Yes, the strip filters are run during Lucene index creation.  If the
module has a decomposition strip filter added, then it will be run.
This is the designed way to handle the issue.

For Greek, Hebrew, and Arabic we have special logic to strip accents and
pointing.
http://crosswire.org/svn/sword/trunk/src/modules/swmodule.cpp
(see "ccent")
This is not ideal and should be moved to strip filter logic.
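
As a rough illustration of what stripping accents and pointing involves,
here is a minimal Python sketch. It is not the logic in swmodule.cpp; it
assumes that decomposing to NFD and dropping every nonspacing mark
(Unicode category Mn) is an acceptable approximation, and the function
name strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop nonspacing marks (accents, pointing).

    Hypothetical helper for illustration only; not part of SWORD.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Greek: the accent on the alpha is removed.
greek = strip_marks("μακάρ")

# Hebrew: vowel points and the shin dot are removed (written with escapes
# so the exact code points are unambiguous).
hebrew = strip_marks("\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd")
```

Because the text is decomposed first, precomposed and decomposed input
end up in the same stripped form.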

The example from the thread I referenced in my last email (probably
tiresome by now, since I keep posting it) is:

A search using an unaccented search term (μακαρ) over Greek inscriptions
containing critical annotation:

http://crosswire.org/study/wordsearchresults.jsp?searchTerm=%CE%BC%CE%B1%CE%BA%CE%B1%CF%81&mod=PHI_CHR

Notice the search string: μακαρ,
and the matches:

μακάρ
μ[ακαρ]
Μακαρ
μακαρ
μ]ακαρ
μακα[ρ]

etc.

Also, the search term: Μάκαρ,
yields the same 33 hits:

http://crosswire.org/study/wordsearchresults.jsp?searchTerm=%CE%9C%E1%BD%B1%CE%BA%CE%B1%CF%81
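
The way these variants can all land on the same hits can be sketched in
Python. This is only an illustration under stated assumptions: that the
preparation removes accents, square editorial brackets, and case
differences. The function name search_key is hypothetical, and the
actual filters configured for the PHI_CHR module may differ.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical preparation: strip marks, editorial brackets, and case."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    no_brackets = no_marks.replace("[", "").replace("]", "")
    return no_brackets.casefold()


# The matches listed above, plus the accented, capitalized search term.
variants = ["μακάρ", "μ[ακαρ]", "Μακαρ", "μακαρ", "μ]ακαρ", "μακα[ρ]", "Μάκαρ"]

# Every variant should reduce to one and the same key.
keys = {search_key(v) for v in variants}
```

With the same preparation applied to the text and to the search term,
it no longer matters which of these forms the user typed.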

If anything, this is a module configuration issue and a frontend policy
issue, in case not all frontends follow the suggestion to process user
search input before sending it to the engine.

I have considered forcing this logic by placing it into the search
method itself, but I worry that it might take away the option of some
searches.  I've leaned toward making it a recommended policy for
frontends for now.
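
The trade-off can be sketched with a toy index in Python. Everything
here is hypothetical (ToyIndex, prepare, and the prepare_term flag are
invented for illustration and are not SWORD or Lucene APIs); the sketch
only assumes that the indexed text was stripped and that the caller
decides whether the search term gets the same treatment.

```python
import unicodedata
from collections import defaultdict


def prepare(text: str) -> str:
    """Hypothetical preparation shared by indexing and searching."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.casefold()


class ToyIndex:
    """Minimal word index whose entries are always stored in prepared form."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        # Index creation always runs the preparation step.
        for word in prepare(text).split():
            self._postings[word].add(doc_id)

    def search(self, term: str, prepare_term: bool = True) -> list:
        # The caller (the frontend) chooses whether to prepare the term.
        key = prepare(term) if prepare_term else term
        return sorted(self._postings.get(key, set()))


index = ToyIndex()
index.add(1, "Μάκαρ")
index.add(2, "μακαρ")

hits_prepared = index.search("Μάκαρ")                  # term prepared like the index
hits_raw = index.search("Μάκαρ", prepare_term=False)   # term passed through untouched
```

In this sketch the prepared term finds both entries, while the raw
accented term finds none because the index holds only stripped forms.
Keeping the step optional leaves the decision with the caller rather
than hard-wiring it into the search call.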

Troy



> 
> DM
> 
> On Aug 31, 2011, at 9:21 AM, Troy A. Griffitts wrote:
> 
>> Quickly before posting, this data is not entirely accurate.
>> 
>> I've posted this a number of times and hope frontends have taken
>> this to heart.
>> 
>> SWORD has the concept of preparing a text for searching. Modules
>> can add StripFilters to do whatever preparation they want to do for
>> searching. SWModule makes this processing available for not just
>> the module text, but also for any buffer that might want to be
>> prepared exactly the same way (SWModule::StripText). It is highly
>> recommended that frontend developers use this method on the
>> user-entered search term.
>> 
>> http://www.crosswire.org/pipermail/mobile-devel/2010-May/000121.html
>>
>> On 31/08/11 05:55, David Haslam wrote:
>>> Thanks DM.
>>> 
>>> The responses in this thread are really informative. Could we
>>> post them somewhere in the wiki, please?
>>> 
>>> David
>>> 
>>> -- View this message in context:
>>> http://sword-dev.350566.n4.nabble.com/Normalization-tp3779484p3780893.html
>>>
>>> Sent from the SWORD Dev mailing list archive at Nabble.com.
>>> 
>>> _______________________________________________ sword-devel
>>> mailing list: sword-devel at crosswire.org 
>>> http://www.crosswire.org/mailman/listinfo/sword-devel 
>>> Instructions to unsubscribe/change your settings at above page
