[sword-devel] search failing in Hebrew modules

Troy A. Griffitts scribe at crosswire.org
Thu Jul 30 21:25:22 MST 2009


Regarding languages with diacritics, accents, cantillation, etc...

The SWORD M.O. is to have one set of StripFilters that massage both:
o	the body of the text being searched
o	the target search string

so we can get sane results.

With Greek we've been fairly intentional about stripping accents and
manuscript markup from both the module text and the search input.  I
would bet we still have some last-minute code somewhere that does
special things when we're in a Greek text; obviously this should be
remedied.  I doubt we've done the same for Hebrew.  For example, I would
bet unaccented Greek searches work fine in SWORDWeb, but consonant-only
Hebrew searches do not.  In any case, the proper way to make things work
is to have appropriate StripFilter entries in wlc.conf, and to be sure
Xiphos calls module.StripText(userInputSearchText) before calling
SWORD's search mechanism, so that we're comparing equivalent texts.
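
To make the idea concrete, here is a rough sketch in Python rather than
SWORD's C++.  It is not SWORD's StripFilter code; strip_marks() and
search() are hypothetical names, and I am assuming that dropping every
Unicode nonspacing mark (category Mn) after NFD normalization is an
acceptable stand-in for what the Greek and Hebrew strip filters do.  The
point is only that the same function runs over both the body text and
the search string before they are compared.

```python
import unicodedata


def strip_marks(text):
    """Drop nonspacing combining marks (accents, vowel points, cantillation).

    Assumption: this approximates a strip filter; it is not SWORD's code.
    """
    # NFD splits precomposed letters (e.g. Greek omicron with tonos)
    # into a base letter followed by combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def search(verses, query):
    """Return keys of verses whose stripped body contains the stripped query."""
    needle = strip_marks(query)
    return [ref for ref, body in verses.items() if needle in strip_marks(body)]


if __name__ == "__main__":
    # Hypothetical two-entry "module": one pointed Hebrew word, one Greek word.
    verses = {
        "hebrew": "\u05e9\u05b8\u05c1\u05de\u05b8\u05d9\u05b4\u05dd",
        "greek": "\u03bb\u03cc\u03b3\u03bf\u03c2",
    }
    # Consonant-only Hebrew query and unaccented Greek query.
    print(search(verses, "\u05e9\u05de\u05d9\u05dd"))
    print(search(verses, "\u03bb\u03bf\u03b3\u03bf\u03c2"))
```

If only the body or only the query went through strip_marks(), the
consonant-only Hebrew query would not match the pointed text, which is
the failure I suspect here.  Note that Hebrew punctuation such as maqaf
and sof pasuq is not category Mn, so this sketch leaves it in place.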

Does this make sense?

	-Troy.





Troy A. Griffitts wrote:
> SWORDWeb seems to work fine.  I'd appreciate it if we could have 
> constructive, factual input instead of useless statements like "it's 
> SWORD's fault".  Thanks.
> 
> http://crosswire.org/study/wordsearchresults.jsp?searchTerm=שָׁמָיִם
> 
> Is anyone willing to put the time into investigating whether proper 
> UTF-8 is being sent into the SWORD engine when text is copied and 
> pasted from Xiphos?
> 
>     -Troy.
> 
> 
> 
> Matthew Talbert wrote:
>>> I don't know for sure if this is the same bug, but I know that CLucene
>>> has severe issues (read: complete inability) with Unicode support.  If
>>> you are using a CLucene indexed module, this could definitely be a
>>> contributing factor to the problem.  In BibleTime we don't use SWORD's
>>> search features, we re-implement that ourselves with CLucene, and our
>>> result is a similar problem with Unicode modules that have indices.
>>
>> The searches work nearly the same for indexed and non-indexed
>> searches, so it's SWORD, not CLucene. I would be interested in hearing
>> what Unicode issues CLucene has. The only one I recall is the
>> inability to prefix a search with a wildcard (which is very useful for
>> languages such as French).
>>
>> Matthew
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
