[sword-devel] search failing in Hebrew modules
Troy A. Griffitts
scribe at crosswire.org
Thu Jul 30 21:25:22 MST 2009
Regarding languages with diacritics, accents, cantillation, etc...
The SWORD M.O. is to have one set of StripFilters that massage both:
o the body of the text being searched
o the target search string
so we can get sane results.
With Greek we've been fairly intentional about stripping accents and ms
markup from both the module text and the search input. I would bet we
still have some last-minute code somewhere that does special things when
we're in a Greek text; obviously this should be remedied. I doubt we've
done the same for Hebrew. E.g., I would bet unaccented Greek searches
work fine in SWORDWeb, but consonant-only Hebrew searches do not. In any
case, the proper way to make things work is to have appropriate
StripFilter entries in the wlc.conf, and to be sure Xiphos calls
module.StripText(userInputSearchText) before calling SWORD's search
mechanism, so that we're comparing equivalent texts.
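
To illustrate the idea of running the same stripping step over both sides
of the comparison, here is a minimal Python sketch. It is not SWORD's
StripFilter code: strip_marks and find_matches are hypothetical names, and
the sketch assumes that dropping all Unicode combining marks is an
acceptable stand-in for what the real filters do with accents, vowel
points and cantillation.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks (Greek accents, Hebrew points, cantillation).

    Assumption: treating every Unicode "Mark" category character as
    strippable approximates the filters discussed above.
    """
    # Decompose first so precomposed letters (e.g. Greek omicron with
    # tonos) split into a base letter plus a separate combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def find_matches(verses, query):
    """Return the keys of verses whose stripped body contains the stripped query.

    verses is a hypothetical mapping of reference -> verse text.
    """
    needle = strip_marks(query)  # same massage applied to the search string
    return [
        ref for ref, body in verses.items()
        if needle in strip_marks(body)  # and to the text being searched
    ]


if __name__ == "__main__":
    # Fully pointed "hashamayim", written with escapes to avoid
    # right-to-left rendering problems in source listings.
    verses = {
        "Gen.1.1": "\u05d4\u05b7\u05e9\u05bc\u05c1\u05b8\u05de\u05b7\u05d9\u05b4\u05dd",
    }
    consonants_only = "\u05e9\u05de\u05d9\u05dd"
    print(find_matches(verses, consonants_only))
```

Because both the verse text and the query pass through strip_marks, a
consonant-only query and a fully pointed query reduce to the same string
before they are compared.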
Does this make sense?
-Troy.
Troy A. Griffitts wrote:
> SWORDWeb seems to work fine. I'd appreciate it if we could have
> constructive, factual input instead of useless statements like "it's
> SWORD's fault". Thanks.
>
> http://crosswire.org/study/wordsearchresults.jsp?searchTerm=שָׁמָיִם
>
> Anyone willing to put the time into investigating if proper UTF-8 is
> being sent into the SWORD engine from the copy and paste from Xiphos?
>
> -Troy.
>
>
>
> Matthew Talbert wrote:
>>> I don't know for sure if this is the same bug, but I know that CLucene
>>> has severe issues (read: complete inability) with Unicode support. If
>>> you are using a CLucene indexed module, this could definitely be a
>>> contributing factor to the problem. In BibleTime we don't use SWORD's
>>> search features, we re-implement that ourselves with CLucene, and our
>>> result is a similar problem with Unicode modules that have indices.
>>
>> The searches work nearly the same for indexed and non-indexed
>> searches, so it's SWORD, not clucene. I would be interested in hearing
>> what Unicode issues clucene has. The only one I recall is the
>> inability to prefix a search with a wildcard (which is very useful for
>> languages such as French).
>>
>> Matthew
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page