[sword-devel] Searching other languages

Chris Little sword-devel@crosswire.org
Thu, 29 May 2003 15:48:16 -0700 (MST)


On Thu, 29 May 2003, Troy A. Griffitts wrote:

> 	Currently the engine does not do MUCH logic when comparing strings in 
> the search.  You can operate on the assumption that all modules are UTF8 
> encoded (though I don't know if absolutely every module is), so sending a 
> UTF8 stream to the search method should produce the appropriate results. 

Lots of modules are still Codepage 1252.  You can use the Latin1UTF8 
filter (or the logic included in it) to convert CP1252 to UTF-8.
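
As a rough illustration of that conversion outside of SWORD, here is a 
minimal Python sketch.  It is not the Latin1UTF8 filter itself; the 
function name cp1252_to_utf8 is made up for this example, and it assumes 
the input really is CP1252.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode CP1252 bytes as UTF-8 bytes (illustrative helper, not a SWORD API)."""
    # CP1252 leaves a few byte values undefined; errors="replace" turns
    # those into U+FFFD instead of raising UnicodeDecodeError.
    text = raw.decode("cp1252", errors="replace")
    return text.encode("utf-8")


# 0xE9 is "e with acute" in CP1252; 0x93 and 0x94 are curly double quotes.
converted = cp1252_to_utf8(b"\x93caf\xe9\x94")
```

The curly quotes are the interesting part: those byte values are 
printable characters in CP1252 but not in ISO 8859-1, so the source 
encoding has to be identified correctly before converting.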

> There will be problems when a character is stored in the module as a
> precomposed character but entered in the search box as a base character
> plus combining characters -- these will not match.  But basically, the
> answer is: pass UTF8 text as the search term.

Make sure your search string is normalized according to form NFC.  (You 
can use ICU for this.  See the UTF8NFC filter for an example of how to 
achieve this.)  All modules OUGHT to be NFC already, but I doubt they are.  
So you might also want to use the UTF8NFC filter as one of your 
stripfilters.
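
To show why the normalization matters, here is a small Python sketch that 
uses the standard unicodedata module rather than ICU or the UTF8NFC 
filter.  The helper names to_nfc and nfc_contains are hypothetical and 
only stand in for "normalize both the module text and the search term 
before comparing".

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text normalized to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def nfc_contains(haystack: str, needle: str) -> bool:
    """Substring test after normalizing both sides to NFC."""
    return to_nfc(needle) in to_nfc(haystack)


precomposed = "caf\u00e9"   # e-acute as a single code point
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two spellings look identical but differ as code point sequences,
# so a plain comparison fails until both are normalized.
raw_match = decomposed in precomposed
normalized_match = nfc_contains(precomposed, decomposed)
```

Here raw_match is False and normalized_match is True, which is the 
mismatch described in the quoted paragraph and the reason to normalize 
both the search string and the module text.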

--Chris