[sword-devel] Thai and Lucene

DM Smith dmsmith555 at yahoo.com
Tue Feb 15 05:48:46 MST 2005


The analyzer that is used to index a module must also be used to parse 
the search request. The analyzer that Sword currently uses is for 
English. The Lucene distribution also includes analyzers for Russian and 
German, and Lucene's beta sandbox has analyzers for a few other 
languages. If Sword uses different analyzers for different modules, then 
the choice of analyzer will need to be stored against the module (kinda 
like defining a font for a particular module). If indexes are prebuilt 
and downloadable, then recording the analyzer in the module's conf is 
worth considering.
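
To make the "stored against the module" idea concrete, here is a rough 
sketch in Python rather than real Sword or Lucene code. The "Analyzer" 
conf key, the two toy analyzers and the helper names are all made up for 
illustration. The point is only that indexing and searching both look up 
the analyzer from the same module conf, so they cannot drift apart.

```python
# Toy stand-ins for real analyzers: lowercase, split on whitespace,
# drop a tiny language-specific stop word list.
def english_analyzer(text):
    return [w for w in text.lower().split() if w not in {"the", "a", "of"}]

def german_analyzer(text):
    return [w for w in text.lower().split() if w not in {"der", "die", "das"}]

ANALYZERS = {"english": english_analyzer, "german": german_analyzer}

def analyzer_for(module_conf):
    # "Analyzer" is a hypothetical conf key, not an existing Sword setting.
    return ANALYZERS[module_conf.get("Analyzer", "english")]

def build_index(module_conf, verses):
    # Map each analyzed term to the set of references containing it.
    analyze = analyzer_for(module_conf)
    index = {}
    for ref, text in verses.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(ref)
    return index

def search(module_conf, index, query):
    # The query goes through the same analyzer that built the index.
    terms = analyzer_for(module_conf)(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

conf = {"Analyzer": "german"}
index = build_index(conf, {"Joh 3:16": "Also hat Gott die Welt geliebt"})
print(search(conf, index, "die Welt"))  # {'Joh 3:16'}
```

A prebuilt, downloadable index would carry the same conf entry, so the 
frontend would know which analyzer to use for queries.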

The analyzer consists of a tokenizer and various filters (e.g. lowercase 
filter, stop word filter, stemming filter, punctuation filter). These do 
differ by locale, sometimes in subtle ways. One obvious way is that the 
"stop" word list (words that are not indexed) differs by language. So 
pre-filtering the query outside the analyzer would not work: the query 
has to go through the same locale-specific steps as the indexed text.
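
As an illustration only, the shape of an analyzer can be sketched in 
Python as a tokenizer followed by a chain of filters. None of these 
names are Lucene classes, the stop lists are tiny made-up samples, and 
stemming is left out to keep it short.

```python
import string

def whitespace_tokenizer(text):
    return text.split()

def punctuation_filter(tokens):
    # Strip leading/trailing ASCII punctuation and drop empty tokens.
    stripped = [t.strip(string.punctuation) for t in tokens]
    return [t for t in stripped if t]

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(stop_words):
    # Build a filter that drops the given (locale-specific) stop words.
    def apply(tokens):
        return [t for t in tokens if t not in stop_words]
    return apply

def make_analyzer(tokenizer, *filters):
    # An analyzer is the tokenizer plus its filters, applied in order.
    def analyze(text):
        tokens = tokenizer(text)
        for f in filters:
            tokens = f(tokens)
        return tokens
    return analyze

ENGLISH_STOPS = {"the", "a", "and", "of"}   # sample list, not Lucene's
GERMAN_STOPS = {"der", "die", "das", "und"}  # sample list, not Lucene's

english = make_analyzer(whitespace_tokenizer, punctuation_filter,
                        lowercase_filter, stop_filter(ENGLISH_STOPS))
german = make_analyzer(whitespace_tokenizer, punctuation_filter,
                       lowercase_filter, stop_filter(GERMAN_STOPS))

print(german("Die Welt"))   # ['welt']
print(english("Die Welt"))  # ['die', 'welt']
```

The same query yields different terms under the two analyzers. An index 
built with the German analyzer has no entry for "die", so a query that 
was analyzed as English and requires every term would find nothing.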

So the "Lucene" way of doing things is to write analyzers, not 
pre-filters. The analyzers could be written using ICU.
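
For Thai, that would mean doing the word breaking in the tokenizer stage 
of the analyzer, so the index and the query are segmented identically. 
The Python sketch below is not ICU: a greedy longest-match over a tiny, 
hypothetical lexicon stands in for ICU's dictionary-based Thai 
word-break iterator, just to show where the segmentation would sit.

```python
# Hypothetical four-word lexicon; a real analyzer would delegate this
# step to ICU's Thai word-break iterator instead.
LEXICON = {"พระเจ้า", "ทรง", "รัก", "โลก"}
MAX_LEN = max(len(w) for w in LEXICON)

def thai_tokenizer(text):
    """Greedy longest-match segmentation for text without word breaks."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first, then shorter ones.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in LEXICON:
                break
        else:
            # Nothing matched: emit a single character as its own token.
            candidate = text[i]
        tokens.append(candidate)
        i += len(candidate)
    return tokens

print(thai_tokenizer("พระเจ้าทรงรักโลก"))
```

Because the tokenizer is part of the analyzer, a query such as "รัก" is 
broken up by the same rules as the indexed text. Words missing from the 
lexicon fall back to single characters here, which is a crude version of 
the concern about Biblical words raised below.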

Chris Little wrote:

>
>
> Adrian Korten wrote:
>
>> g'day,
>>
>> I've been wondering whether Thai would benefit from Lucene. Even if 
>> it does support utf-8, I doubt that Lucene supports Thai when no word 
>> breaks are provided. Even if it had smarts to handle Thai 
>> word-breaking like ICU, it would stumble over the Biblical words. 
>> Soooo, I haven't tried it.
>
>
> Hopefully someone who actually knows what Lucene indexes will answer 
> this better (and especially correct me if I'm wrong), but I expect 
> Lucene would benefit Thai searching somewhat because it can search 
> within words, not just on full words. (By 'words' here, I'm using the 
> definition of "words" in French: anything with whitespace on both sides.)
>
> We also probably could pass text through the ICU Thai word-break 
> iterator to add surrounding whitespace before we hand it to the Lucene 
> indexer. Does anyone more knowledgeable know whether that would work 
> (on the Lucene side)?
>
>> Is Lucene indexing primarily aimed at speeding up access to OSIS 
>> coded text files? Or would it also work with the other formats? I've 
>> kept the Thai modules in 'gbf' format to keep the file sizes down and 
>> search speeds slightly faster.
>
>
> Indexing works on Bible modules, regardless of format. Commentaries 
> should work too. GenBooks didn't work last I tried and I haven't tried 
> any dictionaries.


