[sword-devel] Thai and Lucene
DM Smith
dmsmith555 at yahoo.com
Tue Feb 15 05:48:46 MST 2005
The analyzer used to index a module must also be used to parse the
search request. The analyzer Sword currently uses is for English. The
Lucene distribution includes analyzers for Russian and German, and
Lucene's sandbox has analyzers for a few other languages. If Sword uses
different analyzers for different modules, then the choice of analyzer
will need to be stored with the module (kinda like defining a font for a
particular module). If indexes are prebuilt and downloadable, then
adding it to the conf is a consideration.
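The symmetry point can be sketched without Lucene itself. The toy `analyze` method below (lowercasing plus punctuation stripping, a hypothetical stand-in for a real analyzer chain, not Lucene's API) shows that terms put into the index through an analyzer are only found again if the query goes through the same analyzer:

```java
import java.util.*;

// Illustrative sketch only: a stand-in for an analyzer chain, showing
// why index-time and query-time analysis must match.
public class AnalyzerSymmetry {
    // Toy "analyzer": split on whitespace, lowercase, strip punctuation.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            String t = tok.toLowerCase().replaceAll("\\p{Punct}", "");
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // Index maps each analyzed term to the verses containing it.
    static Map<String, Set<String>> index(Map<String, String> verses) {
        Map<String, Set<String>> idx = new HashMap<>();
        for (Map.Entry<String, String> e : verses.entrySet())
            for (String term : analyze(e.getValue()))
                idx.computeIfAbsent(term, k -> new HashSet<>()).add(e.getKey());
        return idx;
    }

    public static void main(String[] args) {
        Map<String, String> verses = new HashMap<>();
        verses.put("John 11:35", "Jesus wept.");
        Map<String, Set<String>> idx = index(verses);

        // Query passed through the same analyzer: hit.
        System.out.println(idx.get(analyze("Wept!").get(0)));
        // Raw, unanalyzed query: miss, since "Wept!" was never indexed.
        System.out.println(idx.get("Wept!"));
    }
}
```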
An analyzer consists of a tokenizer and various filters (e.g. lowercase
filter, stop word filter, stemming filter, punctuation filter). These do
differ by locale, sometimes in subtle ways. One obvious way is that the
"stop" word list (words that are not indexed) differs by language. So
pre-filtering the query would not work.
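As a concrete illustration (the stop lists here are small hand-picked samples, not Lucene's actual lists): the string "die" is a content word in English but a stop word in German, so the same input survives one analyzer's stop filter and not the other's.

```java
import java.util.*;

// Sketch of language-dependent stop-word filtering; the stop lists
// are hand-picked samples, not Lucene's real stop lists.
public class StopWords {
    static final Set<String> EN_STOPS = Set.of("the", "a", "of", "and");
    static final Set<String> DE_STOPS = Set.of("der", "die", "das", "und");

    static List<String> filter(String text, Set<String> stops) {
        List<String> kept = new ArrayList<>();
        for (String tok : text.toLowerCase().split("\\s+"))
            if (!stops.contains(tok)) kept.add(tok);
        return kept;
    }

    public static void main(String[] args) {
        // "die" is dropped by the German stop filter but kept by the
        // English one, so the two indexes hold different terms.
        System.out.println(filter("die Welt", EN_STOPS)); // [die, welt]
        System.out.println(filter("die Welt", DE_STOPS)); // [welt]
    }
}
```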
So the "Lucene" way of doing things is to write analyzers rather than
pre-filters. The analyzers could be written using ICU.
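A sketch of the break-iterator idea, here using the JDK's java.text.BreakIterator (ICU4J's com.ibm.icu.text.BreakIterator has the same shape; whether a given runtime ships Thai word-break data is an assumption): walk the word boundaries and re-join the segments with spaces, so a whitespace-based tokenizer can index text that is written without word breaks.

```java
import java.text.BreakIterator;
import java.util.Locale;
import java.util.StringJoiner;

// Sketch: insert spaces at word boundaries so a whitespace-based
// tokenizer can index text written without word breaks, such as Thai.
// java.text.BreakIterator is used here; ICU4J's
// com.ibm.icu.text.BreakIterator offers the same API.
public class ThaiWordBreak {
    static String addWordBreaks(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        StringJoiner out = new StringJoiner(" ");
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
             start = end, end = words.next()) {
            String segment = text.substring(start, end).trim();
            if (!segment.isEmpty()) out.add(segment);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Segmentation quality depends on the runtime's Thai data, but
        // the transformation itself never loses characters.
        System.out.println(addWordBreaks("สวัสดีครับ", new Locale("th")));
    }
}
```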
Chris Little wrote:
>
>
> Adrian Korten wrote:
>
>> g'day,
>>
>> I've been wondering whether Thai would benefit from Lucene. Even if
>> it does support UTF-8, I doubt that Lucene supports Thai when no word
>> breaks are provided. Even if it had smarts to handle Thai
>> word-breaking like ICU, it would stumble over the Biblical words.
>> Soooo, I haven't tried it.
>
>
> Hopefully someone who actually knows what Lucene indexes will answer
> this better (and especially correct me if I'm wrong), but I expect
> Lucene would benefit Thai searching somewhat because it can search
> within words, not just on full words. (By 'words' here, I'm using the
> definition of "words" in French: anything with whitespace on both sides.)
>
> We also probably could pass text through the ICU Thai word-break
> iterator to add surrounding whitespace before we hand it to the Lucene
> indexer. Anyone more knowledgeable know whether that would work (on
> the Lucene side)?
>
>> Is Lucene indexing primarily aimed at speeding up access to OSIS
>> coded text files? Or would it also work with the other formats? I've
>> kept the Thai modules in 'gbf' format to keep the file sizes down and
>> search speeds slightly faster.
>
>
> Indexing works on Bible modules, regardless of format. Commentaries
> should work too. GenBooks didn't work last I tried and I haven't tried
> any dictionaries.