[sword-devel] Thai and Lucene
Chris Little
chrislit at crosswire.org
Tue Feb 15 02:28:42 MST 2005
Adrian Korten wrote:
> g'day,
>
> I've been wondering whether Thai would benefit from Lucene. Even if it
> does support utf-8, I doubt that Lucene supports Thai when no word
> breaks are provided. Even if it had smarts to handle Thai word-breaking
> like ICU, it would stumble over the Biblical words. Soooo, I haven't
> tried it.
Hopefully someone who actually knows what Lucene indexes will answer
this better (and especially correct me if I'm wrong), but I expect
Lucene would benefit Thai searching somewhat because it can search
within words, not just on full words. (By 'words' here, I'm using the
definition of "words" in French: anything with whitespace on both sides.)
We also probably could pass text through the ICU Thai word-break
iterator to add surrounding whitespace before we hand it to the Lucene
indexer. Does anyone more knowledgeable know whether that would work (on the
Lucene side)?
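
A rough sketch of the pre-segmentation step described above, not tested against any SWORD module. It uses the JDK's java.text.BreakIterator as a stand-in for the ICU word-break iterator (ICU4J's com.ibm.icu.text.BreakIterator has the same shape). The class and method names (WordBreakSpacer, spaceOut) are made up for illustration. Whether the JDK's Thai segmentation matches ICU's, and how either copes with Biblical names, is an assumption to verify. It says nothing about the Lucene side of the question.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper: rewrites text so every segment found by the
// word-break iterator is separated by exactly one space.
class WordBreakSpacer {

    static String spaceOut(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);

        StringBuilder out = new StringBuilder();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String segment = text.substring(start, end);
            if (segment.trim().isEmpty()) {
                continue; // skip whitespace already present in the input
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(segment); // punctuation comes out as its own segment
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Pass a line of Thai text as the first argument to see how it is split.
        String input = args.length > 0 ? args[0] : "In the  beginning";
        Locale locale = args.length > 0 ? Locale.forLanguageTag("th") : Locale.ENGLISH;
        System.out.println(spaceOut(input, locale));
    }
}
```

The output of spaceOut would then be handed to the indexer in place of the raw verse text, so a whitespace-based tokenizer sees one token per segment.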
> Is Lucene indexing primarily aimed at speeding up access to OSIS coded
> text files? Or would it also work with the other formats? I've kept the
> Thai modules in 'gbf' format to keep the file sizes down and search
> speeds slightly faster.
Indexing works on Bible modules, regardless of format. Commentaries
should work too. GenBooks didn't work last I tried and I haven't tried
any dictionaries.
--Chris