[sword-devel] Chinese lucene problem

Sun Oct 7 16:21:38 MST 2012

SWORD uses an English analyzer (StandardAnalyzer). It works well for Latin-1 languages and for languages that resemble English in the relevant ways (e.g. spaces between words, phonetic spelling, ...), but it does not do well with others.

The Lucene project has a few Chinese analyzers. Basically they do bi-gram indexing: every pair of adjacent characters is indexed. So the string ABCD would create 3 bi-grams: AB, BC, and CD. One of these analyzers is quite big and it might not be prudent to deliver it as part of the non-Chinese front-end.
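
To make the bi-gram idea concrete, here is a minimal sketch in plain Java. It is not Lucene's code and does not use any Lucene API; the class and method names (BigramSketch, bigrams) are made up for illustration. It only shows how overlapping pairs come out of a string.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramSketch {
    // Hypothetical helper: returns every pair of adjacent characters.
    // Works on Unicode code points so that characters outside the
    // Basic Multilingual Plane are not split in half.
    public static List<String> bigrams(String text) {
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2)); // code points i and i+1
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("ABCD")); // prints [AB, BC, CD]
    }
}
```

A query would be split the same way, so a two-character search term matches wherever that pair occurs in the text, without needing to know where the word boundaries are.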

For JSword, we use the language code supplied in the conf to select the best analyzer. There are specialized analyzers for a dozen languages. Each one of them has peculiarities that the StandardAnalyzer does not address properly. E.g. Thai does not have spaces for word breaks.
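
Roughly, the selection by language code could look like the sketch below. This is not JSword's actual code: the class name, the method name, and the contents of the table are assumptions for illustration, and the analyzers are represented by name only so the example runs without Lucene on the classpath.

```java
import java.util.Locale;
import java.util.Map;

public class AnalyzerChooser {
    // Fallback for languages without a specialized entry.
    static final String DEFAULT = "StandardAnalyzer";

    // Hypothetical table: language code -> analyzer name.
    // The real list in JSword may differ.
    static final Map<String, String> BY_LANGUAGE = Map.of(
        "zh", "CJKAnalyzer",   // bi-gram indexing
        "th", "ThaiAnalyzer"   // no spaces between words
    );

    // Picks an analyzer name for the language code found in the conf.
    public static String analyzerFor(String lang) {
        if (lang == null) {
            return DEFAULT;
        }
        String code = lang.trim().toLowerCase(Locale.ROOT);
        int dash = code.indexOf('-');
        if (dash > 0) {
            code = code.substring(0, dash); // "zh-Hans" -> "zh"
        }
        return BY_LANGUAGE.getOrDefault(code, DEFAULT);
    }

    public static void main(String[] args) {
        System.out.println(analyzerFor("zh")); // prints CJKAnalyzer
        System.out.println(analyzerFor("en")); // prints StandardAnalyzer
    }
}
```

Falling back to the StandardAnalyzer keeps the current behavior for every language that has no specialized entry.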

In Him,
	DM

On Oct 7, 2012, at 6:34 PM, Karl Kleinpaste <karl at kleinpaste.org> wrote:

> We've got a bug report in Xiphos saying that Chinese modules can't be
> searched well with CLucene indices.
> 
> https://sourceforge.net/p/gnomesword/bugs/488/
> 
> I know nothing at all about Chinese, and can't address this.  Can anyone
> supply some info?