[jsword-devel] A smaller Chinese Lucene Analyzer

Martin Denham mjdenham at gmail.com
Fri Nov 12 03:03:23 MST 2010


For the moment I think I will remove Chinese search from the next release,
try to understand it better, and then add it back in a later release.
If I just go with the first analyzer that seems to work, it would be
difficult to change analyzers later, if necessary, because doing so would
invalidate existing indexes.  However, I have added index.properties to
downloaded indexes, so maybe I should record the analyzer there too.
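
For example, index.properties could record how the index was built, so a
later release could detect a mismatch and rebuild rather than return wrong
results.  The key names below are purely illustrative, not an existing
JSword format:

```properties
# Hypothetical additions to index.properties (key names are illustrative only).
# Record which analyzer built this index so a future version can refuse
# to search, or trigger a rebuild, when the installed analyzer differs.
Analyzer=org.apache.lucene.analysis.cjk.CJKAnalyzer
AnalyzerVersion=3.0.3
```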

Matthew has been helping me test Chinese search and here is a recent
comment:

"I wonder why there needs to be an analyzer exactly? Can a search not be
simply performed based on single unicode characters? What is the analyzer
doing, anyway? I understand that an analyzer would be useful for the program
to know what characters are used together as "words", but is it really
necessary when single characters can be looked up?

Also, something that comes to mind: If no analyzer is used, but instead
searches are done on single characters, would it not be possible for the
user to use an "AND" search function (ex., 神 AND 爱, for ideas contained in a
verse; or, 厌 AND 恶, for a "word" contained in a verse) to link multiple
characters together? I don't quite understand why any dictionary file would
be needed in such a case... ?"
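
Matthew's single-character idea can be sketched in plain Java, without
Lucene, just to show the mechanics: index each Han character as its own
token, and an AND search reduces to "the verse contains every query
character".  The class and method names here are hypothetical, not JSword
or Lucene API:

```java
import java.util.HashSet;
import java.util.Set;

public class UnigramSearch {
    // Split a string into its individual Han characters ("unigrams").
    static Set<String> unigrams(String text) {
        Set<String> tokens = new HashSet<>();
        text.codePoints()
            .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // An AND search over unigrams: every query character must occur in the verse.
    static boolean matchesAll(String verse, String queryChars) {
        return unigrams(verse).containsAll(unigrams(queryChars));
    }

    public static void main(String[] args) {
        String verse = "神爱世人"; // fragment of John 3:16, "God loved the world"
        System.out.println(matchesAll(verse, "神爱")); // true: both characters occur
        System.out.println(matchesAll(verse, "厌恶")); // false: neither occurs
    }
}
```

The trade-off, which is what a word-aware analyzer addresses, is that
unigram matching finds verses where the characters occur separately, not
only where they form the intended word.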


Kind regards
Martin

On 11 November 2010 23:38, DM Smith <dmsmith at crosswire.org> wrote:

> Also, there is a difference between simplified and traditional Chinese.
> Make sure that you are using the correct characters.
>
> -- DM
>
> On Nov 11, 2010, at 6:37 PM, DM Smith wrote:
>
> One of the current shortcomings of JSword is that it does not normalize to
> NFC before indexing.  (There are multiple equivalent Unicode representations
> of the same text; unless the index and the search request agree on which one
> is used, you don't get the right answer.)
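
The JDK's own java.text.Normalizer is enough to illustrate the point.
This is only a sketch of where normalization would have to happen (on both
the indexing and the query side), not JSword's actual indexing code:

```java
import java.text.Normalizer;

public class NfcDemo {
    // Normalize to NFC before indexing and again before parsing the query,
    // so both sides agree on one canonical representation.
    static String toIndexForm(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' + combining acute accent (NFD form)
        String composed = "\u00E9";    // precomposed 'é' (NFC form)
        System.out.println(decomposed.equals(composed));              // false: raw strings differ
        System.out.println(toIndexForm(decomposed).equals(composed)); // true after normalization
    }
}
```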
>
> If you are able to view the text and copy from it, copy a word or two and
> then paste that into the search box.  If that works, then you are hit with
> this problem.  BTW, I know it is a problem with Farsi, and probably with
> most
>
> -- DM
>
> On Nov 11, 2010, at 6:29 PM, Martin Denham wrote:
>
> I tried switching to cn.ChineseAnalyzer, and this one has no
> out-of-memory problems, but it does not return any results.  It returns
> almost instantly, as if it is doing nothing.
>
> I regenerated the index and then tried to search for 'Mark' in Chinese,
> which I pasted in from BibleNames_zh.properties, and got no results after 0
> seconds.
>
> I might try the CJKAnalyzer tomorrow.
>
> Best regards
> Martin
>
>
> On 11 November 2010 22:16, DM Smith <dmsmith at crosswire.org> wrote:
>
>> Martin,
>>
>> In the lucene-analyzers jar, try either (abbreviating
>> org.apache.lucene.analysis as o.a.l.a):
>> o.a.l.a.cn.ChineseAnalyzer or o.a.l.a.cjk.CJKAnalyzer
>> The latter indexes bigrams and thus produces a bigger index.
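
A rough illustration of why the bigram approach inflates the index: each
character lands in up to two overlapping tokens.  This is plain Java
mimicking the token shape, not the CJKAnalyzer API itself:

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigrams {
    // Emit overlapping two-character tokens, the shape CJKAnalyzer produces
    // for runs of CJK text.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four characters yield three tokens; each inner character is stored twice.
        System.out.println(bigrams("神爱世人")); // [神爱, 爱世, 世人]
    }
}
```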
>>
>> Hope this helps.
>>
>> In Him,
>>        DM
>>
>> On Nov 11, 2010, at 3:54 PM, Martin Denham wrote:
>>
>> > Does anybody know if there is a Chinese Lucene Analyzer that is more
>> lightweight than smartcn, or if it is possible to configure smartcn to use
>> less memory?
>> >
>> > The Smart Chinese Analyzer will not run on Android because it attempts to
>> load a large dictionary in order to split phrases, and runs out of memory.
>> Here is a stack trace:
>> >
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):
>> java.lang.ExceptionInInitializerError
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process(HHMMSegmenter.java:201)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence(WordSegmenter.java:50)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken(WordTokenFilter.java:69)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:53)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:87)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:61)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:599)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1449)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1337)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.crosswire.jsword.index.lucene.LuceneIndex.find(Unknown Source)
>> > <deleted a bit of the stack trace here>
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): Caused by:
>> java.lang.OutOfMemoryError
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> java.lang.reflect.Array.newInstance(Array.java:492)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> java.io.ObjectInputStream.readNewArray(ObjectInputStream.java:1637)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> java.io.ObjectInputStream.readNonPrimitiveContent(ObjectInputStream.java:927)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:2285)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:2240)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.loadFromInputStream(BigramDictionary.java:99)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.load(BigramDictionary.java:120)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.getInstance(BigramDictionary.java:71)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at
>> org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.<clinit>(BiSegGraph.java:46)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     ... 35 more
>> >
>> > For now I will have to disable searching in Chinese texts.
>> >
>> > Kind regards
>> > Martin
>> >
>> >
>> > _______________________________________________
>> > jsword-devel mailing list
>> > jsword-devel at crosswire.org
>> > http://www.crosswire.org/mailman/listinfo/jsword-devel