[jsword-devel] A smaller Chinese Lucene Analyzer

DM Smith dmsmith at crosswire.org
Thu Nov 11 16:38:59 MST 2010


Also, there is a difference between simplified and traditional Chinese. Make sure that you are using the correct characters.

-- DM

On Nov 11, 2010, at 6:37 PM, DM Smith wrote:

> One of the current shortcomings of JSword is that it does not normalize text to NFC before indexing. (The same text can have multiple equivalent Unicode representations, and unless the index and the search request agree on which one is used, you don't get the right answer.)
> 
> If you are able to view the text and copy from it, copy a word or two and then paste that into the search box. If that works, then you are being hit by this problem. BTW, I know it is a problem with Farsi. And probably with most 
> 
> -- DM
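
The NFC point above can be illustrated with a minimal sketch. It assumes the JDK's java.text.Normalizer is available on the target platform, and NfcText is a hypothetical helper name, not something that exists in JSword or Lucene. The idea is only that both the text being indexed and the search request pass through the same normalization, so that equivalent Unicode representations compare equal.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of JSword or Lucene.
final class NfcText {
    private NfcText() {}

    /** Returns the NFC (canonically composed) form of the input, or null for null. */
    static String toNfc(String text) {
        if (text == null) {
            return null;
        }
        // Skip the conversion when the text is already in NFC.
        return Normalizer.isNormalized(text, Normalizer.Form.NFC)
                ? text
                : Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        String composed = "\u00e9";    // precomposed e-acute

        // The raw strings differ even though they render identically.
        System.out.println(decomposed.equals(composed));               // false
        // After normalizing both sides, they match.
        System.out.println(toNfc(decomposed).equals(toNfc(composed))); // true
    }
}
```

In a real index, something like toNfc would have to be applied both when the text is indexed and when the query string is parsed; normalizing only one side would not help.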
> 
> On Nov 11, 2010, at 6:29 PM, Martin Denham wrote:
> 
>> I tried switching to the cn.ChineseAnalyzer, and this one has no out-of-memory problems, but it does not return any results. It returns instantly, almost as if it is doing nothing.
>> 
>> I regenerated the index, then tried to search for 'Mark' in Chinese, which I pasted in from BibleNames_zh.properties, and got no results after 0 seconds.
>> 
>> I might try the CJKAnalyzer tomorrow.
>> 
>> Best regards
>> Martin
>> 
>> 
>> On 11 November 2010 22:16, DM Smith <dmsmith at crosswire.org> wrote:
>> Martin,
>> 
>> In the lucene-analyzers jar try either: (let org.apache.lucene.analysis be o.a.l.a)
>> o.a.l.a.cn.ChineseAnalyzer or o.a.l.a.cjk.CJKAnalyzer
>> The latter searches bigrams and thus has a bigger index size.
>> 
>> Hope this helps.
>> 
>> In Him,
>>        DM
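
To show why bigrams make for a larger index, here is a rough sketch of overlapping bigram tokenization in plain Java. It is not the CJKAnalyzer implementation and does not use the Lucene API; BigramSketch and bigrams are hypothetical names, and the sketch ignores everything a real analyzer does with non-CJK text, offsets and stop words. It only shows that each character ends up in up to two distinct two-character terms, so the term dictionary grows compared with indexing one character at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlapping bigrams; not Lucene code.
final class BigramSketch {
    private BigramSketch() {}

    /** Splits text into overlapping pairs of code points: "abcd" gives ab, bc, cd. */
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        int[] codePoints = text.codePoints().toArray();
        if (codePoints.length == 1) {
            // A lone character cannot form a pair, so emit it as-is.
            terms.add(text);
            return terms;
        }
        for (int i = 0; i + 1 < codePoints.length; i++) {
            // Build a term from the code points at positions i and i + 1.
            terms.add(new String(codePoints, i, 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Four CJK characters (simplified Chinese for "Gospel of Mark"),
        // written as escapes so the source file encoding does not matter.
        String text = "\u9a6c\u53ef\u798f\u97f3";
        // Four characters yield three overlapping two-character terms.
        System.out.println(bigrams(text));
    }
}
```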
>> 
>> On Nov 11, 2010, at 3:54 PM, Martin Denham wrote:
>> 
>> > Does anybody know if there is a Chinese Lucene Analyzer that is more lightweight than smartcn or if it is possible to configure smartcn to use less memory?
>> >
>> > Smart Chinese Analyzer will not run on Android because it attempts to load up a large dictionary in order to split phrases and runs out of memory.  Here is a stack trace:
>> >
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): java.lang.ExceptionInInitializerError
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process(HHMMSegmenter.java:201)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence(WordSegmenter.java:50)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken(WordTokenFilter.java:69)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:53)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:87)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:61)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:599)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1449)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1337)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.crosswire.jsword.index.lucene.LuceneIndex.find(Unknown Source)
>> > <deleted a bit of the stack trace here>
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): Caused by: java.lang.OutOfMemoryError
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at java.lang.reflect.Array.newInstance(Array.java:492)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at java.io.ObjectInputStream.readNewArray(ObjectInputStream.java:1637)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at java.io.ObjectInputStream.readNonPrimitiveContent(ObjectInputStream.java:927)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:2285)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:2240)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.loadFromInputStream(BigramDictionary.java:99)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.load(BigramDictionary.java:120)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.getInstance(BigramDictionary.java:71)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     at org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.<clinit>(BiSegGraph.java:46)
>> > 11-11 20:38:28.296: ERROR/AndroidRuntime(8925):     ... 35 more
>> >
>> > For now I will have to disable searching in Chinese texts.
>> >
>> > Kind regards
>> > Martin
>> >
>> >
>> > _______________________________________________
>> > jsword-devel mailing list
>> > jsword-devel at crosswire.org
>> > http://www.crosswire.org/mailman/listinfo/jsword-devel
>> 
>> 
> 


