For the moment I think I will remove Chinese search from the next release, try to understand it better, and then add it back in a later release. If I simply adopted the first analyzer that seems to work, it would be difficult to change analyzers later if necessary, because doing so would invalidate existing indexes. However, I have added index.properties to downloaded indexes, so maybe I should record the analyzer there as well.<div>
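If the analyzer name does go into index.properties, recording and reading it back is straightforward with java.util.Properties. This is only a sketch: the key name "analyzer.class" below is my suggestion, not an existing JSword property.

```java
import java.io.*;
import java.util.Properties;

public class IndexProps {
    // Hypothetical key name; the keys JSword actually uses in index.properties may differ.
    static final String ANALYZER_KEY = "analyzer.class";

    // Record which analyzer built this index, preserving any existing entries.
    static void recordAnalyzer(File propsFile, String analyzerClass) throws IOException {
        Properties props = new Properties();
        if (propsFile.exists()) {
            InputStream in = new FileInputStream(propsFile);
            try { props.load(in); } finally { in.close(); }
        }
        props.setProperty(ANALYZER_KEY, analyzerClass);
        OutputStream out = new FileOutputStream(propsFile);
        try { props.store(out, "index metadata"); } finally { out.close(); }
    }

    // Read it back; null means an older index that never recorded an analyzer.
    static String readAnalyzer(File propsFile) throws IOException {
        Properties props = new Properties();
        InputStream in = new FileInputStream(propsFile);
        try { props.load(in); } finally { in.close(); }
        return props.getProperty(ANALYZER_KEY);
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("index", ".properties");
        recordAnalyzer(f, "org.apache.lucene.analysis.cjk.CJKAnalyzer");
        System.out.println(readAnalyzer(f));
    }
}
```

With the analyzer class recorded alongside the index, a later release could notice a mismatch between the recorded analyzer and the current one and trigger a re-index, rather than silently returning wrong results.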
<br></div><div>Matthew has been helping me test Chinese search and here is a recent comment:</div><div><br></div><blockquote class="webkit-indent-blockquote" style="margin: 0 0 0 40px; border: none; padding: 0px;"><div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; ">"I wonder why there needs to be an analyzer exactly? Can a search not be simply performed based on single unicode characters? What is the analyzer doing, anyway? I understand that an analyzer would be useful for the program to know what characters are used together as "words", but is it really necessary when single characters can be looked up?</span></div>
<div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; "><div><br></div></span></div><div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; "><div>
Also, something that comes to mind: If no analyzer is used, but instead searches are done on single characters, would it not be possible for the user to use an "AND" search function (ex., 神 AND 爱, for ideas contained in a verse; or, 厌 AND 恶, for a "word" contained in a verse) to link multiple characters together? I don't quite understand why any dictionary file would be needed in such a case... ?"</div>
</span></div></blockquote><div><span class="Apple-style-span" style="font-family: arial, sans-serif; font-size: 13px; border-collapse: collapse; "><div><br></div><div>Kind regards</div><div>Martin</div></span><br><div class="gmail_quote">
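Matthew's single-character AND idea can be sketched without any analyzer at all. This toy matcher (an illustration only, not JSword code) just checks that every character of the query occurs somewhere in the verse, which is what an AND over per-character index terms amounts to:

```java
public class UnigramAnd {
    /**
     * True if every non-whitespace character of the query occurs in the verse,
     * i.e. an AND over single-character terms.
     */
    static boolean matches(String verse, String query) {
        for (int i = 0; i < query.length(); ) {
            int cp = query.codePointAt(i);
            i += Character.charCount(cp);          // step by code point, not char
            if (Character.isWhitespace(cp)) continue;
            if (verse.indexOf(cp) < 0) return false; // one missing character fails the AND
        }
        return true;
    }

    public static void main(String[] args) {
        String verse = "神爱世人";                  // "God loves the world"
        System.out.println(matches(verse, "神 爱")); // both characters present
        System.out.println(matches(verse, "神 恶")); // second character absent
    }
}
```

So lookup itself does not need a dictionary; the analyzer's job is to decide what the index terms are (single characters, bigrams, or dictionary words), which affects precision, relevance ranking, and index size rather than whether lookup is possible at all.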
On 11 November 2010 23:38, DM Smith <span dir="ltr"><<a href="mailto:dmsmith@crosswire.org">dmsmith@crosswire.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div style="word-wrap:break-word">Also, there is a difference between simplified and traditional Chinese. Make sure that you are using the correct characters.<div><br></div><font color="#888888"><div>-- DM</div></font><div>
<div></div><div class="h5"><div><br><div><div>On Nov 11, 2010, at 6:37 PM, DM Smith wrote:</div><br><blockquote type="cite"><div style="word-wrap:break-word">One of the current shortcomings of JSword is that it does not normalize to NFC before indexing. (There are multiple equivalent Unicode representations of the same text; unless the index and the search request agree on which is used, you don't get the right answer.)<div>
<br></div><div>If you are able to view the text and copy from it, copy a word or two and then paste that into the search box. If that works then you are hit with this problem. BTW, I know it is a problem with Farsi. And probably with most </div>
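The normalization problem DM describes can be reproduced with the JDK's own java.text.Normalizer (available since Java 6). Normalizing both the indexed text and the search request to NFC makes canonically equivalent forms compare equal; this is a minimal demonstration, not JSword's actual indexing code:

```java
import java.text.Normalizer;

public class NfcDemo {
    // Normalize to NFC so that canonically equivalent Unicode sequences
    // (precomposed vs. base character + combining mark) compare equal.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";     // é as a single precomposed code point
        String decomposed = "e\u0301";  // e followed by a combining acute accent
        System.out.println(composed.equals(decomposed));           // false: different code points
        System.out.println(nfc(composed).equals(nfc(decomposed))); // true after normalization
    }
}
```

Applying the same normalization at index time and at query time is what makes the copy-and-paste test above stop mattering: both sides end up in the same form.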
<div><br></div><div>-- DM</div><div><br><div><div>On Nov 11, 2010, at 6:29 PM, Martin Denham wrote:</div><br><blockquote type="cite">I tried switching to the cn.ChineseAnalyzer; this one has no out-of-memory problems, but it does not return any results. It returns almost instantly, as if it is doing nothing.<div>
<br></div><div>I regenerated the index, then tried to search for 'Mark' in Chinese, which I pasted in from BibleNames_zh.properties, and got no results after 0 seconds.<div>
<br></div><div>I might try the CJKAnalyzer tomorrow.</div><div><br></div><div>Best regards</div><div>Martin</div><div><br><br><div class="gmail_quote">On 11 November 2010 22:16, DM Smith <span dir="ltr"><<a href="mailto:dmsmith@crosswire.org" target="_blank">dmsmith@crosswire.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Martin,<br>
<br>
In the lucene-analyzers jar try either: (let org.apache.lucene.analysis be o.a.l.a)<br>
o.a.l.a.cn.ChineseAnalyzer or o.a.l.a.cjk.CJKAnalyzer<br>
The latter indexes overlapping bigrams and thus has a bigger index size.<br>
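Very roughly, and ignoring stop words, Latin text handling, and token offsets, the two analyzers DM mentions differ like this. This is a plain-Java sketch of the tokenization idea, not the Lucene implementations themselves:

```java
import java.util.ArrayList;
import java.util.List;

public class CjkTokens {
    // Roughly what cn.ChineseAnalyzer emits for Han text: one term per character.
    // (Sketch only: assumes BMP characters, no surrogate pairs.)
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) {
            out.add(text.substring(i, i + 1));
        }
        return out;
    }

    // Roughly what cjk.CJKAnalyzer emits: overlapping two-character terms.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "神爱世人";
        System.out.println(unigrams(text)); // [神, 爱, 世, 人]
        System.out.println(bigrams(text));  // [神爱, 爱世, 世人]
    }
}
```

Each position yields one fewer bigram than unigram, but over a whole Bible the vocabulary of distinct two-character terms is far larger than the set of single characters, hence the bigger index.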
<br>
Hope this helps.<br>
<div><br>
In Him,<br>
DM<br>
<br>
On Nov 11, 2010, at 3:54 PM, Martin Denham wrote:<br>
<br>
</div><div><div></div><div>> Does anybody know if there is a Chinese Lucene Analyzer that is more lightweight than smartcn or if it is possible to configure smartcn to use less memory?<br>
><br>
> Smart Chinese Analyzer will not run on Android because it attempts to load a large dictionary in order to split phrases, and it runs out of memory. Here is a stack trace:<br>
><br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): java.lang.ExceptionInInitializerError<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.cn.smart.hhmm.HHMMSegmenter.process(HHMMSegmenter.java:201)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.cn.smart.WordSegmenter.segmentSentence(WordSegmenter.java:50)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.cn.smart.WordTokenFilter.incrementToken(WordTokenFilter.java:69)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.PorterStemFilter.incrementToken(PorterStemFilter.java:53)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.StopFilter.incrementToken(StopFilter.java:225)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:87)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:61)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:599)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1449)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1337)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.crosswire.jsword.index.lucene.LuceneIndex.find(Unknown Source)<br>
> <deleted a bit of the stack trace here><br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): Caused by: java.lang.OutOfMemoryError<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at java.lang.reflect.Array.newInstance(Array.java:492)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at java.io.ObjectInputStream.readNewArray(ObjectInputStream.java:1637)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at java.io.ObjectInputStream.readNonPrimitiveContent(ObjectInputStream.java:927)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at java.io.ObjectInputStream.readObject(ObjectInputStream.java:2285)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at java.io.ObjectInputStream.readObject(ObjectInputStream.java:2240)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.loadFromInputStream(BigramDictionary.java:99)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.load(BigramDictionary.java:120)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.cn.smart.hhmm.BigramDictionary.getInstance(BigramDictionary.java:71)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): at org.apache.lucene.analysis.cn.smart.hhmm.BiSegGraph.<clinit>(BiSegGraph.java:46)<br>
> 11-11 20:38:28.296: ERROR/AndroidRuntime(8925): ... 35 more<br>
><br>
> For now I will have to disable searching in Chinese texts.<br>
><br>
> Kind regards<br>
> Martin<br>
><br>
><br>
</div></div><div><div></div><div>> _______________________________________________<br>
> jsword-devel mailing list<br>
> <a href="mailto:jsword-devel@crosswire.org" target="_blank">jsword-devel@crosswire.org</a><br>
> <a href="http://www.crosswire.org/mailman/listinfo/jsword-devel" target="_blank">http://www.crosswire.org/mailman/listinfo/jsword-devel</a><br>
<br>
<br>
</div></div></blockquote></div><br>
</div></div>
</blockquote></div><br></div></div>
</blockquote></div><br></div></div></div></div><br>
<br></blockquote></div><br></div>