<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>So, as a side note to this thread,</p>
<p>The Sahidic Bible is maintained at coptot.manuscriptroom.com:</p>
<p><a class="moz-txt-link-freetext" href="http://coptot.manuscriptroom.com/transcribing?docID=1620025&userName=PUBLISHED">http://coptot.manuscriptroom.com/transcribing?docID=1620025&userName=PUBLISHED</a></p>
<p> and we regularly export from there and import into swordweb,
which is used for their browser plugin (first link on Christian
Askeland's wonderful resource list for Coptic):</p>
<p><a class="moz-txt-link-freetext" href="https://sites.google.com/site/askelandchristian/copticlinks">https://sites.google.com/site/askelandchristian/copticlinks</a></p>
<p>We don't index the text. They typically search with regex (and
yes, they know about the {byte_count} anomaly with our regex
search).</p>
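<p>For anyone who has not run into it, here is a rough sketch of what a
byte-count effect looks like, assuming the anomaly means that a regex
quantifier such as <code>{3}</code> counts UTF-8 bytes rather than letters.
That reading is an assumption, and the snippet uses Python's <code>re</code>
module purely as a stand-in, not the SWORD regex search itself.</p>

```python
import re

# Three Coptic letters from the U+2C80 block: alfa, vida, gamma.
word = "\u2c81\u2c83\u2c85"
encoded = word.encode("utf-8")

# Each of these letters occupies three bytes in UTF-8.
bytes_per_letter = [len(ch.encode("utf-8")) for ch in word]

# Character-aware matching: {3} means three letters.
char_match = re.fullmatch(r".{3}", word)

# Byte-oriented matching: {3} covers only one letter; the whole word needs {9}.
byte_match_3 = re.fullmatch(rb".{3}", encoded, re.DOTALL)
byte_match_9 = re.fullmatch(rb".{9}", encoded, re.DOTALL)

print(bytes_per_letter)
print(char_match is not None, byte_match_3 is not None, byte_match_9 is not None)
```

<p>Under that assumption, someone searching for a three-letter Coptic word
with a byte-oriented engine would have to write the quantifier as the byte
length, not the letter count.</p>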
<p>-Troy</p>
<p><br>
</p>
<br>
<div class="moz-cite-prefix">On 04/26/2017 03:21 PM, DM Smith wrote:<br>
</div>
<blockquote type="cite"
cite="mid:9E0CC3C8-CA45-4C81-A1C0-1962CD77ECA8@crosswire.org">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Consider using Luke to analyze the constructed Lucene index. See: <a
href="https://code.google.com/archive/p/luke/" class=""
moz-do-not-send="true">https://code.google.com/archive/p/luke/</a>
<div class="">I think you’ll need one that matches Lucene 1.9.1.
Maybe 1.4.x.</div>
<div class=""><br class="">
</div>
<div class="">DM</div>
<div class=""><br class="">
</div>
<div class=""><br class="">
<div>
<blockquote type="cite" class="">
<div class="">On Apr 26, 2017, at 3:48 PM, David Haslam <<a
href="mailto:dfhmch@googlemail.com" class=""
moz-do-not-send="true">dfhmch@googlemail.com</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class="">If you examine the result preview pane in
the Xiphos Advanced Search dialog,<br class="">
the problem becomes apparent.<br class="">
<br class="">
Most Coptic Unicode characters are not displayed
correctly.<br class="">
<br class="">
<br class="">
<br class="">
The remainder seem to have been converted to U+FFFD
REPLACEMENT CHARACTER.<br class="">
<br class="">
i.e. All these Coptic letters are basically not handled
aright by this part<br class="">
of the software:<br class="">
<br class="">
U+2C81<span class="Apple-tab-span" style="white-space:pre">        </span>ⲁ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER ALFA<br class="">
U+2C83<span class="Apple-tab-span" style="white-space:pre">        </span>ⲃ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER VIDA<br class="">
U+2C85<span class="Apple-tab-span" style="white-space:pre">        </span>ⲅ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER GAMMA<br class="">
U+2C87<span class="Apple-tab-span" style="white-space:pre">        </span>ⲇ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER DALDA<br class="">
U+2C89<span class="Apple-tab-span" style="white-space:pre">        </span>ⲉ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER EIE<br class="">
U+2C8B<span class="Apple-tab-span" style="white-space:pre">        </span>ⲋ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER SOU<br class="">
U+2C8D<span class="Apple-tab-span" style="white-space:pre">        </span>ⲍ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER ZATA<br class="">
U+2C8F<span class="Apple-tab-span" style="white-space:pre">        </span>ⲏ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER HATE<br class="">
U+2C91<span class="Apple-tab-span" style="white-space:pre">        </span>ⲑ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER THETHE<br class="">
U+2C93<span class="Apple-tab-span" style="white-space:pre">        </span>ⲓ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER IAUDA<br class="">
U+2C95<span class="Apple-tab-span" style="white-space:pre">        </span>ⲕ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER KAPA<br class="">
U+2C97<span class="Apple-tab-span" style="white-space:pre">        </span>ⲗ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER LAULA<br class="">
U+2C99<span class="Apple-tab-span" style="white-space:pre">        </span>ⲙ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER MI<br class="">
U+2C9B<span class="Apple-tab-span" style="white-space:pre">        </span>ⲛ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER NI<br class="">
U+2C9D<span class="Apple-tab-span" style="white-space:pre">        </span>ⲝ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER KSI<br class="">
U+2C9F<span class="Apple-tab-span" style="white-space:pre">        </span>ⲟ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER O<br class="">
U+2CA1<span class="Apple-tab-span" style="white-space:pre">        </span>ⲡ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER PI<br class="">
U+2CA3<span class="Apple-tab-span" style="white-space:pre">        </span>ⲣ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER RO<br class="">
U+2CA5<span class="Apple-tab-span" style="white-space:pre">        </span>ⲥ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER SIMA<br class="">
U+2CA7<span class="Apple-tab-span" style="white-space:pre">        </span>ⲧ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER TAU<br class="">
U+2CA9<span class="Apple-tab-span" style="white-space:pre">        </span>ⲩ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER UA<br class="">
U+2CAB<span class="Apple-tab-span" style="white-space:pre">        </span>ⲫ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER FI<br class="">
U+2CAD<span class="Apple-tab-span" style="white-space:pre">        </span>ⲭ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER KHI<br class="">
U+2CAF<span class="Apple-tab-span" style="white-space:pre">        </span>ⲯ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER PSI<br class="">
U+2CB1<span class="Apple-tab-span" style="white-space:pre">        </span>ⲱ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER OOU<br class="">
U+2CC1<span class="Apple-tab-span" style="white-space:pre">        </span>ⳁ<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SMALL LETTER SAMPI<br class="">
U+2CE8<span class="Apple-tab-span" style="white-space:pre">        </span>⳨<span class="Apple-tab-span" style="white-space:pre">        </span>COPTIC
SYMBOL TAU RO<br class="">
<br class="">
Only the few Coptic letters in the block U+03E2 to
U+03EF are displayed<br class="">
aright.<br class="">
<br class="">
It's no wonder that a search has so many spurious
results if most of the<br class="">
search space has been squashed into Unicode replacement
characters.<br class="">
<br class="">
I'm a Windows user, as most of you know already.<br
class="">
Does the same thing happen in Xiphos under Linux?<br
class="">
<br class="">
Is this an issue common to all SWORD based front-ends?<br
class="">
The fact that we see similar results in PocketSword
strongly suggests it is.<br class="">
<br class="">
Best regards,<br class="">
<br class="">
David<br class="">
<br class="">
<br class="">
<br class="">
--<br class="">
View this message in context: <a
href="http://sword-dev.350566.n4.nabble.com/Lucene-search-index-and-Coptic-tp4657103p4657106.html"
class="" moz-do-not-send="true">http://sword-dev.350566.n4.nabble.com/Lucene-search-index-and-Coptic-tp4657103p4657106.html</a><br
class="">
Sent from the SWORD Dev mailing list archive at <a
href="http://Nabble.com" class=""
moz-do-not-send="true">Nabble.com</a>.<br class="">
<br class="">
_______________________________________________<br
class="">
sword-devel mailing list: <a
href="mailto:sword-devel@crosswire.org" class=""
moz-do-not-send="true">sword-devel@crosswire.org</a><br
class="">
<a
href="http://www.crosswire.org/mailman/listinfo/sword-devel"
class="" moz-do-not-send="true">http://www.crosswire.org/mailman/listinfo/sword-devel</a><br
class="">
Instructions to unsubscribe/change your settings at
above page</div>
</div>
</blockquote>
</div>
<br class="">
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
sword-devel mailing list: <a class="moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a>
<a class="moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page</pre>
</blockquote>
<br>
</body>
</html>