[sword-devel] Lucene search index and Coptic ?

Troy A. Griffitts scribe at crosswire.org
Wed Apr 26 16:35:31 MST 2017


So, as a side note to this thread,

The Sahidic Bible is maintained at coptot.manuscriptroom.com:

http://coptot.manuscriptroom.com/transcribing?docID=1620025&userName=PUBLISHED

and we regularly export from there and import into swordweb, which is
used for their browser plugin (first link on Christian Askeland's
wonderful resource list for Coptic):

https://sites.google.com/site/askelandchristian/copticlinks

We don't index the text.  They typically search with regex (and yes,
they know about the {byte_count} anomaly with our regex search).
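
For readers unfamiliar with that anomaly, here is a minimal sketch in Python (not SWORD code). It assumes the anomaly comes from the regex engine matching over UTF-8 bytes rather than code points, so a quantifier such as {3} counts bytes. The names utf8_len, word and raw are hypothetical, chosen only for this illustration. Letters in the Coptic block (U+2C80 and up) take three bytes in UTF-8, while the older Coptic letters at U+03E2 to U+03EF take two.

```python
import re

# One letter from each of the two Unicode blocks mentioned in this thread.
alfa = "\u2c81"  # COPTIC SMALL LETTER ALFA (Coptic block)
shei = "\u03e3"  # COPTIC SMALL LETTER SHEI (Greek and Coptic block)


def utf8_len(ch: str) -> int:
    """Return the number of bytes the string occupies in UTF-8."""
    return len(ch.encode("utf-8"))


word = alfa * 3            # three letters
raw = word.encode("utf-8")  # the same word as nine UTF-8 bytes

# Character-oriented matching: {3} means three code points.
char_match = re.fullmatch(r".{3}", word) is not None

# Byte-oriented matching (a stand-in for a byte-based engine, assumed here):
# {3} means three bytes, which covers only the first letter.
byte_match_3 = re.fullmatch(rb".{3}", raw, re.DOTALL) is not None
byte_match_9 = re.fullmatch(rb".{9}", raw, re.DOTALL) is not None

print(utf8_len(alfa), utf8_len(shei))
print(char_match, byte_match_3, byte_match_9)
```

Under that assumption, a user who wants "exactly three Coptic letters" has to write {9} (or {6} for letters from U+03E2 to U+03EF) instead of {3}.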

-Troy



On 04/26/2017 03:21 PM, DM Smith wrote:
> Consider using Luke to analyze the constructed Lucene index.
> See: https://code.google.com/archive/p/luke/
> I think you’ll need one that matches Lucene 1.9.1. Maybe 1.4.x.
>
> DM
>
>
>> On Apr 26, 2017, at 3:48 PM, David Haslam <dfhmch at googlemail.com
>> <mailto:dfhmch at googlemail.com>> wrote:
>>
>> If you examine the result preview pane in the Xiphos Advanced Search
>> dialog,
>> the problem becomes apparent.
>>
>> Most Coptic Unicode characters are not displayed correctly.
>>
>>
>>
>> The remainder seem to have been converted to U+FFFD REPLACEMENT
>> CHARACTER.
>>
>> i.e. All these Coptic letters are basically not handled aright by
>> this part
>> of the software:
>>
>> U+2C81  ⲁ  COPTIC SMALL LETTER ALFA
>> U+2C83  ⲃ  COPTIC SMALL LETTER VIDA
>> U+2C85  ⲅ  COPTIC SMALL LETTER GAMMA
>> U+2C87  ⲇ  COPTIC SMALL LETTER DALDA
>> U+2C89  ⲉ  COPTIC SMALL LETTER EIE
>> U+2C8B  ⲋ  COPTIC SMALL LETTER SOU
>> U+2C8D  ⲍ  COPTIC SMALL LETTER ZATA
>> U+2C8F  ⲏ  COPTIC SMALL LETTER HATE
>> U+2C91  ⲑ  COPTIC SMALL LETTER THETHE
>> U+2C93  ⲓ  COPTIC SMALL LETTER IAUDA
>> U+2C95  ⲕ  COPTIC SMALL LETTER KAPA
>> U+2C97  ⲗ  COPTIC SMALL LETTER LAULA
>> U+2C99  ⲙ  COPTIC SMALL LETTER MI
>> U+2C9B  ⲛ  COPTIC SMALL LETTER NI
>> U+2C9D  ⲝ  COPTIC SMALL LETTER KSI
>> U+2C9F  ⲟ  COPTIC SMALL LETTER O
>> U+2CA1  ⲡ  COPTIC SMALL LETTER PI
>> U+2CA3  ⲣ  COPTIC SMALL LETTER RO
>> U+2CA5  ⲥ  COPTIC SMALL LETTER SIMA
>> U+2CA7  ⲧ  COPTIC SMALL LETTER TAU
>> U+2CA9  ⲩ  COPTIC SMALL LETTER UA
>> U+2CAB  ⲫ  COPTIC SMALL LETTER FI
>> U+2CAD  ⲭ  COPTIC SMALL LETTER KHI
>> U+2CAF  ⲯ  COPTIC SMALL LETTER PSI
>> U+2CB1  ⲱ  COPTIC SMALL LETTER OOU
>> U+2CC1  ⳁ  COPTIC SMALL LETTER SAMPI
>> U+2CE8  ⳨  COPTIC SYMBOL TAU RO
>>
>> Only the few Coptic letters in the block U+03E2 to U+03EF are displayed
>> aright.
>>
>> It's no wonder that a search has so many spurious results if most of the
>> search space has been squashed into Unicode replacement characters.
>>
>> I'm a Windows user, as most of you know already.
>> Does the same thing happen in Xiphos under Linux?
>>
>> Is this an issue common to all SWORD based front-ends?
>> The fact that we see similar results in PocketSword strongly suggests
>> it is.
>>
>> Best regards,
>>
>> David
>>
>>
>>
>> --
>> View this message in context:
>> http://sword-dev.350566.n4.nabble.com/Lucene-search-index-and-Coptic-tp4657103p4657106.html
>> Sent from the SWORD Dev mailing list archive at Nabble.com
>> <http://Nabble.com>.
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> <mailto:sword-devel at crosswire.org>
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
