[sword-devel] Lucene search index and Coptic ?
David Haslam
dfhmch at googlemail.com
Wed Apr 26 08:12:46 MST 2017
Comparing the results total (12460) with the number of module verses that
contain any text (14212), the search apparently finds the 10-letter search
key in 87.67% of them. That is clearly a serious matter, one so egregious
that it almost defies rational explanation.
Here's a possible clue.
Taking the /unique/ letters from the example search word, and inserting a
space between each, we get this:
ϩ ⲉ ⲏ ⲙ ⲛ ⲟ ⲡ ⲩ
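For anyone wanting to repeat this with another word, here is a minimal Python sketch of the step just described: collect the distinct letters, sort them by code point, and join them with spaces. The function name and the `placeholder` string are my own assumptions; the placeholder is NOT the actual search word, just a 10-character string made from the same eight letters.

```python
def unique_letter_key(word: str) -> str:
    """Return the distinct characters of `word`, sorted by code point
    and separated by single spaces."""
    return " ".join(sorted(set(word)))


# Hypothetical stand-in for the real search word: the eight letters from
# the key above plus two repeats, giving 10 characters in total.
placeholder = "ϩⲉⲏⲙⲛⲟⲡⲩ" + "ⲉⲛ"

key = unique_letter_key(placeholder)
print(key)
```

Sorting by code point puts the letter from the lower-numbered Unicode block first, which matches the order of the key shown above.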
Using this as the search key, and selecting *multi-word* search type in
Xiphos, I got 9049 results using the Advanced Search dialog.
Now that's only 72.6% of the original number of results, or 63.67% of the
non-empty verses, but it is still a very large share of the module.
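The percentages quoted so far can be rechecked with a few lines of Python. The three counts are the ones reported in this message; the variable names are mine.

```python
non_empty_verses = 14212   # module verses that contain any text
lucene_hits = 12460        # results for the original 10-letter search key
multiword_hits = 9049      # results for the spaced-letter multi-word search


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to 2 places."""
    return round(100 * part / whole, 2)


lucene_share = pct(lucene_hits, non_empty_verses)        # 87.67
multiword_vs_lucene = pct(multiword_hits, lucene_hits)   # 72.62
multiword_share = pct(multiword_hits, non_empty_verses)  # 63.67

print(lucene_share, multiword_vs_lucene, multiword_share)
```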
One further observation is that the results verse list starts in almost the
same way as before.
Genesis 3:10,11,14,15,16,19,20,21,...
However, with such high proportions of the non-empty verse count, this is
not so surprising.
This comparison suggests the following plausible explanation for the weird
result with Lucene.
Is the software used by the Lucene search treating each Coptic letter as a
word?
i.e. just as it would if each Unicode character were an Egyptian hieroglyph
or a Han ideograph.
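To make the conjecture concrete, here is a toy Python model. It is not Lucene's analyzer and makes no claim about how Lucene combines query terms; it only contrasts whitespace tokenization with one-token-per-character tokenization, using an assumed "all query tokens must be present" match rule. The example strings are Latin on purpose, so as not to invent any Coptic text.

```python
def tokenize_words(text: str) -> list[str]:
    """Whitespace tokenization: each run of non-space characters is a token."""
    return text.split()


def tokenize_chars(text: str) -> list[str]:
    """Per-character tokenization: every non-space character is its own token."""
    return [ch for ch in text if not ch.isspace()]


def matches_all(query_tokens: list[str], doc_tokens: list[str]) -> bool:
    """Assumed match rule: every query token occurs somewhere in the document."""
    return set(query_tokens) <= set(doc_tokens)


query = "stone"
verse = "notes on set"   # does not contain the word "stone"

word_level = matches_all(tokenize_words(query), tokenize_words(verse))
char_level = matches_all(tokenize_chars(query), tokenize_chars(verse))

print(word_level)  # False: no token equals "stone"
print(char_level)  # True: the letters s, t, o, n, e all occur in the verse
```

If something like the per-character case were happening, a long search word would match any verse that merely contains its letters, which would go some way towards explaining hit counts this high.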
Maybe this conjecture needs teasing out in further detail, if perhaps only
some of the Coptic Letters are misclassified.
After all, the Coptic letters in the module are from two separate Unicode
blocks.
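A quick way to see which letters come from which block is sketched below in Python. The standard `unicodedata` module has no block lookup, so the two ranges are entered by hand: Greek and Coptic is U+0370..U+03FF and Coptic is U+2C80..U+2CFF. The helper name `block_of` is my own.

```python
import unicodedata

# Hand-entered block ranges (start, end, name).
BLOCKS = [
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x2C80, 0x2CFF, "Coptic"),
]


def block_of(ch: str) -> str:
    """Return the name of the listed block containing `ch`, else 'other'."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"


key = "ϩ ⲉ ⲏ ⲙ ⲛ ⲟ ⲡ ⲩ"
for ch in key.split():
    # Code point, block, general category, and character name for each letter.
    print(f"U+{ord(ch):04X}  {block_of(ch):<16}  "
          f"{unicodedata.category(ch)}  {unicodedata.name(ch, '<unnamed>')}")
```

If the general category comes out as a letter category for every character, then a misclassification, if there is one, would have to come from the tokenizer's own tables rather than from the basic Unicode properties.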
But if this is really the root cause, then it's clearly a critical bug in
the Lucene software.
Can anyone think of a better explanation?
Best regards,
David
--
View this message in context: http://sword-dev.350566.n4.nabble.com/Lucene-search-index-and-Coptic-tp4657103p4657105.html
Sent from the SWORD Dev mailing list archive at Nabble.com.