[sword-devel] GlobalOptionFilter=UTF8GreekAccents
DM Smith
dmsmith at crosswire.org
Mon Mar 17 16:21:49 EDT 2025
This search result is exactly what I’d expect with the current code as it has been described and shows that the filter is being used.
The search request is normalized to remove all the accents and also the U+2019. The text is normalized in the same fashion. So it works with or without U+2019. It might do other normalizations.
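To make the normalization concrete, here is a minimal Python sketch of the behaviour described above. It is not SWORD's filter code: the function name `strip_accents` and the `drop_elision` flag are hypothetical, and it assumes that "removing accents" amounts to dropping Unicode combining marks after canonical decomposition.

```python
import unicodedata

ELISION = "\u2019"  # RIGHT SINGLE QUOTATION MARK, used here as the elision marker


def strip_accents(text: str, drop_elision: bool = True) -> str:
    """Hypothetical stand-in for the filter's effect, not SWORD's implementation."""
    # Decompose so that each accent or breathing becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (accents, breathings, iota subscript).
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: the current filter also drops U+2019.
    if drop_elision:
        bare = bare.replace(ELISION, "")
    return unicodedata.normalize("NFC", bare)


print(strip_accents("δι\u2019 ἡμερῶν"))                      # δι ημερων
print(strip_accents("δι\u2019 ἡμερῶν", drop_elision=False))  # δι’ ημερων
```

Under these assumptions, all three search keys reported in this thread reduce to the same string, which would explain identical results.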
This is with both slow and fast search. BTW, Lucene can do exact phrase search. So I presume that you mean slow search.
Edit the conf to remove the filter and try it again. I expect the fully accented search to work, but probably not the others.
If you delete the Lucene index and rebuild it after making the conf change, it might no longer work as expected. I think Lucene has its own normalizers, so it might work. I’m pretty sure it doesn’t have a custom Greek analyzer but uses a Latin-1 analyzer.
If the code change is made then I expect that U+2019 would be required in the search, in the same fashion that only proper spelling is found, as it’d be treated as a letter.
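A rough Python sketch of that expectation, assuming a plain substring match stands in for the slow search. The names `fold` and `phrase_found` are hypothetical and nothing here is taken from the SWORD sources; the sample text is only the phrase discussed in this thread.

```python
import unicodedata


def fold(text: str, keep_elision: bool) -> str:
    """Remove combining marks; optionally keep U+2019 (hypothetical helper)."""
    bare = "".join(
        ch for ch in unicodedata.normalize("NFD", text)
        if not unicodedata.combining(ch)
    )
    return bare if keep_elision else bare.replace("\u2019", "")


def phrase_found(query: str, text: str, keep_elision: bool) -> bool:
    # The query and the text must be folded in exactly the same way.
    return fold(query, keep_elision) in fold(text, keep_elision)


sample = "δι\u2019 ἡμερῶν"  # the phrase discussed in this thread

print(phrase_found("δι ημερων", sample, keep_elision=False))        # True
print(phrase_found("δι ημερων", sample, keep_elision=True))         # False
print(phrase_found("δι\u2019 ημερων", sample, keep_elision=True))   # True
```

With U+2019 retained, the key without the elision mark no longer matches, much as a misspelled word would not.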
The Xiphos display weirdness is a separate issue.
Let me note that JSword only has Lucene searches and uses Lucene's Greek analyzer, which does noise word (aka stop word) elimination, stemming, case folding, accent elimination, Unicode normalization and maybe more. The analyzer is used both for the search request and for building the index.
DM
> On Mar 17, 2025, at 3:31 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>
> btw. The same 2 results were obtained by a search for "δι ημερων".
>
> i.e. Without the U+2019 in the search key.
>
> Best regards,
>
> David
>
> Sent with Proton Mail <https://pr.tn/ref/SWXT9A5YZ67G> secure email.
>
> On Monday, March 17th, 2025 at 6:46 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>> Hi DM,
>>
>> With Xiphos 4.3.1 (latest update) when I searched TischMorph either for "δι’ ἡμερῶν", or for "δι’ ημερων", there were 2 results:
>> Mark 2:1
>> Acts 1:3
>> Search results were no different with the Greek Accents on or off. I therefore conclude that your hunch was incorrect!
>>
>> Aside:
>> After an exact phrase search, both results preview correctly.
>> After a Lucene fast search, both results preview really weirdly <https://www.dropbox.com/scl/fi/msw6s8dl4au5z0optwm5l/Screenshot-2025-03-17-18.43.04.png?rlkey=wps1isdrh9h1atdck6r7ihbol&dl=0> & weirdly <https://www.dropbox.com/scl/fi/4aiyelopdy1a1gjlpto5f/Screenshot-2025-03-17-18.44.12.png?rlkey=bc1qmql18faoti9b6o6o27qeu&dl=0> !!! I think this should be reported to Karl K. Might it be a software bug?
>>
>> Best regards,
>>
>> David
>>
>>
>> On Monday, March 17th, 2025 at 6:17 PM, DM Smith <dmsmith at crosswire.org> wrote:
>>> David,
>>> I’m not sure that the filter is only used for display. I think it may also be used for search. In Ancient Greek, we don’t want to have to include U+2019 as part of the search request, but just the letters.
>>>
>>> As a reader of NT Greek, it doesn’t bother me to have δ αρχαια rather than δ’ αρχαια.
>>>
>>> BTW, if the filter’s code is changed and if the filter is used for searches, then all indexes of accented NT Greek modules will need to be rebuilt. The user’s search request has to be normalized in exactly the same way as the index was constructed.
>>>
>>> DM
>>>
>>>> On Mar 17, 2025, at 11:44 AM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>
>>>> Hi DM,
>>>>
>>>> One impact is on the StatResGNT module, in which both single and double left/right quotation marks have been added by the project leader.
>>>> Hiding Greek Accents has the bad effect of losing the end quotation mark for all the level 2 quotations in the text.
>>>> NB. It was seeing this project that prompted me to revisit this topic.
>>>> It would be a real benefit to this module to make the change that I proposed.
>>>>
>>>> Further to my initial thoughts late last week, I now agree that U+2019 is the right codepoint choice to mark an elision.
>>>> I was somewhat misled by the wrong answer given by Leo AI, which mistakenly told me that it was a way to represent the iota subscript.
>>>> It's only since quizzing Grok AI that my thoughts have become clear. I admit that I should've known better, but I'm not a classicist.
>>>> Yet the "category mistake" still exists - since an elision marker is not a diacritic. And by definition, a Greek Accent is a diacritic!
>>>>
>>>> Making the proposed change to the filter should have a minimal effect upon all the other Ancient Greek Bible modules.
>>>> The number of words thus affected in a Greek NT module is not huge!
>>>> There's really no downside to still displaying the "typographical apostrophe".
>>>>
>>>> To illustrate, these are the only 21 words in TischMorph that end with U+2019.
>>>> Word Count
>>>> Δι’ 2
>>>> Κατ’ 1
>>>> δ’ 22
>>>> δι’ 142
>>>> καθ’ 61
>>>> κατ’ 82
>>>> μεθ’ 43
>>>> μετ’ 132
>>>> μηδ’ 1
>>>> οὐδ’ 8
>>>> παρ’ 59
>>>> τοῦτ’ 17
>>>> ἀλλ’ 220
>>>> ἀνθ’ 5
>>>> ἀπ’ 119
>>>> ἀφ’ 44
>>>> Ἀλλ’ 1
>>>> ἐπ’ 143
>>>> ἐφ’ 82
>>>> ὑπ’ 25
>>>> ὑφ’ 9
>>>>
>>>> It's now my considered view that even when the Greek accents are hidden by the filter, the elision marks ought to be retained.
>>>>
>>>> Best regards,
>>>>
>>>> David
>>>>
>>>>
>>>> On Monday, March 17th, 2025 at 3:06 PM, DM Smith <dmsmith at crosswire.org> wrote:
>>>>> David, I read your Grok 3 analysis.
>>>>>
>>>>> What is the impact of not having this change? What is the impact of making the change? Is it merely presentation or is there an issue with searching too?
>>>>>
>>>>> I’ve also been reading https://corp.unicode.org/pipermail/unicode/2019-January/007563.html which was referenced in a prior recent thread on U+2019 in Ancient Greek. This is long and worth reading to understand how it might impact SWORD. The thread is initiated by James Tauber.
>>>>>
>>>>> TL;DR:
>>>>> U+2019 (and in older texts U+0027) in Ancient Greek was never used for quotations and is only used for elision. It is considered the recommended character for elisions.
>>>>> Under the Unicode TR29 rules (as of January 2019, when the thread was written), U+2019 is a word break when at the start or end of a word, but not within a word. It is not simply punctuation. These rules are not language aware.
>>>>> There is no zero width character in Unicode to join words.
>>>>> It is impossible for TR29 to distinguish between U+2019 used as a quotation mark and as an elision.
>>>>> There is no other character that is an appropriate replacement for U+2019.
>>>>>
>>>>> I haven’t yet looked at Unicode TR30 regarding folding rules as it pertains to this.
>>>>>
>>>>> In Him,
>>>>> DM
>>>>>
>>>>>
>>>>>> On Mar 17, 2025, at 8:46 AM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>>
>>>>>> Dear SWORD developers,
>>>>>>
>>>>>> I asked about this topic several years ago, and I'm no longer convinced by what we were told back then.
>>>>>>
>>>>>> After doing further research, it's my understanding that U+2019 RIGHT SINGLE QUOTATION MARK ought not to be hidden by this SWORD filter.
>>>>>>
>>>>>> This codepoint is not a diacritic that modifies the previous Greek letter. In other words, it's not a Greek accent.
>>>>>> This codepoint has the Unicode properties of a punctuation mark.
>>>>>> In Ancient Greek text, it's used to mark an elision, where the final vowel of a word is omitted when the next word begins with a vowel.
>>>>>>
>>>>>> To view my research, conducted with the help of Grok 3, please visit the following link.
>>>>>> https://grok.com/share/bGVnYWN5_43ff1922-3876-4d9a-9e42-6ae940007fd0
>>>>>>
>>>>>> I therefore recommend that SWORD developers revisit the specification for this filter, and update it so that U+2019 is never hidden.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> David
>>>>>>
>>>>>> _______________________________________________
>>>>>> sword-devel mailing list: sword-devel at crosswire.org
>>>>>> http://crosswire.org/mailman/listinfo/sword-devel
>>>>>> Instructions to unsubscribe/change your settings at above page
>>>>>
>>>>
>>>
>>
>