[sword-devel] GlobalOptionFilter=UTF8GreekAccents and non-Greek modules

Troy A. Griffitts scribe at crosswire.org
Tue Feb 21 16:36:30 MST 2017


Well, hypothetically, we might be able to make a reasonable attempt to 
teach the filter when to strip by determining which adjacent character an 
accent is modifying and conditionally stripping it or not, but 
pragmatically, this filter is used to remove Greek accents while 
searching Greek texts, enabling us to perform accent-insensitive 
searching of the base material.  I suppose it might be useful, if a text 
interspersed occasional Greek, to allow accent-insensitive searching of 
any of those interspersed Greek words, but without a compelling use case 
now, I'm not too inclined to improve the filter to be that much more 
logical.  If someone else sees a compelling need, by all means add the 
extra heuristic logic.  I don't consider this a bug right now.  The 
filter is meant to be used on UTF-8 Greek text.  I certainly accept the 
suggestion that it might be used in other cases, as you suggest, and would 
consider that an improvement, if we ever have a solid use case.
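
A minimal sketch of the heuristic described above, written in Python rather than in the SWORD C++ filter code. It is not the existing filter. The assumptions are: the accent set is the list quoted later in this message, "Greek" means the Greek and Greek Extended blocks, and the names GREEK_ACCENTS, is_greek and strip_greek_accents are hypothetical.

```python
import unicodedata

# Assumption: the accent set is the list quoted later in this thread.
GREEK_ACCENTS = {0x0300, 0x0301, 0x0308, 0x0313, 0x0314,
                 0x0342, 0x0343, 0x0344, 0x0345}


def is_greek(ch):
    """True for code points in the Greek or Greek Extended blocks."""
    cp = ord(ch)
    return 0x0370 <= cp <= 0x03FF or 0x1F00 <= cp <= 0x1FFF


def strip_greek_accents(text):
    """Drop an accent only when the base character it modifies is Greek."""
    out = []
    base_is_greek = False
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # A combining mark modifies the most recent base character.
            if base_is_greek and ord(ch) in GREEK_ACCENTS:
                continue
        else:
            base_is_greek = is_greek(ch)
        out.append(ch)
    # Recompose so that untouched accented letters return to NFC.
    return unicodedata.normalize("NFC", "".join(out))
```

With this sketch, the Greek in a mixed Greek/French string loses its accents while the French keeps them. One side effect remains: the NFD/NFC round trip rewrites any non-Greek text that was stored in decomposed form.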

Troy


On 02/21/2017 03:12 PM, DM Smith wrote:
> Hypothetical: What about mixed language texts such as a Greek/French 
> lexicon?
>
> DM
>
>> On Feb 21, 2017, at 4:56 PM, Troy A. Griffitts <scribe at crosswire.org> 
>> wrote:
>>
>>
>> Simply don't use the UTF-8 Greek Accents filter on non-Greek texts. As 
>> you have discovered, there are accents used in Greek which are also 
>> used in other languages, and the filter will have adverse effects on 
>> text in those languages. The bottom line is simple: only use the UTF-8 
>> Greek Accents filter on UTF-8 Greek texts.
>>
>> Hope this helps.
>>
>> On February 21, 2017 2:45:24 PM MST, David Haslam 
>> <dfhmch at googlemail.com> wrote:
>>
>>     These are the principal diacritics found in Biblical Greek that have to be
>>     removed with a UTF8GreekAccents filter.
>>
>>     The first five are general accents, not particular to Greek.
>>     It's on account of these that the filter should not be applied to non-Greek
>>     text.
>>
>>     U+0300 ̀ COMBINING GRAVE ACCENT
>>     U+0301 ́ COMBINING ACUTE ACCENT
>>     U+0308 ̈ COMBINING DIAERESIS
>>     U+0313 ̓ COMBINING COMMA ABOVE
>>     U+0314 ̔ COMBINING REVERSED COMMA ABOVE
>>     U+0342 ͂ COMBINING GREEK PERISPOMENI
>>     U+0343 ̓ COMBINING GREEK KORONIS
>>     U+0344 ̈́ COMBINING GREEK DIALYTIKA TONOS
>>     U+0345 ͅ COMBINING GREEK YPOGEGRAMMENI
>>
>>     No other diacritics or characters should be removed.
>>     Though there are a few more combining accents in this block, they aren't
>>     really used in Biblical Greek.
>>     I am open to correction on this point.
>>
>>     For example, the right single quotation mark (U+2019) is NOT a diacritic.
>>     It should not be removed.
>>
>>     Before any of these accents can be removed, they must first be separated
>>     from the Greek letters they are combined with.
>>
>>     Although normalization to the decomposed form can produce this effect, as we
>>     have seen already, this can have undesirable side effects on any non-Greek
>>     text in the module that may happen to include combined or unusual
>>     characters.
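
To make the side effect concrete, here is a small Python sketch of the "decompose, then strip" approach. It assumes the accent set is exactly the nine code points listed above. The names ACCENTS and strip_after_nfd are hypothetical and not taken from the SWORD sources.

```python
import unicodedata

# Assumption: the nine combining marks listed in this message.
ACCENTS = {0x0300, 0x0301, 0x0308, 0x0313, 0x0314,
           0x0342, 0x0343, 0x0344, 0x0345}


def strip_after_nfd(text):
    """Decompose the whole text to NFD, then drop every listed accent."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ord(ch) not in ACCENTS)


greek = strip_after_nfd("\u03b8\u03b5\u03cc\u03c2")   # Greek word with a tonos
french = strip_after_nfd("\u00e9l\u00e8ve")           # French word with two accents
```

The Greek word loses its tonos as intended, but the French word also loses its acute and grave accents, because U+0300 and U+0301 are shared between the two scripts. Any other text in the module is left in decomposed form as well.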
>>
>>     It would therefore be more sensible to simply use a comprehensive mapping
>>     table that replaces each possible accented character by the corresponding
>>     letter in the Greek alphabet. In this way the filter can completely avoid
>>     the need to apply any Unicode normalization.
>>
>>     The complete mapping table would have at least 130 rows. It will need to
>>     take into account that there are at least 75 possible combinations of a
>>     letter with two accents. There are none with three.
>>
>>     Any residual combining characters should also be removed, to cover the
>>     possibility that a module may have been intentionally made without
>>     normalizing the Greek source text by default to NFC.
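
A hedged sketch of the mapping-table idea in Python. Instead of a hand-written table, it derives one from the Unicode character database for the Greek and Greek Extended blocks. Those block ranges, the accent set, and the names ACCENTS, build_table, TABLE and strip_with_table are all assumptions made for illustration. The input text itself is never normalized.

```python
import unicodedata

# Assumption: the nine combining marks listed in this message.
ACCENTS = {0x0300, 0x0301, 0x0308, 0x0313, 0x0314,
           0x0342, 0x0343, 0x0344, 0x0345}


def build_table():
    """Map each accented Greek code point to its bare base letter."""
    # Residual combining accents are deleted outright.
    table = {cp: None for cp in ACCENTS}
    greek_blocks = list(range(0x0370, 0x0400)) + list(range(0x1F00, 0x2000))
    for cp in greek_blocks:
        decomposed = unicodedata.normalize("NFD", chr(cp))
        base, marks = decomposed[0], decomposed[1:]
        # Keep only letters whose marks all belong to the accent set.
        if marks and base.isalpha() and all(ord(m) in ACCENTS for m in marks):
            table[cp] = base
    return table


TABLE = build_table()


def strip_with_table(text):
    """Apply the table without normalizing the text itself."""
    return text.translate(TABLE)
```

Precomposed non-Greek letters such as U+00E9 are not in the table, so they pass through unchanged. The rule for residual combining marks still removes U+0300 and U+0301 after any base letter, so non-Greek text stored in decomposed form would still be affected. The size of the generated table, len(TABLE), can be compared against the row estimate above.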
>>
>>     That's my proposal. I can easily create such a mapping table that
>>     programmers can use.
>>     I can also readily test it with a bespoke TextPipe filter.
>>
>>
>>     Best regards,
>>
>>     David
>>
>>
>>
>>
>>
>>     --
>>     View this message in context: http://sword-dev.350566.n4.nabble.com/GlobalOptionFilter-UTF8GreekAccents-and-non-Greek-modules-tp4656719p4656765.html
>>     Sent from the SWORD Dev mailing list archive at Nabble.com.
>>
>>     ------------------------------------------------------------------------
>>
>>     sword-devel mailing list: sword-devel at crosswire.org
>>     http://www.crosswire.org/mailman/listinfo/sword-devel
>>     Instructions to unsubscribe/change your settings at above page
>>