[sword-devel] Re: Westcott-Hort

Costas I. Stergiou sword-devel@crosswire.org
Mon, 5 Apr 2004 11:57:28 +0300 (EET DST)


Hi Chris,
> In terms of combining characters vs. precomposed, all you really need to
> do is to remember to use a single normalization form.  Unicode sort of
> informally suggests that NFC is best.  W3C specifically recommends using
> NFC (see http://www.w3.org/TR/charmod-norm/).  Roughly, NFC
> normalization consists of taking a string, decomposing all characters,
> then combining any codepoints that can be combined, provided the
> precombined codepoints are not compatibility codepoints.  The way to
> ensure that a string is NFC normalized is to just normalize it with
> something like the uconv program I mentioned.
>
> I really don't know whether Extended Greek is NFC or not.  So the last
> step before creating the Sword module should be normalization.

Actually, NFC is all about precomposed characters. The extended Greek
characters are exactly that: the (precomposed) Greek letters with their
diacriticals. I use icu4j for all my tests and conversions, and when I ask
it to convert a text to NFC it does use the extended Greek characters. So
my almost certain answer is yes (extended Greek is NFC).
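
A quick way to check this claim is to normalize a decomposed sequence and look at what comes back. The sketch below uses Python's standard unicodedata module rather than icu4j (a swapped-in tool, not what I actually ran), and to_nfc is just an illustrative helper name. One caveat it shows: the Greek Extended letters with oxia have canonical mappings to the tonos letters in the basic Greek block, so NFC may move those out of the extended range.

```python
import unicodedata

def to_nfc(text):
    # NFC: canonical decomposition followed by canonical composition.
    return unicodedata.normalize("NFC", text)

# alpha + COMBINING COMMA ABOVE (psili) composes to a Greek Extended letter.
alpha_psili = to_nfc("\u03b1\u0313")
print(f"U+{ord(alpha_psili):04X}", unicodedata.name(alpha_psili))

# Caveat: alpha with oxia (U+1F71) canonically maps to alpha with tonos.
alpha_oxia = to_nfc("\u1f71")
print(f"U+{ord(alpha_oxia):04X}", unicodedata.name(alpha_oxia))
```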

The real problem with most accented Greek texts is that they use some
diacriticals that are not combining diacriticals. The visual result may
be the same, but NFC conversion leaves them as they are. This is wrong,
because there are precomposed characters that would nicely replace them.
The issue is that Unicode provides many ways for Greek text to 'look' the
same. This is what I am trying to correct in some texts (including the WH),
which at some points use diacriticals that are not combining! I think this
is the result of scanning.
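
To make the difference concrete, here is a small sketch (again Python's unicodedata, not icu4j) comparing a spacing psili with the combining one. The choice of capital alpha, and of placing the spacing mark before the letter, are assumptions for illustration only; the variable names are made up.

```python
import unicodedata

SPACING_PSILI = "\u1fbf"    # GREEK PSILI, a stand-alone (spacing) character
COMBINING_PSILI = "\u0313"  # COMBINING COMMA ABOVE
CAPITAL_ALPHA = "\u0391"

def is_combining(ch):
    # A non-zero canonical combining class marks a combining character.
    return unicodedata.combining(ch) != 0

spacing_form = SPACING_PSILI + CAPITAL_ALPHA      # mark typed before the letter
combining_form = CAPITAL_ALPHA + COMBINING_PSILI  # mark follows its base letter

nfc_spacing = unicodedata.normalize("NFC", spacing_form)
nfc_combining = unicodedata.normalize("NFC", combining_form)

# The spacing form stays two characters; the combining form becomes one.
print(len(nfc_spacing), len(nfc_combining))
```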

What I do (and I think it is correct; any thoughts here?) is take the Greek
text, decompose it (icu4j, NFD), replace all non-combining diacriticals
with combining ones (changing their order so they can be normalized
correctly), and convert it to NFC again. The result should be a text with
ONLY extended Greek characters (and NO stand-alone diacriticals AT ALL).
After doing this, an NFC->NFD->NFC round trip gives back the same text.
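
The steps above can be sketched roughly as follows. This is only an illustration in Python's unicodedata, not the icu4j code itself. The mapping table and the name repair_greek are hypothetical, the table covers just four spacing marks, and the sketch assumes the stand-alone diacriticals sit immediately before the letter they belong to, which may not hold for every text.

```python
import unicodedata

# Assumed, incomplete mapping from spacing Greek marks to combining marks.
SPACING_TO_COMBINING = {
    "\u1fbf": "\u0313",  # GREEK PSILI -> COMBINING COMMA ABOVE
    "\u1ffe": "\u0314",  # GREEK DASIA -> COMBINING REVERSED COMMA ABOVE
    "\u0384": "\u0301",  # GREEK TONOS -> COMBINING ACUTE ACCENT
    "\u1fc0": "\u0342",  # GREEK PERISPOMENI -> COMBINING GREEK PERISPOMENI
}

def repair_greek(text):
    chars = unicodedata.normalize("NFD", text)  # step 1: decompose
    out = []
    i = 0
    while i < len(chars):
        # Find a run of spacing diacriticals starting at position i.
        j = i
        while j < len(chars) and chars[j] in SPACING_TO_COMBINING:
            j += 1
        followed_by_letter = (
            j < len(chars) and unicodedata.category(chars[j]).startswith("L")
        )
        if j > i and followed_by_letter:
            # Step 2: emit the letter first, then the combining equivalents.
            out.append(chars[j])
            out.extend(SPACING_TO_COMBINING[c] for c in chars[i:j])
            i = j + 1
        elif j > i:
            # No letter follows: leave the stray marks untouched.
            out.extend(chars[i:j])
            i = j
        else:
            out.append(chars[i])
            i += 1
    return unicodedata.normalize("NFC", "".join(out))  # step 3: recompose

fixed = repair_greek("\u1fbf\u0384\u0391")  # spacing psili + tonos + Alpha
round_trip = unicodedata.normalize("NFC", unicodedata.normalize("NFD", fixed))
print(len(fixed), fixed == round_trip)
```

Marks that are not followed by a letter are left alone here, so a human can still review them. Whether that is the right policy for the WH text is a judgment call.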

Any comments/corrections on the above are highly welcome,
In Christ,
Costas