[sword-devel] DevTools:ICU & Normalization?

David Haslam dfhmch at googlemail.com
Wed Oct 12 08:29:06 MST 2011


According to http://crosswire.org/wiki/DevTools:ICU, SWORD makes use of ICU
for casing (used in search), normalization, and script transliteration.

*Which version of Unicode do we employ for normalization to NFC?*

Some composite glyphs that use two combining characters in the *Myanmar*
block normalize differently under the current version of Unicode than they
did under Unicode 3.2.

These are the two combining characters, with code points U+1037 and U+103A:

့ MYANMAR SIGN DOT BELOW
် MYANMAR SIGN ASAT

This pair of combining characters occurs many, many times in the BurJudson
module.

Software that includes normalization should be tested against the official
Unicode Normalization Test
http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt (2.2MB) for the
version of Unicode that it implements.
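
As a rough illustration of such a test, here is a minimal Python sketch that
checks the NFC rules stated in the header of NormalizationTest.txt
(c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5)). The
function names and the SAMPLE lines are my own, written by hand in the file's
format rather than copied from it, and the check uses Python's unicodedata
module, not SWORD or ICU. A real run would feed in the lines of the test file
that matches the Unicode version of the library under test.

```python
import unicodedata

def parse_fields(line):
    """Split one data line into the five strings c1..c5, or return None."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data or data.startswith("@"):   # blank, comment-only, or @Part line
        return None
    columns = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]

def nfc_failures(lines):
    """Return the 1-based numbers of the lines that break the NFC rules."""
    failures = []
    for number, line in enumerate(lines, start=1):
        fields = parse_fields(line)
        if fields is None:
            continue
        c1, c2, c3, c4, c5 = fields
        nfc = lambda s: unicodedata.normalize("NFC", s)
        ok = (c2 == nfc(c1) == nfc(c2) == nfc(c3)) and (c4 == nfc(c4) == nfc(c5))
        if not ok:
            failures.append(number)
    return failures

# Hand-written sample in the test file's format (assumption: not copied from it).
SAMPLE = [
    "# sample data",
    "@Part1 # Character by character test",
    "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE",
    "1000 103A 1037;1000 1037 103A;1000 1037 103A;1000 1037 103A;1000 1037 103A; # KA, ASAT, DOT BELOW",
]

print(nfc_failures(SAMPLE))
```

An empty list means every sample line passed. To check the real file, pass an
open file object for a local copy of NormalizationTest.txt instead of SAMPLE.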

Testing the normalization of the sequence U+1000 U+103A U+1037 with the ICU
Normalization Browser (which uses ICU, the "International Components for
Unicode" library and the most widely used Unicode software library),
we can verify that it does indeed normalize to U+1000 U+1037 U+103A, with
reordering:

See http://bit.ly/nqYzQp.

However, if you run the same test for Unicode 3.2 (released March 2002, and
so almost 10 years out of date), there is no reordering:

See http://bit.ly/orZ7df.

/NB. I used the URL shortener to allow parameters to be passed to the test
page more easily/.

The process of converting a string to NFC or NFD requires a stage called
"canonical ordering", whereby characters are reordered in ascending order
according to their canonical combining class [ccc]. See
http://www.unicode.org/reports/tr15/?win#Description_Norm.

U+103A MYANMAR SIGN ASAT has ccc=9, whereas U+1037 MYANMAR SIGN DOT BELOW
has ccc=7; therefore U+1037 is reordered before U+103A.
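
The same reordering can be reproduced outside ICU. The following is a small
Python sketch using the standard unicodedata module; it reflects whatever
Unicode version that Python build ships (printed on the first line), not
SWORD's ICU. The helper name code_points is my own. The last lines query
Python's frozen Unicode 3.2.0 snapshot (unicodedata.ucd_3_2_0) for the
combining class of U+103A, purely for comparison.

```python
import unicodedata

KA = "\u1000"         # MYANMAR LETTER KA
DOT_BELOW = "\u1037"  # MYANMAR SIGN DOT BELOW
ASAT = "\u103a"       # MYANMAR SIGN ASAT

def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

print("Unicode data version:", unicodedata.unidata_version)

# Canonical combining class of each mark in the current database.
for ch in (DOT_BELOW, ASAT):
    print(code_points(ch), unicodedata.name(ch), "ccc =", unicodedata.combining(ch))

# Canonical ordering sorts adjacent marks by ascending ccc during NFC.
source = KA + ASAT + DOT_BELOW
normalized = unicodedata.normalize("NFC", source)
print(code_points(source), "->", code_points(normalized))

# Same lookup against the Unicode 3.2.0 snapshot bundled with Python.
legacy = unicodedata.ucd_3_2_0
print("Unicode", legacy.unidata_version, "ccc of U+103A:", legacy.combining(ASAT))
```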

The present module BurJudson has SwordVersionDate=2008-03-01. 
It looks very much as if the normalization was done according to Unicode
3.2.

Context:
This question arises in the context of the possibility of creating a new
module from a better source text.
If we use the latest SWORD utilities to make the new module, will it
normalize correctly?

David



