[sword-devel] Unicode

Sat, 26 May 2001 16:44:44 -0700

Martin asked me my opinion on where we should go with Unicode support,
and I thought it might be worthwhile to get a discussion going
generally.

I'm really not sure what should be done with Unicode.  It seems like
UTF-8 makes certain things very easy, but creates additional problems.

Size is one problem.  Since all accented and umlauted characters require
2 bytes in UTF-8, many roman script texts might see 5-10% increases in
size.  Non-roman scripts always take at least 2, sometimes 3 bytes per
character.  (That's what we've seen--the spec allows up to 6 bytes for a
character that needs the full 31 bits, but 3 bytes handle any 16-bit
character.)  So we can expect Russian and Hebrew texts to grow to nearly
double their former size in UTF-8 over an ISO8859 encoding, since only
spaces and ASCII punctuation stay at 1 byte.  So our options here are:

1. Encode everything in UTF-8. (Probably a bad idea.)
2. Encode with ISO8859-1 (Latin-1) whenever possible, falling back to
UTF-8 only when ISO8859-1 won't work, which avoids the size increase
from accents & umlauts.
3. Encode with whichever ISO8859 encoding or similar 8/16-bit encoding
fits the text, using UTF-8 as a fallback only when none does, which
alleviates many more module size problems.
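
As a rough illustration of option 3, here is a minimal Python sketch that
tries a few compact encodings first and falls back to UTF-8.  The
function name, the candidate list, and its order are assumptions made
for this example, not anything Sword does today.

```python
# Sketch of option 3: try compact legacy encodings first, fall back to UTF-8.
# The candidate list and its order are assumptions, not a Sword convention.
CANDIDATES = ["iso8859_1", "iso8859_5", "iso8859_7", "iso8859_8"]

def pick_encoding(text, candidates=CANDIDATES):
    """Return (codec_name, encoded_bytes) for the first codec that fits."""
    for codec in candidates:
        try:
            return codec, text.encode(codec)
        except UnicodeEncodeError:
            continue  # this codec cannot represent some character
    return "utf-8", text.encode("utf-8")

samples = [
    "G\u00fcte",                                     # German, fits Latin-1
    "\u0412 \u043d\u0430\u0447\u0430\u043b\u0435",   # Russian, fits ISO8859-5
    "\u0710\u0712",                                  # Syriac, no candidate fits
]
for text in samples:
    codec, data = pick_encoding(text)
    print(codec, len(data), "bytes vs", len(text.encode("utf-8")), "in UTF-8")
```

A module built this way would also have to record which encoding was
chosen, so that a frontend knows how to read it back.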

The question is how much processing we are willing to do in Sword to
convert between encodings vs. how large we are willing to allow our
modules to become.  One thing we have in our favor is that all of these
modules can be targeted at Sword 1.5+, so we can compress them.  But a
compressed UTF-8 NA27 is still going to be larger than an NA27 encoded in
ISO8859-7.
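
One way to get a feel for the numbers is to compress the same Greek text
in both encodings.  The sketch below uses Python's zlib purely as a
stand-in for whatever compression a module uses, and the sample is a
short unaccented phrase repeated many times (not NA27 data), so it
compresses far better than real text would.  Treat it as a way to
measure, not as a result.

```python
import zlib

# Unaccented Greek sample (ISO8859-7 has no polytonic accents), repeated to
# give the compressor something to work with.  Not actual NA27 data.
phrase = ("\u03b5\u03bd \u03b1\u03c1\u03c7\u03b7 \u03b7\u03bd "
          "\u03bf \u03bb\u03bf\u03b3\u03bf\u03c2 ")
text = phrase * 200

def sizes(text, codec):
    """Return (raw_size, compressed_size) in bytes for text in codec."""
    raw = text.encode(codec)
    return len(raw), len(zlib.compress(raw, 9))

for codec in ("iso8859_7", "utf-8"):
    raw_len, packed_len = sizes(text, codec)
    print(f"{codec}: {raw_len} raw, {packed_len} compressed")
```

Running the same comparison on a real module text would show how much of
the UTF-8 overhead survives compression.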

The nicest solution may be to allow flexibility for module makers and
frontend makers by supporting texts encoded in UTF-8, ISO8859-x, etc.
and translating to the desired encoding, just as we do with different
markup filters.
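
The idea of translating between encodings can be sketched in a few lines
of Python.  The transcode function here is hypothetical and only shows
the decode/re-encode step; it is not the Sword filter interface.

```python
def transcode(data, source, target, errors="strict"):
    """Decode bytes from the source codec and re-encode them in the target codec."""
    return data.decode(source).encode(target, errors)

# A Latin-1 module text handed to a frontend that wants UTF-8.
stored = "G\u00fcte".encode("iso8859_1")
print(transcode(stored, "iso8859_1", "utf-8"))

# A UTF-8 Hebrew text handed to a frontend that wants ISO8859-8.
hebrew = "\u05d0\u05d1".encode("utf-8")
print(transcode(hebrew, "utf-8", "iso8859_8"))

# Characters the target cannot represent are replaced with "?" here;
# with the default errors="strict" they would raise UnicodeEncodeError.
print(transcode(hebrew, "utf-8", "iso8859_1", errors="replace"))
```

The lossy case at the end is the main design question: a frontend that
asks for an encoding the text does not fit has to either fail or
substitute characters.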

There's a further issue of Unicode's incompleteness.  Harry has
mentioned there are still some issues with Hebrew support in Unicode
3.0.  Very few fonts have even been made yet that support some of the
characters new in Unicode 3.0.  As an example, while making a Peshitta module
last night, I wanted to convert from a custom font encoding over to
UTF-8.  Syriac was only added in Unicode 3.0, so I only found one font
that supports its glyphs.  Even so, it appears that the Syriac
implementation in Unicode 3.0 may be incomplete for the purposes of this
text.
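
For anyone attempting a similar conversion, the general shape is a
lookup table from font byte values to Unicode code points.  In the
Python sketch below the byte values on the left are invented for
illustration and do not describe any real Syriac font; only the Unicode
code points are real.  Bytes with no mapping are collected so that gaps
in coverage show up instead of being silently dropped.

```python
# Hypothetical legacy-font table: the byte values on the left are invented
# for illustration; only the Unicode code points on the right are real.
FONT_TO_UNICODE = {
    0x41: "\u0710",  # SYRIAC LETTER ALAPH
    0x42: "\u0712",  # SYRIAC LETTER BETH
    0x47: "\u0713",  # SYRIAC LETTER GAMAL
    0x20: " ",
}

def font_bytes_to_utf8(data):
    """Map legacy font bytes to UTF-8; also return the bytes with no mapping."""
    chars = []
    unmapped = set()
    for byte in data:
        if byte in FONT_TO_UNICODE:
            chars.append(FONT_TO_UNICODE[byte])
        else:
            unmapped.add(byte)
            chars.append("\ufffd")  # U+FFFD REPLACEMENT CHARACTER marks the gap
    return "".join(chars).encode("utf-8"), sorted(unmapped)

utf8_text, missing = font_bytes_to_utf8(b"AB GZ")
print(len(utf8_text), [hex(b) for b in missing])
```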

Why does it seem that once we scale a tall mountain, we find an even
taller mountain waiting behind it to be conquered as well?

--Chris