[sword-devel] Codepages (Was: SWORD ISO CDROM IMAGE)

Kristof Petr sword-devel@crosswire.org
Tue, 08 Feb 2000 15:50:50 +0100


"Troy A. Griffitts" wrote:

[..]

> would like to look into such a filter, you will find other filters as
> good examples in sword/src/modules/filters.  A simple const char
> array[256] filled with appropriate translation should work great (or
> maybe I don't understand CZ encoding-- is it 2-byte?)

[..]

O.K.

I will try to explain (in simple form) some problems related to international
texts. Maybe it will help someone in the future.

There is the well-known ASCII: a single-byte character set of 128 characters
(codes 0-127). It is fine for English texts, but other languages use special
characters (= characters with added components like caron, acute, ...).

So the "International Organization for Standardization" defined new character
sets. These describe different sets of 256 single-byte characters, called
ISO-8859-x, where the number x refers to a group of countries:

ISO8859-1 covers Western Europe: German, French, Italian, ...
ISO8859-2 covers Central Europe: Croatian, Polish, Hungarian, Czech, ...
ISO8859-5 covers Cyrillic
ISO8859-7 covers Greek

After that a certain (unnamed evil imperialist American software monopoly)
company took these character sets, changed the positions of some characters, and
used them in its OS-like systems. ISO8859-1 became CP1252, ISO8859-2 -> CP1250,
etc. The result is pretty damned: you cannot exchange text files between
different systems without re-encoding.

For example, the capital letter S with caron (Š) has decimal position 138 in
CP1250 but 169 in ISO8859-2. The situation is similar across all the character
sets: sometimes only unimportant characters were moved, sometimes many.
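Troy's suggestion above (a const char array[256] translation filter) would look
roughly like this. A minimal sketch: only the one mapping mentioned above
(S with caron, 138 -> 169) is filled in; a real CP1250 -> ISO8859-2 table would
have to remap every differing position, and the names here are made up for
illustration.

```cpp
#include <cassert>
#include <string>

// 256-entry lookup table: indexed by the source (CP1250) byte,
// giving the target (ISO8859-2) byte.
static unsigned char cp1250_to_iso88592[256];

static void initTable() {
    for (int i = 0; i < 256; ++i)
        cp1250_to_iso88592[i] = (unsigned char)i;  // identity by default
    cp1250_to_iso88592[138] = 169;  // capital S with caron (only example mapping)
}

// Translate a whole buffer byte by byte through the table.
std::string convert(const std::string &in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in)
        out += (char)cp1250_to_iso88592[c];
    return out;
}
```

Because the table is a plain 256-byte array, the filter stays O(1) per
character and needs no library support at all.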

The short-sighted way is to modify the software to do the conversion in its own
proprietary way. Glibc provides functions to convert between codepages, for
example. But this approach is obsolete: it brings more complex problems in some
cases.
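The glibc functions referred to here are the iconv(3) family. A minimal sketch
of such an in-program conversion (error handling simplified, and the helper
name toUtf8 is made up for illustration):

```cpp
#include <iconv.h>
#include <cassert>
#include <stdexcept>
#include <string>

// Convert a buffer from the given charset to UTF-8 using glibc's iconv(3).
std::string toUtf8(const std::string &in, const char *fromCharset) {
    iconv_t cd = iconv_open("UTF-8", fromCharset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported charset");

    std::string out(in.size() * 4, '\0');     // UTF-8 needs at most 4x the bytes
    char *inp = const_cast<char *>(in.data());
    size_t inLeft = in.size();
    char *outp = &out[0];
    size_t outLeft = out.size();

    if (iconv(cd, &inp, &inLeft, &outp, &outLeft) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("conversion failed");
    }
    iconv_close(cd);
    out.resize(out.size() - outLeft);          // keep only the bytes written
    return out;
}
```

Note that every program doing this privately is exactly the "proprietary"
duplication complained about above; doing the conversion once, at module build
time, avoids it.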

The suggested way is Unicode. Windoze, the Linux kernel, glibc, Qt-2.0, KDE2,
Mozilla, XFree86 4.0 -- all of them use Unicode as their native format, AFAIK.

If you use Unicode, you will have
- one and the same module for all platforms (Win, Unix, Mac, ...)
- one and the same font for all platforms (nice for Greek, Hebrew, ...)

and it will simplify life greatly.

Converting texts to Unicode is extremely easy: just pipe the file through a
filter, for example GNU recode. You only need to know the original encoding:
cat file | recode CPxxx..UTF-8 > /dev/console
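The UTF-8 bytes that recode produces are also easy to compute by hand. A
minimal sketch of the encoder for code points below U+0800, which covers the
whole ISO-8859-x repertoire (the function name utf8 is made up for
illustration):

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code point below U+0800 as UTF-8 bytes.
std::string utf8(unsigned cp) {
    std::string s;
    if (cp < 0x80) {
        s += (char)cp;                    // 1 byte:  0xxxxxxx
    } else {
        s += (char)(0xC0 | (cp >> 6));    // 2 bytes: 110xxxxx
        s += (char)(0x80 | (cp & 0x3F));  //          10xxxxxx
    }
    return s;
}
```

So the Š from the example above (U+0160) becomes the two bytes 0xC5 0xA0,
identically on every platform -- which is the whole point.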

But I expect the real problem is somewhere else: is it possible to switch sword
and the sword utilities to Unicode in the future? You did mention the good
design of the sword architecture. ;-)

Troy, I'm not trying to deflate your work. Users who had problems with
incorrect modules on Linux (including me) hacked their own versions and are
happy.

Petr

Some links, if someone is interested:
http://www.cl.cam.ac.uk/~mgk25/unicode.html <UTF-8 and Unicode FAQ for Unix/Linux>

http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html <Unicode fonts and tools for X11>
http://www.whizkidtech.net/i18n/ <Whiz Kid Technomagic i18n Tools>