[sword-devel] module making problem - U_INVALID_CHAR_FOUND
DM Smith
dmsmith555 at gmail.com
Wed Apr 13 15:50:01 MST 2005
Thanks Chris for the clarification. I did not find anywhere on the
website where it is mentioned that Latin-1 means cp1252. Silly me for
assuming that it meant what the ISO board meant it to be and not what MS
co-opted it for.
One of the archived messages mentioned that the filters always go
to UTF-8 first. Is this the case with imp2ld? I saw that it is using the
module code to do the writing, but I did not dig further to see where
any conversion would happen.
I just searched the archives and I see that as early as 2000 there is a
desire to migrate all modules to UTF-8. After we release JSword 1.0, is
this something that I can help with?
Chris Little wrote:
> DM Smith wrote:
>
>> I am not entirely sure that it is a bug in ICU. I think it is a
>> "feature".
>
>
> I didn't say it was a bug, but an error. It is an error message being
> printed to cerr.
>
> I'm unclear as to WHY ICU is printing an error message, since I can't
> think of when it would actually get to process data coming from an IMP
> file. But the export matches the import (for me), so data isn't being
> mangled. Hence I don't believe it's an issue at all. None of the
> importers do encoding conversions.
>
>> ICU does not treat any characters in the reserved ranges of an
>> encoding as valid. (Not sure I am using proper terminology here.) For
>> example ISO-8859-1 (aka Latin 1) identifies everything between 128
>> and 159 as undefined. However, this range is used by cp1250 (and
>> other cp125x and cp1252), which are Microsoft's variants on ISO 8859.
>> Many people mistakenly refer to cp1250 as Latin-1. It is not.
>>
>> Many of the non-UTF-8 modules contain non-Latin-1 characters. When
>> converted to UTF-8, those characters will fail to convert. And when
>> coming back to Latin-1, they will not be present.
>
>
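To make the difference in the 128-159 range concrete, here is a small
Python sketch. It is only an illustration using Python's built-in
codecs, not anything taken from Sword or ICU, and the sample bytes are
made up. Note that Python's "latin-1" codec maps this range to the C1
control codes rather than rejecting it, so a stricter converter such as
ICU may behave differently.

```python
# Bytes 0x80-0x9F are where ISO-8859-1 and cp1252 disagree.
raw = b"\x93quoted\x94"  # made-up sample: curly quotes as cp1252 stores them

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

# cp1252 yields the typographic quotes U+201C and U+201D.
cp1252_points = [hex(ord(c)) for c in as_cp1252 if ord(c) > 0x7F]
# Python's latin-1 codec yields the C1 control codes U+0093 and U+0094.
latin1_points = [hex(ord(c)) for c in as_latin1 if ord(c) > 0x7F]

print(cp1252_points)
print(latin1_points)

# A few bytes are unassigned even in cp1252; 0x81 is one of them,
# and Python's cp1252 codec refuses to decode it.
try:
    b"\x81".decode("cp1252")
    undefined_rejected = False
except UnicodeDecodeError:
    undefined_rejected = True

print(undefined_rejected)
```

The same bytes come out as printable punctuation under cp1252 and as
invisible control codes under Latin-1, which is why the label matters.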
> Sword modules and .conf files come in exactly two different encodings:
> UTF-8 and Codepage 1252. If a module is encoded as UTF-8, it is noted
> in the encoding line of the .conf. If there is no encoding line, the
> module is Codepage 1252.
>
> There are various places in the library where we may refer to Latin-1,
> but what is always meant is "Codepage 1252" (not "ISO-8859-1"). The
> same goes for discussion on the list. If we talk about Latin-1 in
> connection with Sword, we really mean Codepage 1252.
>
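The rule Chris describes can be sketched in a few lines of Python. This
is a hypothetical helper, not the Sword library's own code; the function
name is invented, and I am assuming the encoding line has the form
"Encoding=UTF-8". The .conf is scanned as raw bytes, since its own
encoding is not known until the line has been found.

```python
def module_encoding(conf_bytes: bytes) -> str:
    """Guess a module's encoding from its .conf contents.

    Hypothetical helper: returns "utf-8" only when an encoding line
    says so, and otherwise falls back to "cp1252", the stated default.
    """
    for raw_line in conf_bytes.splitlines():
        key, sep, value = raw_line.partition(b"=")
        # Assumed key name "Encoding"; compared case-insensitively.
        if sep and key.strip().lower() == b"encoding":
            if value.strip().lower() in (b"utf-8", b"utf8"):
                return "utf-8"
    # No encoding line (or a non-UTF-8 value): treat as Codepage 1252.
    return "cp1252"


# Made-up .conf fragments for illustration.
with_line = b"[Example]\nDataPath=./modules/example/\nEncoding=UTF-8\n"
without_line = b"[Example]\nDataPath=./modules/example/\n"

print(module_encoding(with_line))
print(module_encoding(without_line))
```

A value such as "Latin-1" falls through to "cp1252" here, in keeping
with the convention that Latin-1 in Sword really means Codepage 1252.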
>> If we were to identify to the conversion routine what encoding was
>> used, then it might work. I say might, because I ran across a few
>> OSes that did not have the MS encodings on them. (e.g. IBM mainframe,
>> Sun Solaris at least through 7, early versions of Linux [ but have
>> not looked lately ]).
>
>
> Modern Linux definitely carries CP1252. Many other vendors rename
> CP1252 to things like "ibm-1252" before using it on their systems.
> In this case, Sword knows how to convert CP1252 to UTF-8 and can also
> use ICU (which is also capable of CP1252 conversions). But, again,
> none of the importers are actually doing encoding conversions.
>
> --Chris
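For anyone wanting to try the CP1252 to UTF-8 conversion outside of
Sword, a minimal Python sketch follows. It relies on Python's built-in
codecs rather than on Sword's converter or ICU, the function names are
invented for illustration, and the sample text is made up. It also
checks that converting back reproduces the original bytes.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode Codepage 1252 bytes as UTF-8."""
    return data.decode("cp1252").encode("utf-8")


def utf8_to_cp1252(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as Codepage 1252.

    Raises UnicodeEncodeError if a character has no cp1252 equivalent.
    """
    return data.decode("utf-8").encode("cp1252")


# Made-up sample using bytes from the 0x80-0x9F range:
# 0x93/0x94 are curly quotes and 0x96 is an en dash in cp1252.
original = b"\x93In the beginning\x94 \x96 Gen 1:1"

utf8 = cp1252_to_utf8(original)
round_tripped = utf8_to_cp1252(utf8)

print(utf8)
print(round_tripped == original)
```

Decoding with "latin-1" instead of "cp1252" in the first function would
still run in Python, but the quotes would become control codes in the
UTF-8 output, so the source encoding has to be named correctly.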