[sword-devel] module making problem - U_INVALID_CHAR_FOUND

Chris Little chrislit at crosswire.org
Wed Apr 13 06:37:43 MST 2005


DM Smith wrote:
> I am not entirely sure that it is a bug in ICU. I think it is a "feature".

I didn't say it was a bug, but an error. It is an error message being 
printed to cerr.

I'm unclear as to WHY ICU is printing an error message, since I can't 
think of when it would actually get to process data coming from an IMP 
file. But the export matches the import (for me), so data isn't being 
mangled. Hence I don't believe it's an issue at all. None of the 
importers do encoding conversions.

> ICU does not recognize any valid characters in the reserved ranges of an 
> encoding. (Not sure I am using proper terminology here.) For example 
> ISO-8859-1 (aka Latin 1) identifies everything between 128 and 159 as 
> undefined. However, this range is used by cp1250 (and other cp125x such 
> as cp1252), which are Microsoft's variants on ISO8859. Many people 
> mistakenly refer to cp1250 as Latin-1. It is not.
> 
> Many of the non UTF-8 modules contain non Latin-1 characters. When 
> converted to UTF-8, it will fail. And when coming back to Latin-1, it 
> will not be present.

Sword modules and .conf files come in exactly two different encodings: 
UTF-8 and Codepage 1252. If a module is encoded as UTF-8, it is noted in 
the encoding line of the .conf. If there is no encoding line, the module 
is Codepage 1252.
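
As a rough illustration of that rule (a minimal Python sketch, not the 
library's actual code), the choice can be written as below. The function 
name module_encoding is hypothetical, and the exact spelling of the key 
("Encoding=UTF-8") is an assumption here.

```python
def module_encoding(conf_text):
    """Return the codec implied by a module's .conf text.

    Assumption: the encoding line looks like "Encoding=UTF-8".
    Anything else, including a missing line, means Codepage 1252.
    """
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "encoding":
            if value.strip().upper() == "UTF-8":
                return "utf-8"
    # No encoding line (or not UTF-8): treat the module as CP1252.
    return "cp1252"


# Hypothetical .conf contents, for illustration only.
utf8_conf = "[Example]\nDataPath=./modules/example/\nEncoding=UTF-8\n"
legacy_conf = "[Example]\nDataPath=./modules/example/\n"

utf8_choice = module_encoding(utf8_conf)
legacy_choice = module_encoding(legacy_conf)
```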

There are various places in the library where we may refer to Latin-1, 
but what is always meant is "Codepage 1252" (not "ISO-8859-1"). The same 
goes for discussion on the list. If we talk about Latin-1 in connection 
with Sword, we really mean Codepage 1252.
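
The distinction matters only for bytes 0x80 through 0x9F. A small Python 
sketch (using Python's built-in codecs, not Sword or ICU) shows the same 
bytes read both ways; the sample bytes are made up for the example.

```python
# Curly-quoted "Hi" as it would be stored in a CP1252 module.
raw = bytes([0x93, 0x48, 0x69, 0x94])

# Read as CP1252, 0x93 and 0x94 are the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")

# Read as ISO-8859-1, the same bytes land in the C1 control range
# (U+0080..U+009F), which is not printable text.
as_latin1 = raw.decode("iso-8859-1")


def is_c1_control(ch):
    """True if ch falls in the range ISO-8859-1 leaves to control codes."""
    return 0x80 <= ord(ch) <= 0x9F


c1_in_latin1 = [ch for ch in as_latin1 if is_c1_control(ch)]
c1_in_cp1252 = [ch for ch in as_cp1252 if is_c1_control(ch)]
```

Here c1_in_latin1 holds two characters and c1_in_cp1252 is empty, which 
is why picking the wrong label for the source encoding loses the quotes.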

> If we were to identify to the conversion routine what encoding was used, 
> then it might work. I say might, because I ran across a few OSes that 
> did not have the MS encodings on them. (e.g. IBM mainframe, Sun Solaris 
> at least through 7, early versions of Linux [ but have not looked lately 
> ]).

Modern Linux definitely carries CP1252. Many other vendors rename CP1252 
to something like "ibm-1252" before shipping it on their systems. In any 
case, Sword knows how to convert CP1252 to UTF-8 itself and can also use 
ICU (which is also capable of CP1252 conversions). But, again, none of 
the importers are actually doing encoding conversions.
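
For anyone who wants to check a CP1252 to UTF-8 conversion by hand, here 
is a minimal Python sketch. It uses Python's own codecs rather than 
Sword's converter or ICU, and the helper names and sample bytes are 
hypothetical. It also shows the kind of "export matches the import" 
round-trip check I mean.

```python
import codecs


def cp1252_to_utf8(data):
    """Re-encode CP1252 bytes as UTF-8 bytes."""
    return data.decode("cp1252").encode("utf-8")


def utf8_to_cp1252(data):
    """Re-encode UTF-8 bytes as CP1252 bytes."""
    return data.decode("utf-8").encode("cp1252")


# Sample CP1252 text using bytes both inside and outside 0x80..0x9F.
original = b"\x93Gr\xfc\xdfe\x94 \x96 caf\xe9"

converted = cp1252_to_utf8(original)

# If nothing was mangled, converting back gives the identical bytes.
round_trip_ok = utf8_to_cp1252(converted) == original

# Codec names vary by platform; Python resolves its own aliases like this.
canonical = codecs.lookup("windows-1252").name
```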

--Chris


