[sword-devel] module making problem - U_INVALID_CHAR_FOUND

Wed Apr 13 15:50:01 MST 2005

Thanks, Chris, for the clarification. I could not find anywhere on the 
website that says Latin-1 means cp1252. Silly me for assuming that it 
meant what the ISO board meant it to be and not what MS co-opted it for.
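
To make the difference concrete, here is a minimal sketch in Python. It uses Python's built-in codecs as a stand-in, which is my assumption; ICU and the Sword library may behave differently. The sample bytes are made up for illustration. It decodes the same bytes both ways and counts how many land in the 128-159 range that ISO-8859-1 does not assign printable characters to.

```python
# Bytes 0x80-0x9F (128-159) are where ISO-8859-1 and cp1252 disagree.
# Illustrative sample: cp1252 curly quotes (0x93, 0x94) around "Hi".
sample = bytes([0x93, 0x48, 0x69, 0x94])

# cp1252 assigns printable characters to most of 0x80-0x9F.
as_cp1252 = sample.decode("cp1252")

# Python's latin-1 codec maps every byte to the code point of the same
# value, so 0x93 and 0x94 become C1 control characters, not quotes.
as_latin1 = sample.decode("latin-1")


def is_c1_control(ch):
    """True if ch falls in U+0080..U+009F, the C1 control range."""
    return 0x80 <= ord(ch) <= 0x9F


# Number of characters that decoded into the C1 control range.
c1_count = sum(is_c1_control(ch) for ch in as_latin1)

# Round trip to UTF-8 only keeps the quotes if we decoded as cp1252.
utf8_from_cp1252 = as_cp1252.encode("utf-8")
```

Decoded as cp1252 the quotes survive the trip to UTF-8; decoded as latin-1 they turn into control characters that nothing will display.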

One of the archived messages mentioned that the filters always go to 
UTF-8 first. Is this the case with imp2ld? I saw that it uses the module 
code to do the writing, but I did not dig further to see where the 
conversion would happen.

I just searched the archives and I see that as early as 2000 there is a 
desire to migrate all modules to UTF-8. After we release JSword 1.0, is 
this something that I can help with?

Chris Little wrote:

> DM Smith wrote:
>
>> I am not entirely sure that it is a bug in ICU. I think it is a 
>> "feature".
>
>
> I didn't say it was a bug, but an error. It is an error message being 
> printed to cerr.
>
> I'm unclear as to WHY ICU is printing an error message, since I can't 
> think of when it would actually get to process data coming from an IMP 
> file. But the export matches the import (for me), so data isn't being 
> mangled. Hence I don't believe it's an issue at all. None of the 
> importers do encoding conversions.
>
>> ICU does not recognize any valid characters in the reserved ranges of 
>> an encoding. (Not sure I am using proper terminology here.) For 
>> example ISO-8859-1 (aka Latin 1) identifies everything between 128 
>> and 159 as undefined. However, this range is used by cp1250 (and 
>> other cp125x and cp1521), which are Microsoft's variants on ISO 8859. 
>> Many people mistakenly refer to cp1250 as Latin-1. It is not.
>>
>> Many of the non-UTF-8 modules contain non-Latin-1 characters. When 
>> converted to UTF-8, those characters will fail to convert. And when 
>> coming back to Latin-1, they will not be present.
>
>
> Sword modules and .conf files come in exactly two different encodings: 
> UTF-8 and Codepage 1252. If a module is encoded as UTF-8, it is noted 
> in the encoding line of the .conf. If there is no encoding line, the 
> module is Codepage 1252.
>
> There are various places in the library where we may refer to Latin-1, 
> but what is always meant is "Codepage 1252" (not "ISO-8859-1"). The 
> same goes for discussion on the list. If we talk about Latin-1 in 
> connection with Sword, we really mean Codepage 1252.
>
>> If we were to identify to the conversion routine what encoding was 
>> used, then it might work. I say might, because I ran across a few 
>> OSes that did not have the MS encodings on them. (e.g. IBM mainframe, 
>> Sun Solaris at least through 7, early versions of Linux [ but have 
>> not looked lately ]).
>
>
> Modern Linux definitely carries CP1252. Many other vendors rename 
> CP1252 to things like "ibm-1252" before using them on their systems. 
> In this case, Sword knows how to convert CP1252 to UTF-8 and can also 
> use ICU (which is also capable of CP1252 conversions). But, again, 
> none of the importers are actually doing encoding conversions.
>
> --Chris
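
The rule Chris describes (a .conf with a UTF-8 encoding line means UTF-8, no encoding line means Codepage 1252) can be sketched roughly as follows. This is only an illustration in Python, not the Sword library's code. The function names are hypothetical, and the exact spelling of the key (`Encoding=UTF-8`) is my assumption about how the encoding line looks. Any value other than UTF-8 is treated as cp1252 here, since the message names only those two encodings.

```python
def module_encoding(conf_text):
    """Return a Python codec name for a module, given its .conf text.

    Hypothetical helper: assumes the encoding line looks like
    'Encoding=UTF-8'. Anything else falls back to cp1252.
    """
    for line in conf_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "encoding":
            if value.strip().upper() == "UTF-8":
                return "utf-8"
    # No encoding line (or a non-UTF-8 value): treat as Codepage 1252.
    return "cp1252"


def to_utf8(raw, conf_text):
    """Decode raw module bytes per the .conf and re-encode as UTF-8."""
    return raw.decode(module_encoding(conf_text)).encode("utf-8")


# Example .conf fragments, made up for illustration.
legacy_conf = "[Example]\nDataPath=./modules/example/\n"
utf8_conf = "[Example]\nDataPath=./modules/example/\nEncoding=UTF-8\n"
```

With the legacy .conf, a byte such as 0x93 is read as a cp1252 left curly quote and comes out as the three-byte UTF-8 sequence for U+201C; with the UTF-8 .conf, the bytes pass through unchanged.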