[jsword-devel] Too good to be true

DM Smith dmsmith555 at yahoo.com
Sun Jun 20 20:03:25 MST 2004


Joe,

The loop to replace bad chars with blanks does not work as expected. 
It took me a while to remember that bytes (8 bits) are not chars (16 
bits in Java).

Anyway, when the encoding of the byte array is UTF-8, each character is 
encoded in one to four bytes. (UTF-8 itself has no byte order; the BE or 
LE, Big/Little Endian, question only comes up with UTF-16 and UTF-32.) 
The upshot is that it is a pain to write a loop that figures out how 
many bytes comprise each character. You cannot merely look for bytes 
that are <32 (and so forth) because Java bytes are signed: a byte that 
is <0 has its high bit set, which means it is a lead or continuation 
byte of a multi-byte character and must not be treated on its own.
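
To illustrate the signed-byte point, here is a minimal sketch (the class 
and method names are made up for this example, and it assumes a JDK that 
has java.nio.charset.StandardCharsets):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    /**
     * Counts the bytes that belong to a multi-byte UTF-8 sequence.
     * Such bytes have the high bit set, so they are negative as Java bytes.
     */
    public static int countMultiByteBytes(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b < 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "caf" is three single-byte characters; U+00E9 takes two bytes.
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(data.length);               // 5
        System.out.println(countMultiByteBytes(data)); // 2
    }
}
```

A test like "b < 32" would flag both of those negative bytes as bad, even 
though together they are one perfectly good character.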

I started reading on how Java does it and poked around in the code (at 
least until it disappeared into Sun's private implementation). The New 
IO package "nio" has all the support to identify and replace "bad" 
characters. It might be the *right* way to solve the problem, but I 
think it is overkill and I don't have the time to look into it.
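
For reference, the nio route might look roughly like the sketch below. 
This is not what JSword does; the class name is hypothetical, and 
replacing with a blank is my assumption. Note that it only replaces 
bytes that are malformed for the charset, not valid control characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientDecoder {
    /** Decodes UTF-8 bytes, substituting a blank for each malformed sequence. */
    public static String decode(byte[] data) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(" ");
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }
}
```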

The simplest solution is to convert the byte array into a String and 
then fix up the characters that are not good. I am working on fixing it.
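
A rough sketch of that approach, not the final fix: the class name is 
hypothetical, and which characters count as "not good" is an assumption 
here (ISO control characters other than tab, newline and carriage 
return).

```java
import java.nio.charset.StandardCharsets;

public class CharCleaner {
    /** Decodes the bytes as UTF-8, then blanks out unwanted control characters. */
    public static String clean(byte[] data) {
        // Decode first, so multi-byte sequences are already whole chars.
        String text = new String(data, StandardCharsets.UTF_8);
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean allowedWhitespace = c == '\t' || c == '\n' || c == '\r';
            if (Character.isISOControl(c) && !allowedWhitespace) {
                out.append(' ');   // replace the bad char with a blank
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because the check runs on chars rather than bytes, accented and other 
multi-byte characters pass through untouched.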

DM

