[jsword-devel] Too good to be true
DM Smith
dmsmith555 at yahoo.com
Sun Jun 20 20:03:25 MST 2004
Joe,
The loop to replace bad chars with blanks does not work as expected.
Took me a while to remember that bytes (8 bits) are not chars (16 bits in Java).
Anyway, when the encoding of the byte array is UTF-8, each character is
encoded in one to four bytes. If it is UTF-16 instead, each character is
two or four bytes and the array is either BE or LE (Big/Little Endian,
and of course, Micro$oft does it backward of everyone else). The upshot
is that it is a pain to write a loop that figures out the encoding and
then how many bytes comprise each character. You cannot merely look for
bytes that are <32 (and so forth) because Java bytes are signed: a byte
that is <0 also passes that test, and in UTF-8 a byte that is <0 *is*
part of a multi-byte character.
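To illustrate the problem with the byte-level loop, here is a minimal
sketch. It is not the actual JSword code; the class and method names are
made up for the example, and it assumes the input is UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative, not taken from JSword.
final class SignedByteDemo {
    // The naive approach: blank out every raw byte that compares below 32.
    static byte[] naiveClean(byte[] data) {
        byte[] out = data.clone();
        for (int i = 0; i < out.length; i++) {
            // Java bytes are signed, so this is also true for 0x80..0xFF,
            // which are exactly the bytes of multi-byte UTF-8 characters.
            if (out[i] < 32) {
                out[i] = ' ';
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // e-acute (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String cleaned = new String(naiveClean(utf8), StandardCharsets.UTF_8);
        // Both bytes of the accented letter were replaced with blanks.
        System.out.println("[" + cleaned + "]");
    }
}
```

The accented letter is lost even though it was never a "bad" character,
because both of its bytes are negative when read as Java bytes.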
I started reading on how Java does it and poked around in the code (at
least until it disappeared into Sun's private implementation). The New
IO package "nio" has all the support to identify and replace "bad"
characters. It might be the *right* way to solve the problem, but I
think it is overkill and I don't have the time to look into it.
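For reference, a rough sketch of what the nio route could look like,
assuming UTF-8 input and a blank as the replacement. The class and method
names are hypothetical. Note that this only replaces byte sequences that
are not valid UTF-8; it does not touch valid control characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative, not taken from JSword.
final class NioDecodeDemo {
    // Decode UTF-8 bytes, substituting a blank for any malformed sequence.
    static String decodeLenient(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .replaceWith(" ");
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE, but decode() declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] data = { 'a', (byte) 0xFF, 'b' };
        System.out.println("[" + decodeLenient(data) + "]");
    }
}
```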
The simplest solution is to convert the byte array into a String and
then fix up the characters that are not good. I am working on fixing it.
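A minimal sketch of that approach, not the final fix. It assumes UTF-8
input, and the rule for what counts as "bad" (control characters below 32
other than tab, newline and carriage return) is my assumption for the
example. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative, not taken from JSword.
final class StringCleanDemo {
    // Decode first, then inspect chars instead of raw bytes.
    static String clean(byte[] data) {
        String text = new String(data, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Assumed rule: blank out control chars, keep common whitespace.
            boolean bad = c < 32 && c != '\t' && c != '\n' && c != '\r';
            sb.append(bad ? ' ' : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = "caf\u00e9\u0001!".getBytes(StandardCharsets.UTF_8);
        // The accented letter survives; only the control char is blanked.
        System.out.println("[" + clean(data) + "]");
    }
}
```

Because chars are unsigned, the <32 test here only matches real control
characters and leaves multi-byte characters alone.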
DM