[jsword-devel] invalid characters
DM Smith
dmsmith555 at yahoo.com
Sun Jun 13 17:27:09 MST 2004
Joe,
I saw that you added logic in SwordBook.java to strip out invalid
characters. There is a range that you missed. There is a no-man's-land
in Latin-1 and in Unicode which holds no printable characters and which
XML 1.0 discourages (XML 1.1 requires it to be escaped).
Unfortunately, Microsoft has claimed that region for special characters,
and these are often inserted by Microsoft products.
They are valid in cp1250 and in cp1252, both of which are MS encodings.
These are commonly called Latin-1, but they are not.
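To make the difference concrete, here is a small sketch that decodes the same byte under both encodings. The class name Cp1252Demo and the helper decode are made up for this illustration, and it assumes the JVM ships the "windows-1252" charset (most do, but it is not one of the charsets every JVM is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JSword.
final class Cp1252Demo {
    private Cp1252Demo() {}

    /** Decodes a single byte with the given charset and returns the resulting code point. */
    static int decode(int byteValue, Charset charset) {
        String s = new String(new byte[] {(byte) byteValue}, charset);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset is installed; Charset.forName throws otherwise.
        Charset cp1252 = Charset.forName("windows-1252");

        // Byte 0x93 (decimal 147) sits in the 127-159 region.
        System.out.printf("ISO-8859-1:   U+%04X%n", decode(0x93, StandardCharsets.ISO_8859_1));
        System.out.printf("windows-1252: U+%04X%n", decode(0x93, cp1252));
        // prints U+0093 (a control character) and U+201C (left smart-quote)
    }
}
```

So text that was really cp1252 but was read as Latin-1 ends up with code points in the 127-159 region instead of the smart-quotes the author typed.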
These are the decimal values of the code points that are not valid in
Latin-1 (whose code points are the first 256 of Unicode):
0-8,11-31
(9 is tab and 10 is line feed, and both probably should be replaced by a
space. 13, carriage return, is also allowed in XML.)
127-159
(These are commonly used by Microsoft for things like smart-quotes)
The numbers 256 and higher are outside of Latin-1.
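As a rough sketch of the filtering described above, something like the following could be used. This is not the code in SwordBook.java; the class InvalidCharFilter and its methods are hypothetical names for illustration. It takes the ranges exactly as listed (so 13 is dropped along with the rest of 11-31) and assumes the text has already been decoded into a Java String.

```java
// Hypothetical helper, not the actual SwordBook.java logic.
final class InvalidCharFilter {
    private InvalidCharFilter() {}

    /** True for the decimal ranges listed above: 0-8, 11-31 and 127-159. */
    static boolean isUnwanted(int cp) {
        return (cp >= 0 && cp <= 8)
            || (cp >= 11 && cp <= 31)
            || (cp >= 127 && cp <= 159);
    }

    /** Drops the unwanted ranges and turns tab (9) and line feed (10) into a space. */
    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == 9 || c == 10) {
                out.append(' ');
            } else if (!isUnwanted(c)) {
                // Everything else, including characters above 255, passes through.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u0093 and \u0094 are what cp1252 smart-quotes look like when read as Latin-1.
        String dirty = "a\tb\u0001c\u0093d\u0094";
        System.out.println(clean(dirty)); // prints "a bcd"
    }
}
```

Dropping the 127-159 characters loses the smart-quotes entirely. Mapping them to plain quotes would be another option, but that is a separate decision.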
DM