[jsword-devel] invalid characters

DM Smith dmsmith555 at yahoo.com
Sun Jun 13 17:27:09 MST 2004


Joe,

I saw that you added logic to SwordBook.java to strip out invalid 
characters. There is a range that you missed. There is a no-man's-land 
in Latin-1 and in Unicode that holds only control characters, which 
should not appear in XML. Unfortunately, Microsoft has claimed that 
region for special characters, and these are often inserted by 
Microsoft products.

They are valid in cp1250 and in cp1252, both of which are MS encodings. 
cp1252 in particular is commonly called Latin-1, but it is not.
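
To make the difference concrete, here is a small Java sketch that decodes 
the same byte under both encodings. The class name Cp1252Demo and its 
decode helper are made up for this illustration, and it assumes the JVM 
ships the "windows-1252" charset (most do, but only ISO-8859-1 is 
guaranteed).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of JSword.
final class Cp1252Demo {
    // Assumption: this charset is available on the running JVM.
    // Charset.forName throws UnsupportedCharsetException if it is not.
    static final Charset CP1252 = Charset.forName("windows-1252");
    static final Charset LATIN1 = StandardCharsets.ISO_8859_1;

    private Cp1252Demo() {}

    /** Decodes a single byte value (0-255) and returns the resulting code point. */
    static int decode(int byteValue, Charset charset) {
        String s = new String(new byte[] { (byte) byteValue }, charset);
        return s.charAt(0);
    }
}
```

Byte 0x93 (decimal 147) decoded as CP1252 is U+201C, a left smart quote. 
Decoded as LATIN1 it stays at code point 147, which is inside the 
127-159 control range.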

These are the decimal values of the code points that are not valid in 
Latin-1 (whose code points are the first 256 of Unicode):
0-8, 11-12, 14-31
   (9 is tab, 10 is line feed and 13 is carriage return; XML allows 
   these three, but they probably should be replaced by a space)
127-159
   (These are commonly used by Microsoft for things like smart quotes)

The numbers 256 and higher are outside of Latin-1.
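
As a rough illustration, a filter for these ranges might look like the 
Java sketch below. The names InvalidCharFilter, isUnwanted and clean are 
made up for this example and are not taken from SwordBook.java. It drops 
0-8, 11-12, 14-31 and 127-159, and replaces tab (9), line feed (10) and 
carriage return (13) with a space. Treating 13 as whitespace rather than 
dropping it is an assumption here, made because XML allows it alongside 
9 and 10.

```java
// Hypothetical helper; a sketch, not the logic actually in SwordBook.java.
final class InvalidCharFilter {
    private InvalidCharFilter() {}

    /** True for control code points that should be dropped outright. */
    static boolean isUnwanted(char c) {
        return c <= 8                      // 0-8
            || c == 11 || c == 12          // 11-12
            || (c >= 14 && c <= 31)        // 14-31
            || (c >= 127 && c <= 159);     // DEL plus the 128-159 region
    }

    /** Drops unwanted characters and turns tab, LF and CR into a space. */
    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\t' || c == '\n' || c == '\r') {
                out.append(' ');           // assumption: CR treated like tab and LF
            } else if (!isUnwanted(c)) {
                out.append(c);             // everything else passes through untouched
            }
        }
        return out.toString();
    }
}
```

Note that this only removes the stray characters. If the text really was 
cp1252, the smart quotes are lost rather than converted, so decoding with 
the right charset up front is the better fix when that is possible.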

DM

