[jsword-devel] invalid characters

Joe Walker joe at eireneh.com
Sun Jun 13 20:36:58 MST 2004

DM Smith wrote:
> Joe,
> I saw that you added logic to SwordBook.java to strip out invalid 
> characters. There is a range that you missed. There is a no man's 
> land in latin-1 and in UTF-8 which is not used and is invalid in XML. 
> Unfortunately, Microsoft has claimed that region with special characters 
> and these are often inserted by Microsoft products.
> They are valid in cp1250 and in cp1252, both of which are MS encodings. 
> These are commonly called Latin-1, but they are not.
> These are the decimal values of the code points that are not valid in 
> latin-1 (whose code points are the first 256 of Unicode):
> 0-8,11-31
>   (9 is tab and 10 is newline and probably should be replaced by a space)
> 127-159
>   (These are commonly used by Microsoft for things like smart-quotes)
> The numbers 256 and higher are outside of latin-1.

It occurs to me that replacing with space is a better plan all round - 
it's faster and less likely to break the text.
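
For what it's worth, here is a rough sketch of the replace-with-space idea, 
using the ranges from the list above (0-8, 11-31, 127-159). This is not the 
code in SwordBook.java; the class and method names (XmlCharCleaner, 
isSuspect, replaceSuspect) are made up for illustration, and it leaves tab 
and newline alone.

```java
// Hypothetical helper, not part of JSword: replaces the code points
// listed in the quoted message with a single space each.
final class XmlCharCleaner {
    private XmlCharCleaner() {
    }

    /** True if c is in one of the ranges 0-8, 11-31 or 127-159. */
    static boolean isSuspect(char c) {
        return c <= 8 || (c >= 11 && c <= 31) || (c >= 127 && c <= 159);
    }

    /** Returns a copy of text with every suspect character replaced by ' '. */
    static String replaceSuspect(String text) {
        char[] chars = text.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (isSuspect(chars[i])) {
                chars[i] = ' ';
            }
        }
        return new String(chars);
    }
}
```

Because each bad character becomes exactly one space, the output has the 
same length as the input, so any offsets into the text stay the same. For 
example, XmlCharCleaner.replaceSuspect("\u0093quoted\u0094") gives 
" quoted ".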

