[jsword-devel] invalid characters
Joe Walker
joe at eireneh.com
Sun Jun 13 20:36:58 MST 2004
DM Smith wrote:
> Joe,
>
> I saw that you added logic to SwordBook.java to strip out invalid
> characters. There is a range that you missed. There is a no-man's-land
> in latin-1 and in UTF-8 which is not used and is invalid in XML.
> Unfortunately, Microsoft has claimed that region for special characters,
> and these are often inserted by Microsoft products.
>
> They are valid in cp1250 and in cp1252, both of which are MS encodings.
> These are commonly called Latin-1, but they are not.
>
> These are the decimal values of the code points that are not valid in
> latin-1 (which is a proper subset of Unicode):
> 0-8,11-31
> (9 is tab and 10 is line feed and probably should be replaced by a space)
> 127-159
> (These are commonly used by Microsoft for things like smart-quotes)
>
> The numbers 256 and higher are outside of latin-1.
Thanks.
It occurs to me that replacing with space is a better plan all round -
it's faster and less likely to break the text.
Joe.
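
For reference, a minimal sketch of the replace-with-space approach, using the decimal ranges listed in the quoted message (0-8, 11-31, 127-159). This is not the actual SwordBook.java code: the class and method names (InvalidCharCleaner, isInvalid, clean) are hypothetical and only illustrate the idea. Tab (9) and line feed (10) are left untouched here, although the message suggests they could be replaced as well.

```java
/**
 * Hypothetical helper, not taken from JSword: replaces the code points
 * listed in the message above with a single space each.
 */
final class InvalidCharCleaner {
    private InvalidCharCleaner() {
    }

    /**
     * True for the ranges quoted in the message: 0-8, 11-31 and 127-159.
     * Note that 11-31 includes carriage return (13), as the message lists it.
     */
    static boolean isInvalid(char c) {
        return c <= 8 || (c >= 11 && c <= 31) || (c >= 127 && c <= 159);
    }

    /**
     * Returns a copy of the input with every listed character replaced by
     * a space. The length of the text is preserved, so any offsets into
     * the string remain valid.
     */
    static String clean(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            out.append(isInvalid(c) ? ' ' : c);
        }
        return out.toString();
    }
}
```

Replacing one character with one space keeps the string length unchanged, which is one reason this is less likely to break the text than stripping characters out.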