[sword-devel] RTFHTML filter bugs

Jaak Ristioja jaak at ristioja.ee
Thu May 22 03:33:32 MST 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 21.05.2014 20:05, Greg Hellings wrote:
>> The RTFHTML filter code appears to incorrectly parse the
>> following strings:
>> 
>> "\u-999999" -> getUTF8FromUniChar(48577) "\u-99999" ->
>> getUTF8FromUniChar(31073) "\u-0001" -> getUTF8FromUniChar(65535) 
>> "\u-00" -> getUTF8FromUniChar(0) "\u-0" -> getUTF8FromUniChar(0) 
>> "\u00" -> getUTF8FromUniChar(0) "\u001" -> getUTF8FromUniChar(1) 
>> "\u99999" -> getUTF8FromUniChar(34463) "\u-" ->
>> getUTF8FromUniChar(0) "\u--" -> getUTF8FromUniChar(0) "\u--2" ->
>> getUTF8FromUniChar(0) "\u-a" -> getUTF8FromUniChar(0)
>> 
>> I think all these should instead fail.
> 
> The last three should return -, 2, and a respectively if I read the
> wiki page correctly that allows a final character to use when the
> conversion otherwise won't work.

Ok, I missed the final character thing in both the wiki and the RTF
spec. With respect to our context, the RTF spec points that \u{num}
should immediately followed by an CP1252 character with, optionally, a
keyword-delimiting space between the control word and that character.
If another control word or control symbol follows, it is considered to
be that character.

But I think you are incorrect. "\u--", "\u--2" and "\u-a" should fail
because there is NO signed number. Just "-" is NOT a number. Neither
is "--2", which is an expression at best. I don't see any good reason
why we should allow stuff like "\u------2" or "\u++-+-++-0".

> Why you think the signed values that are zero prefixed should fail
> I don't understand.

Because 0-prefixed representations are in most contexts considered to
be octal.

> Those which fall beyond the range of a sixteen bit integer are the
> only ones I might agree should fall.  However, since Unicode now
> exceeds sixteen bits, think it is our limitation that ought to
> change.

We can't change our limitation, because then we don't have RTF any
more. And as I understand you want backwards compatibility.

>> 4) \par incorrectly appends a newline.
> 
> Why is a newline incorrect? Newlines are mostly ignored in HTML.

There is good no reason to append one. Its useless extra data. Even
for debugging purposes. Or, as a counterexample: why not instead
append a pattern of 3 spaces followed a newline repeated 4 times? - it
would be ignored anyway. IMHO, it lacks usefulness.

>> 5) "a\qc b" is converted to "a<center> b", but should instead be 
>> "a<center>b</center>" (' ' RTF delimiter output, missing HTML 
>> </center> tag)
>> 
>> 6) "a\par b" is converted to "a<p/> b", but should probably be 
>> "<p>a</p><p>b</p>" (' ' RTF delimiter output, missing HTML <p>
>> and </p> tags.
>> 
>> 7) Weird combinations of \par, \pard and \qc result in broken
>> HTML fragments or HTML fragments with unbalanced start and end
>> tags.
> 
> I don't believe the contract of this filter guarantees valid HTML,
> and HTML allows unbalanced tags.  In fact it is preferred in some
> older HTML specs for certain tags,  p a prominent example of such
> tags.

IMHO you're wrong. At least it is not a valid XHTML (XML) or HTML 5
balanced fragment. I'm not completely sure about earlier HTML
standards. The HTML 5 draft provides a guide on how to handle such
invalid cases, but these are not considered "valid". And as such, one
of two things should happen:
  1) we should output valid HTML,
  2) users of RTFHTML must fix the output or at least ensure the
output is placed properly, so it doesn't interfere with any HTML
places after the output.

>> 8) Unsupported control sequences do not cause the function to
>> fail, but are passed to output as plain text (including the
>> backslash).
>> 
>> 8) Unescaped '{', '}' and '\' characters are not handled properly
>> (to pass these from RTF one would need to use the control symbols
>> "\{", "\}" and "\\" respectively).
> 
> 
> The rest of your objections seem to be based on a different
> objective than SWORD filter objectives. The prose is not to force
> compliance to a strict spec but instead to give a "best effort"
> attempt at conversion. The same way that most browsers will accept
> invalid input but make a best effort to display (unescaped &
> characters will usually display as is and invalid nesting such as
> having a div inside of a p tag still works out somewhat reasonably)
> the SWORD engine is lax in what it accepts.
> 
> It follows the general maxim "be strict in what you produce but
> lenient in what you accept." Crosswire produced content should not
> include such invalid input, but the engine is intentionally written
> to make a best attempt to handle innocuous invalid input. This is
> because we want to encourage as many people as possible to use the
> engine even if they are not strict in what they produce.
> 
>> If there are existing modules with bad content or in places where
>> the filters are producing invalid output we should fix it, but we
>> don't need to go and get stringent about the conversion throwing
>> errors or the like because of an invalid control sequence or an
>> unknown Unicode character.

I have two big with these arguments. First, we have the current
implementation RTFHTML and the wiki page documenting its behaviour,
and these don't match. The current implementation does NOT properly
parse valid RTF. For example, it fails to take into account RTF
delimeters, e.g. "\u8364 ab" should output "€b" if the unicode
character can be displayed, and "ab" otherwise. But currently (I
think) it outputs getUTF8FromUniChar(8364) followed by " ab". Both
ignoring the following character and the delimiter.

Secondly, I've not seen a piece of documentation about accepting lax
behaviour. Even if we do accept invalid input, we should do it in a
safe and secure manner which does not result in invalid output.
Additionally, there should be a switch whether "best effort"
processing on lax input is allowed. Currently RTFHTML always succeeds
without a warning.

Blessings,
Jaak
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQgcBAEBAgAGBQJTfdJ5AAoJELozJlbjIn791TxAAJeiwVhpWdOIfKqcxm3l0R1P
GqDUB05kqsfNtoOMPZmZb8raKuPw6kol0ysN7e9xLMGdcd0Gp8/rxPY3O/WXhyOk
RWSKYIGpPXyAE4ZMTz9O96jMQsJILxHTGwxmtogwD9XBotLrp4nlnbv4CNpkKd60
qNU9V/hCim2c/qMOokwwUZ5LskKV/5hENO560xdGOMM2ItfUpnBATb+wQQ0CBIba
ZGddbAEb+aVZDWEBjajurVtCH9jlcwf7iyCbh54f+UQ8KmgBhE4rZUVHcoy49s16
sEFCQ8wOAcovJCI6+py5W49TneR8eCIGojLGFyvzFjt5pvXZCZkJV/tchkuxQxpR
nswvNpD9DefWul8IsTlYy31zhJZu7FDvB2cB2p9z+pD+ndz/Zox5XrKOaWZRVm9u
xmCqWB/K0jl2M4RwCvNgaafjDUTasOW7z+qU5d/n44/ohJA8w+GekYSyJPUUIaXJ
0Ki3OCZVAryJnQZqO1oTiFnpWDionQFsLvQJil8JzGbKLhwiUUXJb5Dx9dGIMBEa
QVtZsIBjRn1PjM3PVpJ5pdMYNhPuaygLRskafZ6WfwsAr17mOTE8m/MBb6cqcQQA
vE7RYYr7qUFi2aDCmSK62xO7eR1QwgIFJW46QPeeuHNhu15KNtsueOOGgqHepChU
qK+XnigHZevCrV05K1cNc/F3M2HiunUL+6RgbY9/kBOOvN2xMo8lAowxwawcn6fY
1jvz3t8npf3s6Ki7BV+ccktdhyVVFKuobuARrBWcE34H8weRfXqdgsUOUmy3V7jk
fGGbc9YlWxy8VDu4gbE2hjyw7aGVOf8tMy4Q0SbgkkBd6AmGBkVJOl5MZET4/EsU
QQ/2c35WTkjodLjLbV0j2/TaOCdzq4D8A5g2Vxt8lPV1LNjG4MqT/8XmrNsQDoDu
Rih1p0DricY9TQrLpvsyZfvIiT2fjTkOiW1jYWdJmtm15Ud3FWkbJw6BMyFV5NTJ
tz7531Ku/JRrHKfxg2feSWX37ZfdhIshUs+xD3NPQTDlWrmS527/+dDI5EBd+D+L
7obK0HU/bLV+nXN+U42GsmtJyh5MgEGmmHJDXOPTKoSmawl5v6WxFtbZD641XKzu
pG3HzePbK9GDYJvKAWfFlnnRBmQnD7IUKJzh8Rh8A0VdC1qKOYlMpMZ8/MkB+Ak5
Rxgz/NVOhFewe0f+2sLE1OCYUQChfcZEBW9WEd+rUu70CKejZjRBJF/HiwjzyzTD
qy0SmfwbK1NTRR0409berIjIxNTCpN5OBQ+1upYtUQe90o4pjFlwTX1B/In+u5ox
5u8iTKFv1vqF7+7oFQIymZQl8IHSEkBoPL8HmZxD+DI++J4cSbRzIkDAoNagQias
NU2/MYhI0kMS5Kfpe5twwwZ8kp0dxmY8wK/5bEEHKKfI9oBxmNSYWD8Swp/TiEyM
7lpkUmOL8hKVDUHxaeodDsZo3SmnnHr5F++ltbrGzvT9YnEQ6ai2oBetfm6zD5Zd
1n7/JlHmbqhn4w5U07njP9wN2pCvRG8v5xllE+wEDBrZNgHNWY6I6rKIn3+D0Rpj
xg7YmgLrtNoRWfBm0H3RI27CsmvIQx2fzrxfgeZ/LbnjFAwy3w26UwoSYvtGFnr8
VFZ3VMOzn3U7Ttm8yE6xrumHaxN7WOQRJ3/YRN6fgOpV8Hx4bWkxAUi9wK7Qz77V
by0YzdrEt4pTxr/sruUbH3HafvKWSPtQGoT03CWp5ozDBsAq1ESqP68RWoqgJv4d
OLah93mnr/2rBQGTfR3Yg32rl9azdUBFnlzokpboYpIblT/s2R+jrADHkvejEG7W
VLbmxwgOw+xyxalRQgczbZqEFSpH3ZVE1yKVgm61eBoWGgnXq9ppd621phZfVvNF
xaCmO3uInc1KSpeuj50kaiMRdIrQv938AkXRjx7K+6QZ3y/GuV6U6Z6MJ+JcF37t
B7JHhm/QmituX5FOykul4GvJbepoZTe1meztmEuxiT3S6LjBnKHRSIBvNq9r0Gzt
tHZWGc0wPpHddwUE8qqnIHI5LPVVII0GpxLv7DDgyQVe/oDgZb4kvTc3PjNXVD42
6AB2rPEeJWI/yH1JNukhnAh6IgofvgYXR4AYlr8j0395uwbSRdjNxDpgqJKu3nVA
KGcGF7MXLdDZzU973XUUTlvPzT4lch2GWtiCNTcjAaLTo4fm3KJVxDQeDHYN4NvO
YkYps/pmDmaXcvVQq6Kv2i4spjN5pwky5Jf75XkBMTqzbLvvH/9uzM4A+qZJj8tN
sPGBbyYOrvdksNccXVgdQC1joOKRfbXetnGsLElvkDCCekZZlF5eDWFt+TdlMSMQ
2QRq6TgV4nC+1Jfx6J/uLdWioBU//kJM5PEYOrEDoKRX4+ZebGC4k1fTwLd4mu/6
gWJQ0oIO4duesOv0Rtb0/JETuevgUzDcIUTBRAww/0+wnmD3dyDAZE3WAIlMcVyC
k+8IWq7LS3OO3/LXcFyQHbUs8nLnrQH/9cTPCDZJkltSOuHY9ZO1mkVL6Z6GILJj
zIPGcarNijagw1WagcAh1mDfK9SIi+AgYgcXc8Lcv+QNDogY1sBKc+9godHoXFKL
47ssnce4gnLFo8I13Epin8aUKucQ++OyPpv5LpxcqVKqCyeddXLH3gp927l5/sBT
RZNich38Eg520YKpN8FKisguwi4IW7u+NylJITVDeQ93eqx9m90d+7OMVieN147u
bxb2y04zhJz8pqKM73eS
=GRPO
-----END PGP SIGNATURE-----



More information about the sword-devel mailing list