[sword-devel] RTFHTML filter bugs

Greg Hellings greg.hellings at gmail.com
Wed May 21 10:05:28 MST 2014


On May 19, 2014 5:12 PM, "Jaak Ristioja" <jaak at ristioja.ee> wrote:
>
> Hi!
>
> 1) According to http://www.crosswire.org/wiki/DevTools:conf_Files the
> \u control word should be followed by a 16-bit signed integer. The
> wiki page doesn't mention this, but I assume it is in ASCII in decimal
> form.

It would be either CP1252 or UTF-8, like the rest of the file.
>
> The RTFHTML filter code appears to incorrectly parse the following
> strings:
>
>   "\u-999999" -> getUTF8FromUniChar(48577)
>   "\u-99999" -> getUTF8FromUniChar(31073)
>   "\u-0001" -> getUTF8FromUniChar(65535)
>   "\u-00" -> getUTF8FromUniChar(0)
>   "\u-0" -> getUTF8FromUniChar(0)
>   "\u00" -> getUTF8FromUniChar(0)
>   "\u001" -> getUTF8FromUniChar(1)
>   "\u99999" -> getUTF8FromUniChar(34463)
>   "\u-" -> getUTF8FromUniChar(0)
>   "\u--" -> getUTF8FromUniChar(0)
>   "\u--2" -> getUTF8FromUniChar(0)
>   "\u-a" -> getUTF8FromUniChar(0)
>
> I think all these should instead fail.

If I read the wiki page correctly, the last three should return "-", "2",
and "a" respectively: the page allows a trailing fallback character to be
used when the conversion otherwise won't work.

I don't understand why you think the signed, zero-prefixed values should
fail. The values that fall outside the range of a sixteen-bit integer are
the only ones I might agree should fail. However, since Unicode now
exceeds sixteen bits, I think it is our limitation that ought to change.
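
As an aside, the numeric results in your list are consistent with parsing
the decimal text as an integer and keeping only its low sixteen bits. Below
is a minimal Python sketch of that arithmetic. It is not the filter's C++
code, and the helper name `low16` is made up for illustration. It only
covers the well-formed digit strings from your list:

```python
def low16(text):
    """Parse a decimal string and keep only the low 16 bits.

    Hypothetical helper for illustration, not a SWORD function.
    """
    return int(text) & 0xFFFF


# Inputs and results as quoted in the list above.
reported = {
    "-999999": 48577,
    "-99999": 31073,
    "-0001": 65535,
    "-0": 0,
    "00": 0,
    "001": 1,
    "99999": 34463,
}

for text, value in reported.items():
    assert low16(text) == value, (text, low16(text), value)
```

If that reading is right, the out-of-range inputs simply wrap around
modulo 65536 rather than producing arbitrary values.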

>
> 2) In case an exception is thrown, text might contain a partial result
> or the original value.
>
> 3) For control word \pard (and similarly for \par and \qc) it
> incorrectly parses \pardx as \pard and "x", where it should instead
> fail due to an invalid control word \pardx.
>
> 4) \par incorrectly appends a newline.

Why is a newline incorrect? Newlines are mostly ignored in HTML.

>
> 5) "a\qc b" is converted to "a<center> b", but should instead be
> "a<center>b</center>" (' ' RTF delimiter output, missing HTML
> </center> tag)
>
> 6) "a\par b" is converted to "a<p/> b", but should probably be
> "<p>a</p><p>b</p>" (' ' RTF delimiter output, missing HTML <p> and
> </p> tags)
>
> 7) Weird combinations of \par, \pard and \qc result in broken HTML
> fragments or HTML fragments with unbalanced start and end tags.

I don't believe the contract of this filter guarantees valid HTML, and HTML
allows unbalanced tags. In fact, some older HTML specs prefer leaving
certain tags unclosed, with p being a prominent example.

>
> 8) Unsupported control sequences do not cause the function to fail,
> but are passed to output as plain text (including the backslash).
>
> 9) Unescaped '{', '}' and '\' characters are not handled properly (to
> pass these from RTF one would need to use the control symbols "\{",
> "\}" and "\\" respectively).
>

The rest of your objections seem to be based on a different objective than
the one SWORD filters have. The purpose is not to force compliance with a
strict spec but to make a "best effort" attempt at conversion. Most
browsers accept invalid input and make a best effort to display it:
unescaped & characters usually display as is, and invalid nesting such as
a div inside a p tag still works out somewhat reasonably. In the same way,
the SWORD engine is lax in what it accepts.

It follows the general maxim "be strict in what you produce but lenient in
what you accept." Crosswire produced content should not include such
invalid input, but the engine is intentionally written to make a best
attempt to handle innocuous invalid input. This is because we want to
encourage as many people as possible to use the engine even if they are not
strict in what they produce.

If there are existing modules with bad content, or places where the
filters produce invalid output, we should fix them. But we don't need to
get stringent and have the conversion throw errors or the like because of
an invalid control sequence or an unknown Unicode character.
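
To make the "best effort" idea concrete, here is a rough Python sketch of a
lenient converter. It is not the SWORD filter code: the names `KNOWN`,
`CONTROL_WORD` and `lenient_rtf_to_html` are made up, and the table holds
only the two conversions quoted earlier in this thread. It reads a whole
control word before looking it up and treats one trailing space as the
delimiter, but it never raises on input it does not recognize:

```python
import re

# Only the two conversions quoted in this thread; everything else is
# treated as unknown. Hypothetical table, not taken from SWORD.
KNOWN = {"par": "<p/>", "qc": "<center>"}

# An RTF control word: backslash, ASCII letters, an optional signed
# decimal parameter, and an optional single space as delimiter.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?\d+)? ?")


def lenient_rtf_to_html(text):
    """Convert known control words; pass everything else through as is."""

    def convert(match):
        word, param = match.group(1), match.group(2)
        if word in KNOWN and param is None:
            return KNOWN[word]
        # Unknown or unexpected control word: keep the original text.
        return match.group(0)

    return CONTROL_WORD.sub(convert, text)


print(lenient_rtf_to_html(r"a\par b"))
print(lenient_rtf_to_html(r"a\pardx b"))
```

With this sketch, `a\par b` becomes `a<p/>b`, while `a\pardx b` is left
untouched instead of being read as `\pard` followed by "x". Whether the
delimiter space should be dropped in the real filter is a separate
question; the sketch only shows that doing so does not require failing on
unknown input.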

--Greg

> Maybe I'll get around to fixing this someday during daytime. To save me
> extra work, I'd appreciate any comments on this before I start any
> coding, especially if the Sword library needs to deviate from the RTF
> specifications.
>
>
> Blessings,
> Jaak
>
> PS: I'm glad there are no memory errors in this function. :)
> PPS: Please forgive me for having studied formal languages.