[sword-devel] RTFHTML filter bugs

Greg Hellings greg.hellings at gmail.com
Wed May 21 09:44:34 MST 2014


On May 21, 2014 8:00 AM, "Jaak Ristioja" <jaak at ristioja.ee> wrote:
>
> So this means that actually we want non-standard RTF (someone should
> update the wiki). Should we assume UTF-8? Are you sure we don't have any
> modules with ISO-8859-something encoded values?
>

The wiki states that the Unicode character is preferred, at least for conf
files, over the RTF escaped value. Specifically, the character must be
encoded as UTF-8 or CP1252.
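
To make the two accepted encodings concrete, here is a rough sketch in Python. It is my own illustration, not code from the SWORD engine; the name decode_conf_value and the "try UTF-8 first, then fall back to CP1252" order are assumptions made only for this example.

```python
def decode_conf_value(raw: bytes) -> str:
    """Decode a conf value, preferring UTF-8 and falling back to CP1252."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption for this sketch: bytes that are not valid UTF-8 are CP1252.
        return raw.decode("cp1252")


utf8_value = decode_conf_value(b"\xe2\x82\xac")  # euro sign encoded as UTF-8
cp1252_value = decode_conf_value(b"\x80")        # euro sign encoded as CP1252
print(utf8_value == cp1252_value == "\u20ac")
```

Both inputs decode to the same euro sign. The fallback is only a guess, though: a CP1252 value whose bytes happen to form valid UTF-8 would be read as UTF-8.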

> If we choose any ASCII superset encoding we have to consider at least
> the two points:
>
>   * Since the RTF control words and delimiters are specified in ASCII
> only, we need to decide how the bytes of the superset act as
> delimiters and parts of "RTF" control words. For example, whether the
> Unicode letter, number, spacing, punctuation, control etc. characters
> constitute parts of RTF control words or act as delimiters.
>
>   * In case of encodings where characters may consist of multiple bytes
> (e.g. the variable-length UTF-8) we must consider the character
> boundaries. We can't just pass through any non-ASCII byte values. For
> example, the following bit sequence wouldn't make sense:
>
>   11100010 01011100 10000010 01110001 10101100 01100011
>

Did you literally split the individual bytes of the euro character around
the other bytes? What valid encoding could possibly permit that? Is that a
valid UTF-8 sequence? If not, then the file fails to be UTF-8 encoded, and
the engine will either raise an error or behave in undefined ways due to
invalid input.
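
For what it's worth, the question can be checked mechanically. The following is a small Python sketch of my own, not anything taken from the SWORD sources; is_valid_utf8 is a made-up helper name, and it relies only on Python's strict built-in UTF-8 decoder.

```python
# The six bytes from the bit sequence quoted above.
interleaved = bytes([0b11100010, 0b01011100, 0b10000010,
                     0b01110001, 0b10101100, 0b01100011])

# The same six bytes with the euro sign kept contiguous, followed by \qc.
contiguous = b"\xe2\x82\xac\\qc"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly with a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(interleaved))  # False
print(is_valid_utf8(contiguous))   # True
```

The interleaved form is rejected because the lead byte 0xE2 announces a three-byte character but is followed by 0x5C, which is not a continuation byte.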

--Greg

> which is a UTF-8 encoded Euro sign, €, interleaved with bytes of the
> ASCII string "\qc". It just doesn't make sense, whereas the following
> sequences would be correct:
>
>   11100010 10000010 10101100 01011100 01110001 01100011 (€\qc)
>   01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)
>
> So depending on the encoding it would be correct to detect such cases,
> otherwise we end up with invalid Unicode output.
>
> Blessings,
> Jaak
>
> On 21.05.2014 15:19, Chris Burrell wrote:
> > I believe some conf files have direct unicode (rather than escaped
> > sequences) in them and that is preferred.
> >
> > On 20 May 2014 23:28, "Jaak Ristioja" <jaak at ristioja.ee> wrote:
> >
> >     I've never done BiDi, but I'm not sure I need to take that into
> >     account while fixing the RTF parsing. As I currently understand it,
> >     this particular piece of code does not support any part from the RTF
> >     spec dealing with bidirectional text handling. Hence all BiDi
> >     information contained in the configuration file strings (e.g. About=)
> >     is contained either in the plain ASCII text or the \u<num> Unicode
> >     escapes which this algorithm should pass through unmodified.
> >
> >     ...except for HTML entities which should actually be escaped. This
> >     bug in the algorithm I previously failed to notice. Additionally I
> >     forgot that non-ASCII characters in the input string should also lead
> >     to parsing failure.
> >
> >     Jaak
> >
> >
> >     On 20.05.2014 21:01, David Haslam wrote:
> >     > Take care with Right to Left languages such as Hebrew.
> >     >
> >     > i.e. After any patches to the filter, please include some testing
> >     > for BiDi text in the About= field and others.
> >     >
> >     > David
> >     >
> >     >
> >     >
> >     > --
> >     > View this message in context:
> >     > http://sword-dev.350566.n4.nabble.com/RTFHTML-filter-bugs-tp4653969p4653970.html
> >     > Sent from the SWORD Dev mailing list archive at Nabble.com.
> >     >
> >     > _______________________________________________
> >     > sword-devel mailing list: sword-devel at crosswire.org
> >     > http://www.crosswire.org/mailman/listinfo/sword-devel
> >     > Instructions to unsubscribe/change your settings at above page