[sword-devel] RTFHTML filter bugs

Jaak Ristioja jaak at ristioja.ee
Wed May 21 05:59:45 MST 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

So this means that actually we want non-standard RTF (someone should
update the wiki). Should we assume UTF-8? Are you sure we don't have any
modules with ISO-8859-something encoded values?
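
One rough way to answer the ISO-8859 question would be to scan the raw
conf values and see which ones are not plain ASCII and also fail to
decode as UTF-8. The sketch below is in Python purely for illustration;
classify_encoding is a hypothetical helper, not part of SWORD. Note that
some ISO-8859 byte sequences happen to be valid UTF-8 by coincidence, so
a "utf-8" result is only a strong hint.

```python
def classify_encoding(raw: bytes) -> str:
    """Classify a raw config value as 'ascii', 'utf-8' or 'unknown-8bit'.

    Hypothetical helper for auditing module conf files; not a SWORD API.
    """
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        # Possibly ISO-8859-x or some other 8-bit encoding.
        return "unknown-8bit"
```

Any value reported as "unknown-8bit" would need a closer look before we
could safely assume UTF-8 everywhere.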

If we choose any ASCII superset encoding, we have to consider at least
the following two points:

  * Since the RTF control words and delimiters are specified in ASCII
only, we need to decide how the non-ASCII bytes of the superset are
treated: for example, whether Unicode letters, numbers, spacing,
punctuation, control characters etc. can form part of an RTF control
word or act as delimiters.

  * In case of encodings where characters may consist of multiple bytes
(e.g. the variable-length UTF-8) we must consider the character
boundaries. We can't just pass through any non-ASCII byte values. For
example, the following bit sequence wouldn't make sense:

  11100010 01011100 10000010 01110001 10101100 01100011

which is a UTF-8 encoded Euro sign, €, interleaved with the bytes of the
ASCII string "\qc". That is not valid UTF-8, whereas the following
sequences would be correct:

  11100010 10000010 10101100 01011100 01110001 01100011 (€\qc)
  01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)

So, depending on the encoding, the filter should detect such cases;
otherwise we end up with invalid Unicode output.
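
To make both points concrete, here is a minimal sketch (Python for
illustration only; the real filter is C++ and all names below are
hypothetical). It assumes the input is meant to be UTF-8 and that a
control word is a backslash followed by ASCII lowercase letters, an
optional signed numeric parameter and an optional single space, so any
non-ASCII byte simply ends the control word.

```python
import re

def bits_to_bytes(bits: str) -> bytes:
    """Turn a string of space-separated binary octets into bytes."""
    return bytes(int(octet, 2) for octet in bits.split())

def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Assumed control word shape: backslash, ASCII lowercase letters,
# optional signed integer parameter, optional single space delimiter.
# With a bytes pattern, [a-z] can only ever match ASCII bytes.
CONTROL_WORD = re.compile(rb"\\([a-z]+)(-?[0-9]+)?( ?)")

def read_control_word(data: bytes, pos: int):
    """Return (name, end_position) for a control word at pos, or None."""
    match = CONTROL_WORD.match(data, pos)
    if match is None:
        return None
    return match.group(1).decode("ascii"), match.end()

interleaved = bits_to_bytes(
    "11100010 01011100 10000010 01110001 10101100 01100011")
euro_first = bits_to_bytes(
    "11100010 10000010 10101100 01011100 01110001 01100011")
qc_first = bits_to_bytes(
    "01011100 01110001 01100011 11100010 10000010 10101100")

for name, data in [("interleaved", interleaved),
                   ("euro_first", euro_first),
                   ("qc_first", qc_first)]:
    print(name, is_valid_utf8(data))
```

With these assumptions the interleaved sequence is rejected up front,
while in the two well-formed sequences the control word "qc" is found
without swallowing any byte of the Euro sign.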

Blessings,
Jaak

On 21.05.2014 15:19, Chris Burrell wrote:
> I believe some conf files have direct unicode (rather than escaped
> sequences) in them and that is preferred.
> 
> On 20 May 2014 23:28, "Jaak Ristioja" <jaak at ristioja.ee> wrote:
> 
>     I've never done BiDi, but I'm not sure I need to take that into account
>     while fixing the RTF parsing. As I currently understand it, this
>     particular piece of code does not support any part from the RTF spec
>     dealing with bidirectional text handling. Hence all BiDi information
>     contained in the configuration file strings (e.g. About=) is contained
>     either in the plain ASCII text or the \u<num> Unicode escapes which this
>     algorithm should pass through unmodified.
> 
>     ...except for HTML entities which should actually be escaped. This bug
>     in the algorithm I previously failed to notice. Additionally I forgot
>     that non-ASCII characters in the input string should also lead to
>     parsing failure.
> 
>     Jaak
> 
> 
>     On 20.05.2014 21:01, David Haslam wrote:
>     > Take care with Right to Left languages such as Hebrew.
>     >
>     > i.e. After any patches to the filter, please include some testing
>     > for BiDi text in the About= field and others.
>     >
>     > David
>     >
>     >
>     >
>     > --
>     > View this message in context:
>     http://sword-dev.350566.n4.nabble.com/RTFHTML-filter-bugs-tp4653969p4653970.html
>     > Sent from the SWORD Dev mailing list archive at Nabble.com.
>     >
>     > _______________________________________________
>     > sword-devel mailing list: sword-devel at crosswire.org
>     <mailto:sword-devel at crosswire.org>
>     > http://www.crosswire.org/mailman/listinfo/sword-devel
>     > Instructions to unsubscribe/change your settings at above page
>     >

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQgcBAEBAgAGBQJTfKM/AAoJELozJlbjIn79gXpAAMxwoq17dvVzCikAplQUjON0
xDJXlDFfKK14w8xj11NSUvJEPjVWlwTi82WzEplQBKfkxtFY09010ZB5IKotEtSP
dcJMjzc4FmuJmPifB7s3gtEOQ81OThMArlnq/aFHvGj6+5D8qjFkQiqOzSJeaORS
C8dPobXSnJkJ/g3zKCdJf/k5msphFbmuIQOD4Ovco2ZHHlukL8QNd8pt3RcPN4Hy
BMxYx9glw3+YJK5Jj63isdsmOGLeRory3PDcHZoPJzu8zssW78Chlsgoh+xWlfkn
zI5PdP1ARhq7K/kUnPp7jXx3LDFiEbmPjrNBi/A03k+n7s2oZWdxm9uBfEEq5VpB
DpdCA19msaEE+fOWOyAAvvZstnCxYrrd01j+HxXUGoA4JHBBVQo01H5udfOdbiBu
nSI5M0GUKBjSSfLSmrh2oTC0qniVMRw4t+IAIJU1chjfBCsoNAx6xTiDE8x+hpjd
A+s8wvgBU0gNbqeOMvWXkHeOWSu7O0oPEp0vVl+6fUPPFDHGR1+2vPXLnCcbASwj
pEJwls9IBis7touUlIt4stlois1Imtw8zKGXXU8h0UmSgRHK0G2Ck8clNptClkMY
+9xP+TGXZI0q+WlzA7M4aD2puQAiJ0iJTm/kV+QGF/1RiaWNGWTG7Oxfufz5XdDn
xqTrAkYoVw3a+ZRgZPs4YbyK3ysVqncvAOFKuqLcEEwiA4zEYztGxPMAhcypQJFH
n6ORlF3/Kmkukj3eapanznmcvoZ+H/APKNWmo2b+TZ10WABCtZVDO+pd1Ed+l2U5
EytGhMYEqNSMqV109k3It9Ll7a8GVQa6k7AX8/BSXlh6/GaaoIzkSgGJBFAU8Zsj
dW7u6O7wBOTBmE+lUUrwA3igveDhTDhzjORE7Ek74xkhoNVwh1DmqWwJGZbIGb5R
47yWwxql4pqS4jq3M+TM8SUZaeY/NTjRTn+WLFBGahKVH5Gg/NiB6onfBBRLyYwK
iorFYngEhpKDNJBPp8rfSIg4NxhbupwG9B1Bbrdg6Kj+E+kGsXDuDkBWQEgf1Jwv
3XbiDBEjUf2wr4TdbUx9GrwrBNP7q9YW0RmbQGlvIahVwtr3/PJGhiU/kS47fAZf
HQMac1US7eYgtW5hzH/YG+41cCI9J0byZBEuSJS2GuSd0LD0Of4bPLxyOxiXqvTU
kwSPIQwsBOZpFIA5Qfc35x5KxVqCGUYBvXhglpZtZGlGr8uIPpshc1gz9ukCejuz
754upiYTlCzocKpvPbER9QpMZFYb+iDTdc4bU8whmxkP8ATKSDQmYIqUS2ohLKV8
co5X0741kRaG5oNOBBrM7kn/9nWgFNspFBkJAvGLbD8h6R8S11cu7INrXzJjxv/e
bCAxGXb2UQXXUe18FCYeqUvl5VdQOQt3f7gja3XbitCKkJjUA6i7t1+5vjuMQsAY
NFliiFxNeNjNE4hIIpvA7G3N+2t0W8IjGsystXm6ONN0lM78eLZLLlsrfkPi8NgR
Nydc78zEJfGr8APkiYleIYTi6ftgtDrI9927wNWqgIPqO4vqA1TZngX8wx6YPJou
uF8cSnI0PlcOfEKtsBgZedOpbZlqAt61wvMGMW0YUfiL5LhuP95KQekqDMMBDCQX
mGMehJHRJ5PvoDt8485lGOWdwXn6T7PlakZ1UCtYeMV0Nx2PfPBfU7bnCwSRFQKg
vpUhPCkW5qpvlkBLOpPLwkqcZGiSyLL/YSGp6cVExeeQVHc2hI169zGY9dUHBEMN
CaKwI9Wjn5V95bax3gsMlHnY9c1TB/6yLWnVEJAilm5ijgWW5KxstWoJMd/OptY8
QvbsOA7K36HfwOwNCblQCGbUrPjikhXTw8ew1aap4OHqGIKUWCMm3z/eHOPRU5mD
Ce2Z86vwYb9T2PcyqUiZOs1WW9TBZx70Hr2JQmRwgMyWpT4DERjofP83IA8vxZdP
9uKT4j+EBUGoI2zGgE2lapLL/VWrzt6OBMv5iUmR4OIFLdnHevAAy5w53c4+tWjs
SNmjAz8tW5FWiVFR99FQBN6KWXIjKdJGQl+zccOlE0zBQe2grnqFmUeuuBbPiojb
Wch+hqrKDX/VLr/gIP9EErMJ7ZvZ7st+gwPZlFwC7Evf3OCrUnRYIbMI6iLGLoZ6
c9YLbK67hj1Ho+X99XTeoQj8l2V14TSRCFZBmO7Os5L2kXOEiw0yeV8Dn87LJPFp
4VcfgFGLi9FRnI36K4+h5JWoyhrGhNHrHsO60Xs2U3a02fRfeUgn/T1Xf0xXbVMC
gX8zJ3aC15pUy/dJaqJ4HIszzPe5ErO7J9GB7AhjVnx8pEE0xayoJkA4VM0YF8Lk
b/IF04rm/dNlsLL7zRzdGpr2uo9esMzFJDYcHnhInhaE7t2iGR4+cgUdRJKA7NJW
ZumxNz3a1EjeZHRLqRxfT8O6Cc55hG4GwVO7JxUnXJtRMx+ENXZslf4ExGdhcTdf
ntjsfngGemyKYv8aMJ9pDlLFVyR+91xSpFp8QYRDtcP14y5Dfh/jh4Kmdu0BqTzt
Wt0KUUZQlx8Qu8XJbatPiieDmjtQ8HPmhsHQAA+QmLzrhEmakrAjTfpWq5eNYQeQ
ei6tawFllPyuNrez2BOP3nfXuSBlfn2+yBfi3H1mJc8urrFwDtt/zqTHdoOtyCNO
PVaqMROmVzgdKg7yyXTBek3UBe8TxMWigvepRvxkGlmMZQkW42/5ft0269esY/bw
tuy57vDPyvQfrJzpN62y
=RNpJ
-----END PGP SIGNATURE-----


