[sword-devel] RTFHTML filter bugs
Jaak Ristioja
jaak at ristioja.ee
Wed May 21 06:16:17 MST 2014
To sum up, we would need to agree on and specify an RTF subset which is
Unicode-aware (UTF-8 only?), and implement a Unicode-aware transducer
for it.
On 21.05.2014 15:59, Jaak Ristioja wrote:
> So this means that actually we want non-standard RTF (someone
> should update the wiki). Should we assume UTF-8? Are you sure we
> don't have any modules with ISO-8859-something encoded values?
>
> If we choose any ASCII superset encoding, we have to consider at
> least these two points:
>
> * Since the RTF control words and delimiters are specified in ASCII
> only, we need to decide how the bytes of the superset act as
> delimiters and as parts of "RTF" control words. For example,
> whether the Unicode letter, number, spacing, punctuation, control
> etc. characters constitute parts of RTF control words or act as
> delimiters.
>
> * In case of encodings where characters may consist of multiple
> bytes (e.g. the variable-length UTF-8) we must consider the
> character boundaries. We can't just pass through any non-ASCII byte
> values. For example, the following byte sequence (shown in binary)
> wouldn't make sense:
>
> 11100010 01011100 10000010 01110001 10101100 01100011
>
> which is a UTF-8 encoded Euro sign, €, interleaved with bytes of
> the ASCII string "\qc". It just doesn't make sense, whereas the
> following sequences would be correct:
>
> 11100010 10000010 10101100 01011100 01110001 01100011 (€\qc)
> 01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)
>
> So depending on the encoding we would need to detect such cases,
> otherwise we end up with invalid Unicode output.
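
The interleaving example above can be checked mechanically. The following is only a minimal sketch, assuming UTF-8 is the chosen encoding; it uses Python for brevity, and the helper name is_valid_utf8 is hypothetical, not part of SWORD. It relies on the standard strict UTF-8 decoder to reject a lead byte that is not followed by its continuation bytes.

```python
# Sketch only: assumes the input encoding is UTF-8.
EURO = "€".encode("utf-8")   # bytes E2 82 AC
CONTROL = b"\\qc"            # ASCII bytes 5C 71 63


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Rebuild the broken sequence from the message: E2 5C 82 71 AC 63
interleaved = bytes(b for pair in zip(EURO, CONTROL) for b in pair)

print(is_valid_utf8(interleaved))     # False: 5C is not a continuation byte
print(is_valid_utf8(EURO + CONTROL))  # True: "€\qc"
print(is_valid_utf8(CONTROL + EURO))  # True: "\qc€"
```

A real filter would probably have to perform this check while tokenizing rather than on the whole string, so that it can report where the boundary violation occurs.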
>
> Blessings, Jaak
>
> On 21.05.2014 15:19, Chris Burrell wrote:
>> I believe some conf files have direct unicode (rather than
>> escaped sequences) in them and that is preferred.
>
>> On 20 May 2014 23:28, "Jaak Ristioja" <jaak at ristioja.ee> wrote:
>
>> I've never done BiDi, but I'm not sure I need to take that into
>> account while fixing the RTF parsing. As I currently understand
>> it, this particular piece of code does not support any part of
>> the RTF spec dealing with bidirectional text handling. Hence all
>> BiDi information contained in the configuration file strings
>> (e.g. About=) is contained either in the plain ASCII text or the
>> \u<num> Unicode escapes which this algorithm should pass through
>> unmodified.
>
>> ...except for HTML entities which should actually be escaped.
>> This bug in the algorithm I previously failed to notice.
>> Additionally I forgot that non-ASCII characters in the input
>> string should also lead to parsing failure.
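
The behaviour described in the two paragraphs above (fail on non-ASCII input, escape HTML-significant characters, leave plain ASCII and \u<num> escapes untouched) can be sketched roughly as follows. This is an illustration under assumptions, not the actual filter code: the function name is hypothetical, "HTML entities" is read here as the characters <, >, &, " and ', and RTF control word handling is left out entirely.

```python
import html


def escape_ascii_for_html(text: str) -> str:
    """Hypothetical helper: reject non-ASCII input, HTML-escape the rest.

    RTF control words are NOT interpreted here; \\u<num> escapes pass
    through unchanged because they contain no HTML-significant characters.
    """
    if not text.isascii():
        raise ValueError("non-ASCII character in input")
    return html.escape(text, quote=True)


print(escape_ascii_for_html(r"Price: \u8364 5 <approx>"))
# Price: \u8364 5 &lt;approx&gt;
```

Passing a string that contains a literal "€" raises ValueError, which corresponds to the parsing failure mentioned above.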
>
>> Jaak
>
>
>> On 20.05.2014 21:01, David Haslam wrote:
>>> Take care with Right to Left languages such as Hebrew.
>>>
>>> i.e. After any patches to the filter, please include some testing
>>> for BiDi text in the About= field and others.
>>>
>>> David
>>>
>>>
>>>
>>> -- View this message in context:
>>> http://sword-dev.350566.n4.nabble.com/RTFHTML-filter-bugs-tp4653969p4653970.html
>>> Sent from the SWORD Dev mailing list archive at Nabble.com.
>>>
>>> _______________________________________________ sword-devel
>>> mailing list: sword-devel at crosswire.org
>> <mailto:sword-devel at crosswire.org>
>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>> Instructions to unsubscribe/change your settings at above page
>>>