[sword-devel] RTFHTML filter bugs

Jaak Ristioja jaak at ristioja.ee
Wed May 21 06:16:17 MST 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

To sum up, we would need to agree on and specify an RTF subset which is
Unicode-aware (UTF-8 only?), and implement a Unicode-aware transducer
for it.
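
To make the question a bit more concrete, here is a rough sketch in
Python of what the tokenizing half of such a transducer might look
like. Nothing in it is agreed upon: it assumes strict UTF-8 input,
assumes that only ASCII letters can be part of a control word (so any
non-ASCII character acts as a delimiter), and the names CONTROL_WORD
and tokenize are made up for illustration, not taken from the SWORD
sources.

```python
import re

# Assumption: a control word is a backslash, ASCII letters only, an
# optional signed decimal parameter, and one optional delimiting space.
# [A-Za-z] and [0-9] are used on purpose so that Unicode letters and
# digits are never treated as part of a control word.
CONTROL_WORD = re.compile(r"\\([A-Za-z]+)(-?[0-9]+)? ?")


def tokenize(raw: bytes):
    """Split UTF-8 input into ("ctrl", name, param) and ("text", s) tokens."""
    # Decode first, so that character boundaries are checked before any
    # control word is looked for. Raises UnicodeDecodeError on bad input.
    text = raw.decode("utf-8")
    tokens = []
    buf = []
    pos = 0
    while pos < len(text):
        m = CONTROL_WORD.match(text, pos)
        if m:
            if buf:
                tokens.append(("text", "".join(buf)))
                buf = []
            param = int(m.group(2)) if m.group(2) else None
            tokens.append(("ctrl", m.group(1), param))
            pos = m.end()
        else:
            # Anything that is not a control word is kept as text,
            # including a backslash not followed by an ASCII letter.
            buf.append(text[pos])
            pos += 1
    if buf:
        tokens.append(("text", "".join(buf)))
    return tokens
```

With these assumptions, tokenize("\\par\u00e1".encode("utf-8")) yields
a "par" control word followed by the text "á", because "á" is not an
ASCII letter. Group handling, \u fallback characters and the HTML
output side are deliberately left out.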

On 21.05.2014 15:59, Jaak Ristioja wrote:
> So this means that actually we want non-standard RTF (someone 
> should update the wiki). Should we assume UTF-8? Are you sure we 
> don't have any modules with ISO-8859-something encoded values?
> 
> If we choose any ASCII superset encoding we have to consider at 
> least the two points:
> 
> * Since the RTF control words and delimiters are specified in ASCII
> only, we need to decide how the bytes of the superset act as
> delimiters and parts of "RTF" control words. For example, 
> whether the Unicode letter, number, spacing, punctuation, control 
> etc. characters constitute parts of RTF control words or act as 
> delimiters.
> 
> * In case of encodings where characters may consist of multiple 
> bytes (e.g. the variable-length UTF-8) we must consider the 
> character boundaries. We can't just pass through any non-ASCII byte 
> values. For example, the following bit sequence wouldn't make 
> sense:
> 
> 11100010 01011100 10000010 01110001 10101100 01100011
> 
> which is a UTF-8 encoded Euro sign, €, interleaved with bytes of 
> the ASCII string "\qc". It just doesn't make sense, whereas the 
> following sequences would be correct:
> 
> 11100010 10000010 10101100 01011100 01110001 01100011 (€\qc) 
> 01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)
> 
> So, depending on the encoding, it would be correct to detect such
> cases; otherwise we end up with invalid Unicode output.
> 
> Blessings, Jaak
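
The three byte sequences quoted above can be checked mechanically.
A minimal sketch in Python, assuming that running a strict UTF-8
decoder over the value is an acceptable way to detect broken character
boundaries (the helper name is_valid_utf8 is made up for illustration
and is not part of SWORD):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# Euro sign interleaved with the bytes of "\qc": the lead byte 0xE2 is
# followed by 0x5C, which is not a continuation byte (10xxxxxx).
interleaved = bytes([0b11100010, 0b01011100, 0b10000010,
                     0b01110001, 0b10101100, 0b01100011])

# The two orderings that keep the three bytes of the Euro sign together.
euro_then_qc = bytes([0b11100010, 0b10000010, 0b10101100,
                      0b01011100, 0b01110001, 0b01100011])
qc_then_euro = bytes([0b01011100, 0b01110001, 0b01100011,
                      0b11100010, 0b10000010, 0b10101100])

assert not is_valid_utf8(interleaved)
assert euro_then_qc.decode("utf-8") == "\u20ac\\qc"
assert qc_then_euro.decode("utf-8") == "\\qc\u20ac"
```

This only catches malformed byte sequences. It says nothing about
which characters may appear inside or next to a control word, which is
the first point above.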
> 
> On 21.05.2014 15:19, Chris Burrell wrote:
>> I believe some conf files have direct Unicode (rather than 
>> escaped sequences) in them and that is preferred.
> 
>> On 20 May 2014 23:28, "Jaak Ristioja" <jaak at ristioja.ee 
>> <mailto:jaak at ristioja.ee>> wrote:
> 
>> I've never done BiDi, but I'm not sure I need to take that into 
>> account while fixing the RTF parsing. As I currently understand 
>> it, this particular piece of code does not support any part of 
>> the RTF spec dealing with bidirectional text handling. Hence all 
>> BiDi information contained in the configuration file strings 
>> (e.g. About=) is contained either in the plain ASCII text or the 
>> \u<num> Unicode escapes which this algorithm should pass through 
>> unmodified.
> 
>> ...except for HTML entities which should actually be escaped. 
>> This bug in the algorithm I previously failed to notice. 
>> Additionally I forgot that non-ASCII characters in the input 
>> string should also lead to parsing failure.
> 
>> Jaak
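
For reference, the stricter ASCII-only behaviour described in that
earlier message could be sketched roughly like this in Python. It is
one reading of that message and not the actual filter code: it assumes
that a non-ASCII character is reported as an error, that the HTML
special characters are escaped, and that everything else (including
any \u<num> escapes) is left exactly as it is. The name
escape_ascii_value is hypothetical.

```python
import html


def escape_ascii_value(value: str) -> str:
    """Escape HTML special characters in an ASCII-only string.

    Raises ValueError if the input contains a non-ASCII character.
    Backslash sequences such as \\u<num> are not interpreted here.
    """
    if not value.isascii():
        raise ValueError("non-ASCII character in input")
    # html.escape replaces &, < and >, and with quote=True also the
    # double and single quote characters.
    return html.escape(value, quote=True)
```

For example, escape_ascii_value("a < b & c") returns
"a &lt; b &amp; c", while a value containing "€" raises ValueError.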
> 
> 
>> On 20.05.2014 21:01, David Haslam wrote:
>>> Take care with Right to Left languages such as Hebrew.
>>> 
>>> i.e. After any patches to the filter, please include some testing
>>> for BiDi text in the About= field and others.
>>> 
>>> David
>>> 
>>> 
>>> 
>>> -- View this message in context:
>>> http://sword-dev.350566.n4.nabble.com/RTFHTML-filter-bugs-tp4653969p4653970.html
>>> Sent from the SWORD Dev mailing list archive at Nabble.com.
>>> 
>>> _______________________________________________ sword-devel 
>>> mailing list: sword-devel at crosswire.org
>>> <mailto:sword-devel at crosswire.org>
>>> http://www.crosswire.org/mailman/listinfo/sword-devel 
>>> Instructions to unsubscribe/change your settings at above page
>>> 
> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQgcBAEBAgAGBQJTfKcZAAoJELozJlbjIn79QeRAAIuemi7ZxYbt+fLCKjmJq5eF
Twas8zabBkm55uco6lFZ+gQaE51i7UFBR9zoVeZqC3PBXHylD1Vaki2jcFIJWEuQ
7rmw8o1YM0q/dAobuqVbHnxpzpEbPXWEajhipb4B91BNYQJWqNzo3bx/y0RVeV/8
QkYj4CmXG0DB5oAnzaq54ZJr3pbYms/kEwhbMSe+lQUiIjAsuTa3glJgCfBt9QA2
b0fkyVm3M85DPQD9Qn7Iucbb38UirTwVjxFWt4ds5oWSuSWCyJKl0/TSN2IV1zqg
NDH+TiNQ/SekimeX4F8/wNMV5pfNhmE+wdiRfaCXLO9a6GY7bx2IlOZVbsz9fB6L
6AB5LQqBYd1YY/3vRnFNHXgX7GP6iEqkp36xMFDH4dTcDp8o87Xh4T5ugumb2io2
i6IUwYO6jhEKVEMSpD5qiexYrBWoJYAsCXSyNA7N+Aw+ktOi70aOOPD/vAsPpqMI
1kFss3zc9fmDGVMmy43R5TKTy1qB3tpTqrHyWFZP8xSEJJt2DbqszC2HAr8KTmea
vz6d/Xb58qLhuBOHKP2Dhcr9BZ10RhrWj/q9mBv6nWzo6xE4H7PySpTy64LciGRW
ZNoVS3DfBOPTaS3jmJdduBtSSdI3huVfeQlRBkMDutd8QdEfECgvyXYX/5Q+pP4C
APT6gRvDSSzwg98H0m+9REa7kOcXbaaBH1A6aiUhU8PE7B4xqCSU7vwAx7xGwtb9
4AmYmPi+ei3p6BxY42bAihnfz9BDSjOETUOx0/psF2Cv+Pvg4PGH/vRcSSesriGX
ouGZ57p4Q77XsITtI8umFi1tvgyAHmigO5M1Cd2gSLtXlB1laSum3K3vuHvF4r16
WKnlVlFJ2eEf2IP3TGT7bUg2PdgaAQoY2F1L8FHLZDup0iZrzXD73CA4vcfZRUSg
IpkZDHEQbRA0ZUID9ACDGWo/KXxI/EtgWdLKXDCcX2fBDuzEWKRjItnCE/IsU+gG
sxVJwsR3eKK4eU0bD1fHqnb9kVhwuGqgOL2YYmthKiX6gXddR4nIpFIOjztAar4p
P/WWSIa44P3yNXvVeD/J3kpNABMzart+mW8Hw0jqRi6SfxSAi/2NZOk5Pq1bASCY
sE9nYGCCLVbrZCOLorg/XlwXfE2ltrteU/5MXvX8CqezDcq1x+4ljWyqCJ/SkaW6
NRxUzcP+DIdm0yHh+HCzGRtqVyn3TnbOeX6vzbACQMxh9iMP1Dzy9Vr0S8J1Vi0w
iox6Hg51jhtdFOrrZDGm/rTnIxVe7fkUiHDa0bWtVED044P74CNL46yRrdlnyyWV
sBVwpjZQ/UwpdRrb4R5wtoYv1DT8YjNyfIGIkheH92hIhHZvz9RkzW8cWpw8DRzU
BrqvWjyCEF8evu5C3/KNPMLgz6w86Fqonma8BmM+MdriV0/lmshLVwtLOHXzKaLW
XspcL85Ar3MKYI1GO7pbBIw9uZ2ki4hJxKqHKeB3fBsdBue36koIMtN6lklsBHmd
2hrbNSDalD1/NBCdZ5/5yck32bx4gqTvbs/lC2eyITEoDHm4qoJx5MLpBq5HE2tW
caSuCElSi4xqZpJjcjjGB4RNQkIJv89IRAUNivlVRzQreMumKMhFM+npH4AYKxyD
0VACjXN2QxKa5+UaZXPicdUhoTuYJlYoXhhz1+pszfwmBIPrweKnr81a2HvI3Ytg
xiR1GJZQlnDZKYFqZWMhZAaiSWk452On072KdzMciZ5UrGGp/Za+kA0+W7xvZGA9
TimH2Y3fU17ZvYXBIaugqxyt2G0usjxv14n/pWQFJDaj26TnXpdEuQHI+mDIwsE0
kokoStgcFEqYYRshy6BZncv3ksKSmNhBc4eqbOq+Z37CGT1HHmUzUvp4JWyZJza7
1NJzteZ2X/gjJXqcZPIW0LROz2EESSo7wNmBCBH/GYQ/Fy4tCjwg2YYyM8Bj98WM
zHPlnwyEcxYPGDWRddzoPOjo2nejtvLju0RnQP1+y41qt/5hpLWnagIZL/waAJ4R
/clCrvzZjRfvmHaecc+9rJm6ZTSC//HwP2Pf8FYg8eEDOGcVGRthTFvmENguf1xw
e9P0Mjvt0rkkGZOhdqWxFE6N1Z7cB2kmhLGHViTK4GeB40YqEwvoI2flzbgeWC+j
SHmiPerkPasPwMDiGwwLzTcbKfyoYr1KqAzz1ZelFNKgP9lWVfHIS+EkEghwPKV/
STvE53H2wYzlkPF4Cg+2RClKMmbWwuAPFXwgl/4/7xE9cGTbe7cjd/CDJ6qpH4tR
GcwOPIKY9UJjXfA5imtyoy3amjw+T8PtsgVz2uGAanhzUcpHMObib5pYODWuRodN
YJZKkzGhTpzJwP9fg+S5ugQ1tdOPXsu/kAOkv405VDwvzrMR8hDqIv+PCAroA/Lp
Y2YQPfrz9tHz19L44k2eVbeD2bpVt97vMIMWhEBkiYpkst2xyZ5TYlXO8SEXBh6C
Y4uHI82j5d4uZcK+Ux9fEKe7hqnl1txDnB7t9L9nfINM7BYjq68A351xH53mD8Bm
G64IeAyT6DscfLsy48yViNeADo+ncig8H/gctEeA7yZJccAqcQvgAoW9BBGabnNi
y3OMenIqOBERqHdNZ0Ne6Av/42s0/H2RLREUFJtOqHB2IwjdhfjIHIHhrZliSWAG
EnRIJVQrZHQIWB1ZFFFJiPTdQWmyNBIFAivbwWtEuDZa7TO+Z09m7R+DyolP0GPm
Gyf7bzEMhlF5PuGlXeya
=toe/
-----END PGP SIGNATURE-----


