[sword-devel] RTFHTML filter bugs

Jaak Ristioja jaak at ristioja.ee
Thu May 22 01:13:13 MST 2014


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I don't think I understand what you're saying. Should the frontend
read the configuration twice, or should Sword? Why this complexity?!
Do you mean that:

 * Sword reads the entire configuration file as CP1252 encoded
   * On failure, it re-reads the configuration file as UTF-8 encoded

???

If this is the case, then it is error-prone (even when reading only
parts of the configuration), because CP1252 and UTF-8 overlap. Data
intended as UTF-8 might well be accepted by a perfectly correct
CP1252 encoding checker as valid CP1252, in which case the failure
that triggers the UTF-8 re-read never occurs.
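
To illustrate the overlap, here is a minimal Python sketch. The helper
name decodes_as_cp1252 is made up for this example, and Python's built-in
cp1252 codec merely stands in for whatever checker Sword might use. It
shows a UTF-8 encoded Euro sign passing a strict CP1252 check:

```python
def decodes_as_cp1252(data: bytes) -> bool:
    """Return True if a strict CP1252 decoder accepts the bytes."""
    try:
        data.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


# UTF-8 encodings of the Euro sign and of a capital A with acute accent.
euro_utf8 = "\u20ac".encode("utf-8")      # bytes E2 82 AC
a_acute_utf8 = "\u00c1".encode("utf-8")   # bytes C3 81

# The Euro sign is accepted as CP1252 and silently becomes three characters.
print(decodes_as_cp1252(euro_utf8))
print(ascii(euro_utf8.decode("cp1252")))

# This one is rejected, but only because 0x81 is unassigned in the codec.
print(decodes_as_cp1252(a_acute_utf8))
```

In Python's codec only the five unassigned byte values (0x81, 0x8D, 0x8F,
0x90, 0x9D) cause a failure, so a "fall back on failure" strategy would
depend on whether the text happens to contain one of them.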

Jaak

On 21.05.2014 17:45, DM Smith wrote:
> The encoding of the conf is either cp1252 (the default, though it is
> called Latin-1) or UTF-8. The encoding of the conf matches that of the 
> module. This may cause the conf to be read twice, once for the 
> default and once for UTF-8, if the module encoding is set to 
> UTF-8.
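
One possible reading of the rule above, sketched in Python under stated
assumptions: decode_conf is a hypothetical helper and not Sword's API, and
the conf is assumed to declare its encoding with an Encoding= key, whose
ASCII bytes look the same in CP1252 and UTF-8. The second read is driven
by the declared encoding rather than by a decoding failure:

```python
def decode_conf(raw: bytes) -> str:
    """Hypothetical two-pass decode of a module conf (not Sword's code)."""
    # First pass: lenient CP1252, used only to find the ASCII-only key.
    first_pass = raw.decode("cp1252", errors="replace")
    encoding = "cp1252"  # assumed default, per the message above
    for line in first_pass.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "Encoding" and value.strip().upper() == "UTF-8":
            encoding = "utf-8"
            break
    # Second pass: strict decode with the encoding the conf declares.
    return raw.decode(encoding)


utf8_conf = "[Mod]\nEncoding=UTF-8\nAbout=\u20ac\n".encode("utf-8")
latin_conf = b"[Mod]\nAbout=caf\xe9\n"
print(ascii(decode_conf(utf8_conf)))
print(ascii(decode_conf(latin_conf)))
```

A conf that declares the wrong encoding still decodes incorrectly or
raises in the second pass; this sketch does not try to repair such files.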
> 
> There have been confs that are incorrect with regard to this rule.
> 
> In Him, DM
> 
> On May 21, 2014, at 8:59 AM, Jaak Ristioja <jaak at ristioja.ee> wrote:
> 
> So this means that actually we want non-standard RTF (someone 
> should update the wiki). Should we assume UTF-8? Are you sure we 
> don't have any modules with ISO-8859-something encoded values?
> 
> If we choose any ASCII superset encoding we have to consider at 
> least these two points:
> 
> * Since the RTF control words and delimiters are specified in ASCII
> only, we need to decide how the bytes of the superset act
> as delimiters and parts of "RTF" control words. For example, 
> whether the Unicode letter, number, spacing, punctuation, control 
> etc. characters constitute parts of RTF control words or act as 
> delimiters.
> 
> * In case of encodings where characters may consist of multiple 
> bytes (e.g. the variable-length UTF-8) we must consider the 
> character boundaries. We can't just pass through any non-ASCII byte 
> values. For example, the following bit sequence wouldn't make 
> sense:
> 
> 11100010 01011100 10000010 01110001 10101100 01100011
> 
> which is a UTF-8 encoded Euro sign, €, interleaved with bytes of 
> the ASCII string "\qc". It just doesn't make sense, whereas the 
> following sequences would be correct:
> 
> 11100010 10000010 10101100 01011100 01110001 01100011 (€\qc) 
> 01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)
> 
> So, depending on the encoding, it would be correct to detect such
> cases; otherwise we end up with invalid Unicode output.
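
The three byte sequences above can be checked with a short Python sketch.
The helper name is_valid_utf8 is an assumption for this example, and
Python's strict UTF-8 decoder stands in for whatever validation the filter
would perform:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Euro sign bytes interleaved with the bytes of "\qc".
interleaved = bytes([0b11100010, 0b01011100, 0b10000010,
                     0b01110001, 0b10101100, 0b01100011])
# Euro sign followed by "\qc", and "\qc" followed by the Euro sign.
euro_then_qc = bytes([0b11100010, 0b10000010, 0b10101100,
                      0b01011100, 0b01110001, 0b01100011])
qc_then_euro = bytes([0b01011100, 0b01110001, 0b01100011,
                      0b11100010, 0b10000010, 0b10101100])

for name, seq in [("interleaved", interleaved),
                  ("euro_then_qc", euro_then_qc),
                  ("qc_then_euro", qc_then_euro)]:
    print(name, is_valid_utf8(seq))
```

The interleaved sequence fails because the lead byte 11100010 announces
two continuation bytes, yet the next byte is the ASCII backslash.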
> 
> Blessings, Jaak
> 
> On 21.05.2014 15:19, Chris Burrell wrote:
>>>> I believe some conf files have direct unicode (rather than 
>>>> escaped sequences) in them and that is preferred.
>>>> 
>>>> On 20 May 2014 23:28, "Jaak Ristioja" <jaak at ristioja.ee> wrote:
>>>> 
>>>> I've never done BiDi, but I'm not sure I need to take that 
>>>> into account while fixing the RTF parsing. As I currently 
>>>> understand it, this particular piece of code does not
>>>> support any part of the RTF spec dealing with bidirectional
>>>> text handling. Hence all BiDi information contained in the 
>>>> configuration file strings (e.g. About=) is contained either 
>>>> in the plain ASCII text or in the \u<num> Unicode escapes, which 
>>>> this algorithm should pass through unmodified.
>>>> 
>>>> ...except for HTML entities, which should actually be
>>>> escaped. I previously failed to notice this bug in the
>>>> algorithm. Additionally, I forgot that non-ASCII characters in
>>>> the input string should also lead to parsing failure.
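
A minimal Python sketch of the two checks mentioned above, for
illustration only. The function name passthrough_text is hypothetical and
this is not the filter's actual code; it also assumes that "escaping HTML
entities" means escaping the characters &, < and > in text that is passed
through, and it does not parse RTF control words at all:

```python
import html


def passthrough_text(raw: bytes) -> str:
    """Hypothetical helper: reject non-ASCII input, escape HTML specials."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as exc:
        # Non-ASCII bytes are treated as a parsing failure.
        raise ValueError("non-ASCII byte in RTF input") from exc
    # Escape &, < and > so plain text cannot be read as HTML markup.
    return html.escape(text, quote=False)


print(passthrough_text(b"Tom & Jerry <b>"))
```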
>>>> 
>>>> Jaak
>>>> 
>>>> 
>>>> On 20.05.2014 21:01, David Haslam wrote:
>>>>> Take care with Right to Left languages such as Hebrew.
>>>>> 
>>>>> i.e. After any patches to the filter, please include some 
>>>>> testing for BiDi text in the About= field and others.
>>>>> 
>>>>> David
>>>>> 
>>>>> 
>>>>> 
>>>>> -- View this message in context:
>>>>> http://sword-dev.350566.n4.nabble.com/RTFHTML-filter-bugs-tp4653969p4653970.html
>>>>> Sent from the SWORD Dev mailing list archive at Nabble.com.
>>>>> 
>>>>> _______________________________________________
>>>>> sword-devel mailing list: sword-devel at crosswire.org 
>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel 
>>>>> Instructions to unsubscribe/change your settings at above page
>>>>> 

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)

iQgcBAEBAgAGBQJTfbGVAAoJELozJlbjIn79zn4//3Jx81Qgjoj22zshBizjqjrM
Liky9QigioZFvoqTSdCp3E51S7ruYhK0CdKl44OL+/66RbeflTbvu/YPUkJswB8Y
lb/7e5HKUrVTVB2/pIU0OeRBFK0YLZl8JyupsHg6oidBTHt1yt5TMJMv1TeXaJYs
cYh4QwPH7Cn5yH2EzfVW9rSeUKyOwDSAWM4f3DyvsAKyIIHkZyZf3DtxhY6T81/4
FB8jCYq3Jrj3jihVOe9rjRafBmIGDXuQWmT4zlwmoZrXa7MrPdx2Cxmaa4rUu98c
AK5HDS7sD/LJslxYCmsMV3VXxdG4UMeM+/oLrl237Uh1vRjALtAx9rads1j/brtV
eNAoWfSNJDf3AHZW3CrHF5yiO8bTPUh6AdpNsQtfwg2FK4kF1EfZTW6lwRH/7HES
Z2TUYRATwpTUinRZxlF3CUQCdhldNQXFk2yEBmWr1ZtziPRd+3bqZBOmg1qSjN1/
PmqOS7Vxfsw1f7OvFdnFN03KAt2C0Rqo0OBSFgujJbb08PdvdZFIfUldnBXL5Slf
AQgOQpMpP4nX0V8S+GA4k+oQBxMYg7Ow3BWyj2ugc9PZ3wR07oeB91Mi+uEQIUK4
fdhIE3POwoeGYMuQoq6CvcGQ+fq4piNETnwGEKU2Gxi8yrGmLwbUl861Nx4VW6ar
y91D9n0Yiror3ziuAqmfp3PwIQjBcxsFev4HAZw+N7uXSR8WUGpPhmW+Fv5ulhHy
fkzNe8dTvY7qYebjLbD73nLLleyLp1CC+MnJ/pPvV59WyqxOT2s37ar97u5Ktqan
3NUvq9DxNB2A9W7PN20v61kxSbFvaWjKMvbXfpN+qvvLqHf0wfAS2o6Y8/JzuHrO
wsQNNgCXyugzRv1nIyP5ZjPTo9fcOUNxp+JmC60HpbKtElYD8e5DQQjNovcj7iTu
1zZgux2tSnc++pILLdu0XLeFOM0YO10wsYUt3uyKW6ldmpfKOzwYDZK1/2IIc40F
Y4wGZLTGayOV/H5LWbFszdyTIee678YJIT/rz9nxxxZMDO9F6ZfvBTZ3zolyE9/7
/lO4VOy7vSZZRsy5ecfSsApYVugNgYBy7KED2zAl/65DwPPLOw3y9OUhAWxxJ1hl
WOetXDilRCrlHrHQx88f5fhtYwNga1+Qv9rMJy6/gsQclSNs7AQ/bweGil8o4jqN
e59YGRgOou5k9eW9wY+RAGz6QvKN2qtq3djIn/5UudHI9NDi9lvkvGttURceOYCM
Is3r21LZvgKQorAtOumxienhauK31QmmO1qQcoKE07N+/4CiMCAPfSUE/E75mA2B
j81+hPt5/R4FLfa42hN6evL3286Al+7zYcB4VEfAWHzHUT4psNqJG5B5PdtkA+zA
TbmOgqkrgYmfA37PBLvAxpps0Zn2EZ+JtH/dcznijOMeiUmk59L+rxM9nzjXsJ2B
RzuhklK2h68Y/9G0CAki917l8UWz/S113+IsYCkfvo++EZHMmjLjktkKrkMGYhlQ
eppDE3cYKEEsLKHquMj4dMJdrjc7GOpYyUd8JETlWyHF13Zy7m7MgyWihDJf3Mre
g1axaEueASaA+MU3VPV2e/uiWphBRWmo07Ye8mnIC2O0Fnxzx5/YwYKFJK8bjVDy
iEH4rDohPoJENBJKV7hUyU3D89+pzUlOGKRTqWY2HQpOc9Hhd4GBfvvfbB3HAhYg
miWImi7Itx7h3VuuVbCCcZr6EucHD8uKPFsUjN1eqkEq9GyV4hj37MxN+1taGyZi
8yIYoHBa/OcHMWq+Wg85XC+IAYyNYxGEq0D07Ap3SabASw3B8D1FpjhfXi/ZqLMr
cgLIDNF6Gecm8Gq+Fdd4mA/Rhukavu8Kh1l1QUSTvdK6iV6a2RvWVW9WmEdrIpmK
Ko++rRUdCXBVpg8m9Wx6U16+6k2heYyvWeE4iqiuAWxM6d6SDMMOZpWGF1EJwzVP
bScm+PuiJi88CMcIBnap4YYzJc9BDpORz6ca/S9s0Z6Q53kdzc3pK2AJ2W2lIpJL
jFxAEdRBZBIHT+93clejyA3TXeSHUNvF6w+CBjcgDf4f+HOeB3KrcyjwEzpKZZjG
D5IxfoxQyR2oHp8JfFb65YFvRJ8Tm1U3SsrtODDxReHqZ9WTaH1DjScLpuOe0K87
ikK/CU9M0ipMLcdjn/VU312Qz+qSze1vRJz2J58GX/gjVyi773ccm7mhzdZ+EzbD
e6XsGH0poUXyyNSL4R2YGyDlegacZbAd5J+HlLFmN+9Ln8JAviP5lCMr/D1QokmU
BlW9WiKxVU72FxwO6Ohu432iFhLhhsGGVzkxvaiRzcIzf/b3A0neTp3qvKtZWeOG
v+XjxWw1Pz5ZzVp202t5jDZ/9CGl/wLbpVwdp4OUo5L+VMUXoXXApiEfpAA2mfBC
0J5CrKc5ywMMoOAiHyi6ZDQ3d51P4YT0fZyqgZIBSNrVUIGgf6bgTEEVB1e1uXkY
Ht4JoSVEmVNT60V2mMurJSGvFbYgMNmakCktv4i+P/tHDF05oXx1gmh2td1/Xqxz
pFe2PWPKEITsDr8MkpzZ/evDKfZcfxnx/HI6GSd1joXEiqcI8DMwfI8TUMRVXppy
EsyOxOGFdlex1WzCqXTH3HHja3Dm+IC2ery9ohcyTY4LYEYSVkfsJEtz5zOamzUy
P/FztoIp0sO7vKDOxMso8YIESMly/6wOjd9zvuUGtrsgtKd32WvpizaQK3uuNS3x
5bAjQAWdEcD9uL5JF9zl
=wEfl
-----END PGP SIGNATURE-----


