[sword-devel] RTFHTML filter bugs
DM Smith
dmsmith at crosswire.org
Thu May 22 04:18:53 MST 2014
The Encoding field determines the encoding of the file. When it is not present, use the default.
The front end should never read the file; doing the reading is the engine's responsibility. What may need to be processed twice is not the file itself but the byte stream/buffer read from it. How the engine gets the byte stream/buffer for the second (failure) case is its own business.
It could *always* read it twice: the first time as binary, to read the ASCII content of the Encoding= field, and the second time to do the charset conversion. But I'm not recommending that.
Btw, I work on JSword, which parses as it reads the stream from the file. If the conf is UTF-8, it rewinds the stream and rereads it. It is not error prone.
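A minimal sketch of the "one buffer, interpreted a second time if needed" idea, written in Python rather than SWORD's C++ or JSword's Java. The function name decode_conf and the regular expression are assumptions made for illustration, not code from either engine. It relies only on the point above that the Encoding= line itself is plain ASCII, so it can be found before any charset conversion is done.

```python
import re

DEFAULT_ENCODING = "cp1252"  # the conf default (often loosely called Latin-1)

# Key name and value of the Encoding= line are ASCII, so a byte-level
# search gives the same result whether the file is CP1252 or UTF-8.
ENCODING_LINE = re.compile(rb"^Encoding\s*=\s*(\S+)", re.MULTILINE | re.IGNORECASE)


def decode_conf(raw: bytes) -> str:
    """Decode the bytes of a conf file that were read from disk once.

    Hypothetical helper: it looks for an Encoding= line in the raw bytes
    and decodes the same buffer as UTF-8 only when that line says so.
    """
    match = ENCODING_LINE.search(raw)
    declared = match.group(1).decode("ascii", errors="replace").upper() if match else ""
    if declared == "UTF-8":
        return raw.decode("utf-8")
    # Any other value, or no Encoding= line at all: use the default.
    # Bytes that are undefined in CP1252 are replaced instead of raising.
    return raw.decode(DEFAULT_ENCODING, errors="replace")
```

With this shape the file is read once, and the decision is made from the declared Encoding= value rather than from whether a CP1252 decode happens to fail.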
This complexity exists because that's the way it is, and we need to support legacy confs.
> On May 22, 2014, at 4:13 AM, Jaak Ristioja <jaak at ristioja.ee> wrote:
>
> I think I don't understand what you're saying. The frontend should
> read the configuration twice? Sword should? Huh? I don't understand.
> Why this complexity?! Do you mean that:
>
> * Sword reads the entire configuration file as CP1252 encoded
> * On failure, re-read the configuration file as UTF-8 encoded
>
> ???
>
> If this is the case, then this is error prone (even when reading only
> parts of the configuration), because CP1252 and UTF-8 overlap. Hence
> data encoded as UTF-8 might be parsed correctly as valid CP1252, even
> though it was intended to be UTF-8. I mean I find it likely that valid
> UTF-8 strings might be accepted by a perfectly correct CP1252 encoding
> checker as valid CP1252.
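The overlap described here can be shown with a short Python sketch. The helper name decodes_as_cp1252 is made up for illustration, and Python's built-in "cp1252" codec stands in for "a perfectly correct CP1252 encoding checker"; that codec rejects only the five byte values that CP1252 leaves undefined.

```python
def decodes_as_cp1252(data: bytes) -> bool:
    """Return True if every byte in data is a defined CP1252 character."""
    try:
        data.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    euro_utf8 = "\u20ac".encode("utf-8")   # the bytes E2 82 AC
    # The UTF-8 bytes are accepted as CP1252, but as three wrong characters.
    print(decodes_as_cp1252(euro_utf8))
    print(euro_utf8.decode("cp1252"))
```

So a successful CP1252 decode says very little about whether the data was meant to be UTF-8; a failure can only come from one of the few undefined byte values.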
>
> Jaak
>
>> On 21.05.2014 17:45, DM Smith wrote:
>> The encoding of the conf is either cp1252 (the default, but called
>> latin 1) or utf-8. The encoding of the conf matches that of the
>> module. This may cause the conf to be read twice, once for the
>> default and once for UTF-8, if the module encoding is set to
>> UTF-8.
>>
>> There have been confs that are incorrect with regard to this rule.
>>
>> In Him, DM
>>
>> On May 21, 2014, at 8:59 AM, Jaak Ristioja <jaak at ristioja.ee
>> <mailto:jaak at ristioja.ee>> wrote:
>>
>> So this means that actually we want non-standard RTF (someone
>> should update the wiki). Should we assume UTF-8? Are you sure we
>> don't have any modules with ISO-8859-something encoded values?
>>
>> If we choose any ASCII superset encoding we have to consider at
>> least two points:
>>
>> * Since the RTF control words and delimiters are specified in ASCII
>> only, we need to decide how the bytes of the superset act
>> as delimiters and parts of "RTF" control words. For example,
>> whether the Unicode letter, number, spacing, punctuation, control,
>> etc. characters constitute parts of RTF control words or act as
>> delimiters.
>>
>> * In case of encodings where characters may consist of multiple
>> bytes (e.g. the variable-length UTF-8) we must consider the
>> character boundaries. We can't just pass through any non-ASCII byte
>> values. For example, the following bit sequence wouldn't make
>> sense:
>>
>> 11100010 01011100 10000010 01110001 10101100 01100011
>>
>> which is a UTF-8 encoded Euro sign, €, interleaved with bytes of
>> the ASCII string "\qc". It just doesn't make sense, whereas the
>> following sequences would be correct:
>>
>> 11100010 10000010 10101100 01011100 01110001 01100011 (€\qc)
>> 01011100 01110001 01100011 11100010 10000010 10101100 (\qc€)
>>
>> So, depending on the encoding, it would be correct to detect such
>> cases; otherwise we end up with invalid Unicode output.
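The three bit sequences quoted above can be checked mechanically. This is a small Python sketch, assuming a strict UTF-8 decoder is an acceptable way to detect broken character boundaries; the name is_valid_utf8 and the constant names are illustrative only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The Euro sign's three bytes interleaved with the ASCII bytes of "\qc".
INTERLEAVED = bytes([0b11100010, 0b01011100, 0b10000010,
                     0b01110001, 0b10101100, 0b01100011])

# The two orderings that keep the multi-byte character intact.
EURO_THEN_QC = bytes([0b11100010, 0b10000010, 0b10101100,
                      0b01011100, 0b01110001, 0b01100011])
QC_THEN_EURO = bytes([0b01011100, 0b01110001, 0b01100011,
                      0b11100010, 0b10000010, 0b10101100])

if __name__ == "__main__":
    for name, seq in [("interleaved", INTERLEAVED),
                      ("euro then \\qc", EURO_THEN_QC),
                      ("\\qc then euro", QC_THEN_EURO)]:
        print(name, is_valid_utf8(seq))
```

The interleaved sequence is rejected because the lead byte 11100010 must be followed by two continuation bytes of the form 10xxxxxx, and 01011100 (the backslash) is not one.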
>>
>> Blessings, Jaak
>>
>> On 21.05.2014 15:19, Chris Burrell wrote:
>>>>> I believe some conf files have direct unicode (rather than
>>>>> escaped sequences) in them and that is preferred.
>>>>>
>>>>> On 20 May 2014 23:28, "Jaak Ristioja" <jaak at ristioja.ee
>>>>> <mailto:jaak at ristioja.ee> <mailto:jaak at ristioja.ee>> wrote:
>>>>>
>>>>> I've never done BiDi, but I'm not sure I need to take that
>>>>> into account while fixing the RTF parsing. As I currently
>>>>> understand it, this particular piece of code does not
>>>>> support any part from the RTF spec dealing with bidirectional
>>>>> text handling. Hence all BiDi information contained in the
>>>>> configuration file strings (e.g. About=) is contained either
>>>>> in the plain ASCII text or the \u<num> Unicode escapes which
>>>>> this algorithm should pass through unmodified.
>>>>>
>>>>> ...except for HTML entities which should actually be
>>>>> escaped. This bug in the algorithm I previously failed to
>>>>> notice. Additionally I forgot that non-ASCII characters in
>>>>> the input string should also lead to parsing failure.
>>>>>
>>>>> Jaak
>>>>>
>>>>>
>>>>>> On 20.05.2014 21:01, David Haslam wrote:
>>>>>> Take care with Right to Left languages such as Hebrew.
>>>>>>
>>>>>> i.e. After any patches to the filter, please include some
>>>>>> testing for BiDi text in the About= field and others.
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> sword-devel mailing list: sword-devel at crosswire.org
>>>>>> <mailto:sword-devel at crosswire.org>
>>>>> <mailto:sword-devel at crosswire.org>
>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>> Instructions to unsubscribe/change your settings at above
>>>>>> page
>
>