[sword-devel] usfm2osis.py

Sun Aug 5 19:20:41 MST 2012

On 8/5/2012 5:28 PM, Greg Hellings wrote:
> On Sun, Aug 5, 2012 at 7:19 PM, Chris Little <chrislit at crosswire.org> wrote:
>>
>>
>> On Aug 5, 2012, at 11:37 AM, David Haslam <dfhmch at googlemail.com> wrote:
>>
>>> FWIW, I just came across this Python Regular Expression Testing Tool:
>>> http://www.pythonregex.com/
>>>
>>> Does Python support the full 21-bit Unicode range?
>>>
>>> cf. Many other regular expression engines only support the Basic
>>> Multilingual Plane.
>>>
>>
>> Yes, Python regex supports non-BMP characters. The language tags are Plane 14, I believe. An engine that supports only the BMP can't be said to support Unicode and is probably just processing bytes.
>>
>
> As further explanation, Python differentiates between "string"
> objects, which are 8-bit byte sequences representing text in some
> selected encoding, and "unicode" objects, which are strings of Unicode
> characters. The exact internal representation probably differs between
> CPython and Jython. CPython used to use UCS-2 but can now be built with
> either UCS-2 or UCS-4, since Unicode was extended beyond the BMP.
>
> To read more details see
> http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
> under the heading "Internal Representation".

Oh. Well, that's annoying.

To see whether your Python interpreter is compiled with UCS-2 or UCS-4, 
you can run this from the interpreter:

import sys
sys.maxunicode

If it returns 65535, the build uses UCS-2. If it returns 1114111, UCS-4.
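
If a script needs to branch on this, the same check can be wrapped in a
small helper. This is only a sketch: the function name
describe_unicode_build is made up for illustration and is not part of
usfm2osis.py. Note also that on Python 3.3 and later sys.maxunicode is
always 1114111, so the narrow case only arises on older interpreters.

```python
import sys


def describe_unicode_build(maxunicode=None):
    """Classify the interpreter build from sys.maxunicode.

    The optional argument exists only so the logic can be exercised
    without access to a narrow build (hypothetical helper).
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:        # 65535
        return "narrow (UCS-2)"
    if maxunicode == 0x10FFFF:      # 1114111
        return "wide (UCS-4)"
    return "unknown"


print(describe_unicode_build())
```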

Linux packagers apparently go the UCS-4 route, so I didn't notice any
issue with using the Language Tags. But trying the above on Windows
shows that the Cygwin build and the builds from python.org (2.7 & 3.2)
all use UCS-2. So my script won't work correctly on Windows.

Not to worry, though. I'll just replace the Language Tags with 
Noncharacters in the range U+FDD0-U+FDEF. They're UCS-2-safe since
they're BMP codepoints and they're specifically designated as "intended 
for process-internal uses, but are not permitted for interchange." So in 
the unlikely event that they appear in input, it's the fault of the 
USFM-encoder if anything goes awry.
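
As a rough illustration of the idea (not the actual code in
usfm2osis.py), internal markers drawn from U+FDD0-U+FDEF might be used
as below. The names MARK_OPEN, MARK_CLOSE, has_noncharacters, mark and
strip_markers are all hypothetical, as is the choice of which two
noncharacters to use.

```python
import re

# U+FDD0..U+FDEF are BMP noncharacters, so each is a single code unit
# even on a UCS-2 build.
NONCHAR_RE = re.compile(u'[\uFDD0-\uFDEF]')

# Hypothetical internal markers; any two distinct noncharacters would do.
MARK_OPEN = u'\uFDD0'
MARK_CLOSE = u'\uFDD1'


def has_noncharacters(text):
    """True if the text already contains a U+FDD0..U+FDEF code point."""
    return NONCHAR_RE.search(text) is not None


def mark(text):
    """Wrap text in the internal open/close markers."""
    return MARK_OPEN + text + MARK_CLOSE


def strip_markers(text):
    """Remove every U+FDD0..U+FDEF code point before output."""
    return NONCHAR_RE.sub(u'', text)


raw = u'In the beginning'
if has_noncharacters(raw):
    # Input should never contain these; report rather than guess.
    raise ValueError('input contains noncharacters U+FDD0-U+FDEF')
marked = mark(raw)
print(strip_markers(marked) == raw)
```

Checking the input up front is optional, but it makes the "unlikely
event" visible instead of silently corrupting the markers.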

We'll have to watch for input outside of the BMP on UCS-2 Python,
though, since such characters are stored there as surrogate pairs
and that could cause problems.
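
One possible way to watch for such input is sketched below. It assumes
the text has already been decoded to a unicode string, and the function
name has_non_bmp is hypothetical. On a UCS-4 build a non-BMP character
is a single code point above U+FFFF; on a UCS-2 build it shows up as two
surrogate code units, so the check looks for either. As a side effect it
also flags lone surrogates on a UCS-4 build.

```python
def has_non_bmp(text):
    """True if text contains a non-BMP character or a surrogate code unit."""
    for ch in text:
        cp = ord(ch)
        # Above U+FFFF: a non-BMP code point (wide build).
        # U+D800..U+DFFF: a surrogate code unit (narrow build).
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            return True
    return False


# U+E0001 LANGUAGE TAG is in Plane 14, outside the BMP.
print(has_non_bmp(u'a\U000E0001b'))
# U+FDD0 is a BMP noncharacter, so it does not trigger the check.
print(has_non_bmp(u'a\uFDD0b'))
```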

--Chris