[sword-devel] usfm2osis.py
Chris Little
chrislit at crosswire.org
Sun Aug 5 19:20:41 MST 2012
On 8/5/2012 5:28 PM, Greg Hellings wrote:
> On Sun, Aug 5, 2012 at 7:19 PM, Chris Little <chrislit at crosswire.org> wrote:
>>
>>
>> On Aug 5, 2012, at 11:37 AM, David Haslam <dfhmch at googlemail.com> wrote:
>>
>>> FWIW, I just came across this http://www.pythonregex.com/ Python Regular
>>> Expression Testing Tool
>>>
>>> Does Python support the full 21-bit Unicode range?
>>>
>>> cf. Many other regular expression engines only support the Basic
>>> Multilingual Plane.
>>>
>>
>> Yes, Python regex supports non-BMP characters. The language tags are Plane 14, I believe. An engine that supports only the BMP can't be said to support Unicode and is probably just processing bytes.
>>
>
> As further explanation, Python differentiates between the "string"
> object, which is an 8-bit encoded representation of text in some
> chosen encoding, and "unicode" objects, which are strings of Unicode
> characters. The exact internal representation probably differs between
> CPython and Jython. CPython used to use UCS-2 but can now be built with
> either UCS-2 or UCS-4, since Unicode was extended beyond the BMP.
>
> To read more details see
> http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python
> under the heading "Internal Representation".
Oh. Well, that's annoying.
To see whether your Python interpreter is compiled with UCS-2 or UCS-4,
you can run this from the interpreter:

    import sys
    sys.maxunicode

If it returns 65535 (0xFFFF), it's using UCS-2. If 1114111 (0x10FFFF), then UCS-4.
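
If you'd rather have a script check this than type it in by hand, something like the following should do. It's only a sketch: the function name unicode_build_width is made up for illustration and isn't part of usfm2osis.py.

```python
import sys

def unicode_build_width():
    """Classify the running interpreter by its largest supported code point.

    Hypothetical helper, not taken from usfm2osis.py.
    """
    # 0xFFFF (65535) means text is stored in 16-bit units (UCS-2 build);
    # 0x10FFFF (1114111) means every code point is one unit (UCS-4 build).
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"

if __name__ == "__main__":
    print(unicode_build_width(), hex(sys.maxunicode))
```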
Linux packagers apparently go the UCS-4 route, so I didn't notice any
issue with using the Language Tags. But trying the above on Windows
shows that the cygwin build and the builds from python.org (2.7 & 3.2)
all use UCS-2. So my script won't work correctly on Windows.
Not to worry, though. I'll just replace the Language Tags with
Noncharacters in the range U+FDD0..U+FDEF. They're UCS-2-safe since
they're BMP codepoints and they're specifically designated as "intended
for process-internal uses, but are not permitted for interchange." So in
the unlikely event that they appear in input, it's the fault of the
USFM-encoder if anything goes awry.
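
Roughly what I have in mind, as a sketch only. The marker names and the helper functions below are hypothetical (the script's real marker set may look nothing like this), and it's written with Python 3 syntax; on 2.7 you'd need unichr and u"" literals instead.

```python
import re

# The 32 noncharacters reserved for process-internal use.
NONCHAR_FIRST, NONCHAR_LAST = 0xFDD0, 0xFDEF

# Hypothetical marker names, for illustration only.
MARKER_NAMES = ["verse_start", "verse_end", "note"]
MARKERS = {name: chr(NONCHAR_FIRST + i) for i, name in enumerate(MARKER_NAMES)}

# Matches any single code point in the reserved range.
NONCHAR_RE = re.compile("[\uFDD0-\uFDEF]")

def reject_reserved(text):
    """Raise if the input already contains one of the reserved code points."""
    match = NONCHAR_RE.search(text)
    if match:
        raise ValueError(
            "input contains reserved noncharacter U+%04X at offset %d"
            % (ord(match.group()), match.start())
        )
    return text

def strip_markers(text):
    """Remove any leftover internal markers before writing output."""
    return NONCHAR_RE.sub("", text)
```

Checking the input up front is optional, but it turns a silent collision into an error that points at the offending file.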
We'll have to watch for input outside of the BMP on UCS-2 Python,
though: there each such character is stored as a surrogate pair, so the
regex engine sees two code units instead of one character.
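
A check along these lines could flag such input before any regexes run. Again just a sketch, with a made-up function name; it assumes the text has already been decoded to a Unicode string.

```python
def has_non_bmp(text):
    """Return True if text contains any character outside the BMP.

    Hypothetical helper. On a UCS-4 build a non-BMP character is a single
    code point above 0xFFFF; on a UCS-2 build it shows up as two surrogate
    code units in the range 0xD800-0xDFFF, so both cases are tested.
    """
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            return True
    return False
```

The script could warn, or bail out on a UCS-2 build, whenever this returns True for an input file.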
--Chris