[sword-devel] XML Numeric character references (entities) in BibleCS
Chris Little
chrislit at crosswire.org
Thu Jan 31 17:37:01 MST 2008
On Jan 31, 2008, at 3:08 PM, DM Smith wrote:
> I imagine there is a C/C++ routine that will convert from an entities
> codepoint to a UTF-8 Character.
The numeric entities can presumably be interpreted as UTF-32 and
encoded as UTF-8 on that basis using either ICU's routines or those in
Sword. The one hangup might be if someone encodes UTF-16 surrogate
pairs as entities. I'm not even sure whether that is legal, much less
how likely anyone would be to do it.
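To make the "interpret as UTF-32, encode as UTF-8" idea concrete, here is a minimal sketch in plain C++. The function name encodeUtf8 is hypothetical and is not the routine from Sword or ICU; it only illustrates the bit layout, and it assumes the entity's number has already been parsed into an integer. It returns an empty string for surrogate values and anything above U+10FFFF, since those are not valid code points to encode on their own.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not the Sword or ICU API): encode one Unicode
// code point, given as a UTF-32 value, as a UTF-8 byte sequence.
// Returns an empty string for surrogates (U+D800..U+DFFF) and for
// values above U+10FFFF, which cannot be encoded as a single character.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // not a valid scalar value
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, an entity such as &#x20AC; would be parsed to 0x20AC and then passed to encodeUtf8, which yields the three bytes E2 82 AC.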
> I'm working on adding -n to osis2mod that will normalize UTF-8 to NFC.
> There's a bug in it and I'll be posting separately about it.
Are you using ICU? There's code in utf8nfc.cpp (in the filters
directory) that should work to do the translation. We might even be
able to use ICU to solve the surrogates issue with a little work.
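For reference, the surrogate arithmetic itself is small enough to sketch without ICU. The function below is a hypothetical helper, not existing Sword code, and it assumes the two entity values have already been parsed into integers. It recombines a high/low surrogate pair into the single code point the pair stands for; ICU's U16_GET_SUPPLEMENTARY macro performs the same calculation, though whether it fits into our filters is untested.

```cpp
#include <cstdint>

// Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
// 'hi' must be a high (lead) surrogate, U+D800..U+DBFF, and 'lo' a low
// (trail) surrogate, U+DC00..U+DFFF. On success, writes the result to 'cp'
// and returns true; otherwise returns false and leaves 'cp' unchanged.
bool combineSurrogates(std::uint32_t hi, std::uint32_t lo, std::uint32_t &cp) {
    if (hi < 0xD800 || hi > 0xDBFF) return false;
    if (lo < 0xDC00 || lo > 0xDFFF) return false;
    // Each surrogate carries 10 bits; the pair encodes (code point - 0x10000).
    cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return true;
}
```

So a pair written as &#xD83D;&#xDE00; would recombine to U+1F600, which can then be encoded as UTF-8 like any other code point.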
--Chris