[sword-devel] NFC Normalization and osis2mod
Chris Little
chrislit at crosswire.org
Sat Feb 23 07:29:23 MST 2008
On second thought... I think we'd better go with the C++ interface code
you posted (and we should apply it across the other filters as
appropriate). I still want to run some tests to see if it's really going
to kill performance when we run it on large chunks of data, but I would
guess we can commit your patch tomorrow.
In order to do things right, using the C interface, we would basically
have to re-implement all of the inefficient parts hidden behind the C++
interface anyway. Specifically I'm thinking of the pre-flighting of
conversion calls to determine string sizes when we're converting between
UTF-8 & UTF-16 and also when we run various normalizations, converters,
& transliterators on the UTF-16 itself.
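For anyone following along, the pre-flighting pattern can be sketched roughly as below. This is only an illustration under stated assumptions: `utf8ToUtf16` and `convertWithPreflight` are hypothetical helpers written for this example, not ICU functions, and the input is assumed to be well-formed UTF-8. The converter always reports how many 16-bit units it needs, so a first call with no buffer measures the output and a second call fills it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a C-style converter; NOT an ICU function.
// Converts well-formed UTF-8 to UTF-16. Always returns the number of
// 16-bit units required; writes to dest only where capacity allows.
std::size_t utf8ToUtf16(const std::string& src, char16_t* dest,
                        std::size_t capacity) {
    std::size_t needed = 0;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char b = static_cast<unsigned char>(src[i]);
        std::uint32_t cp;
        std::size_t len;
        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else                         { cp = b & 0x07; len = 4; }
        assert(i + len <= src.size());  // input is assumed well-formed
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(src[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            if (needed < capacity) dest[needed] = static_cast<char16_t>(cp);
            needed += 1;
        } else {
            cp -= 0x10000;  // outside the BMP: emit a surrogate pair
            if (needed + 1 < capacity) {
                dest[needed]     = static_cast<char16_t>(0xD800 + (cp >> 10));
                dest[needed + 1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
            }
            needed += 2;
        }
    }
    return needed;
}

// Two-pass "pre-flight" use: measure first, then allocate and convert.
std::u16string convertWithPreflight(const std::string& src) {
    std::size_t needed = utf8ToUtf16(src, nullptr, 0);  // pass 1: size only
    std::u16string out(needed, u'\0');
    if (needed > 0) utf8ToUtf16(src, &out[0], needed);  // pass 2: conversion
    return out;
}
```

The cost is that every string gets walked twice, which is the inefficiency referred to above.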
DM Smith wrote:
> On Feb 23, 2008, at 7:46 AM, Chris Little wrote:
>> X*2 could be either doubling the byte size to accommodate conversion
>> from 8-bit chars to 16-bit chars OR could be acceptance of the fact
>> that characters we encounter might actually be represented as
>> surrogate pairs in UTF-16. (ICU uses UTF-16 internally.)
>
> I don't think the former applies. SWBuf.length() will return the
> number of bytes in the array, which will be either equal or greater
> than the number of UTF-8 characters. I think that a UChar is the size
> of a UTF-16 character, so the receiving buffer, source, needs only to
> be big enough for the maximal number of UTF-16 bytes.
>
> There are comments that the *2 represents space for surrogate pairs.
ICU UChars are 16 bits long. A character in UTF-16 is either one or
two 16-bit shorts long. If the character is in Plane 0 (the BMP), it is
one short long. If it's outside Plane 0, it is represented by a
surrogate pair (2 shorts). So the number of UChars in a string can be
as much as double the number of characters in that string.
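A minimal sketch of that rule, for illustration only: `encodeUtf16` is a hypothetical helper written for this example (not part of ICU or SWORD) that encodes a single code point and so shows when one short is enough and when a surrogate pair is needed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Returns one unit for Plane 0 (BMP), two units (a surrogate pair) otherwise.
std::u16string encodeUtf16(std::uint32_t cp) {
    // Valid scalar values only: at most U+10FFFF and not a lone surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        std::uint32_t v = cp - 0x10000;                             // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

For example, U+05D0 (Hebrew alef) comes back as a single unit, while U+10330 (Gothic ahsa) comes back as the pair 0xD800 0xDF30.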
Now that I think about it, the number of UTF-8 bytes necessary to
represent a character is always greater than or equal to the number of
UTF-16 shorts necessary to represent it. But this is exactly the sort
of thinking I'd like to avoid by using your patch, assuming the C++
interface doesn't slow things down too badly in actual usage.
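The bytes-versus-shorts claim is easy to check by brute force. The sketch below uses hypothetical helper names (`utf8Length`, `utf16Length`, `utf8NeverShorter`) invented for this example; it compares the two encoded lengths for every code point.

```cpp
#include <cassert>
#include <cstdint>

// Number of UTF-8 bytes needed for one code point.
int utf8Length(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of UTF-16 16-bit units needed for one code point.
int utf16Length(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 1 : 2;
}

// Checks every code point except the surrogate range U+D800..U+DFFF.
bool utf8NeverShorter() {
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;
        if (utf8Length(cp) < utf16Length(cp)) return false;
    }
    return true;
}
```

If that holds, a destination of as many UChars as there are UTF-8 bytes would be enough for the conversion step alone. It says nothing about what a later normalization pass might need.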
Normalization itself could cause growth of the string size that I don't
really want to think about.
--Chris