[sword-devel] NFC Normalization and osis2mod

Sat Feb 23 07:29:23 MST 2008

On second thought... I think we'd better go with the C++ interface code 
you posted (and we should apply it across the other filters as 
appropriate). I still want to run some tests to see if it's really going 
to kill performance when we run it on large chunks of data, but I would 
guess we can commit your patch tomorrow.

In order to do things right, using the C interface, we would basically 
have to re-implement all of the inefficient parts hidden behind the C++ 
interface anyway. Specifically I'm thinking of the pre-flighting of 
conversion calls to determine string sizes when we're converting between 
UTF-8 & UTF-16 and also when we run various normalizations, converters, 
& transliterators on the UTF-16 itself.

DM Smith wrote:
> On Feb 23, 2008, at 7:46 AM, Chris Little wrote:
>> X*2 could be either doubling the byte size to accommodate conversion
>> from 8-bit chars to 16-bit chars OR could be acceptance of the fact
>> that characters we encounter might actually be represented as
>> surrogate pairs in UTF-16. (ICU uses UTF-16 internally.)
> 
> I don't think the former applies. SWBuf.length() will return the  
> number of bytes in the array, which will be either equal or greater  
> than the number of UTF-8 characters. I think that a UChar is the size  
> of a UTF-16 character, so the receiving buffer, source, needs only to  
> be big enough for the maximal number of UTF-16 bytes.
> 
> There are comments that the *2 represents space for surrogate pairs.

ICU UChars are 16 bits long. A character in UTF-16 is either one or 
two 16-bit shorts long. If the character is in Plane 0 (the BMP), it's 
one short long. If it's outside Plane 0, it is represented by a 
surrogate pair (2 shorts). So the number of UChars in a string can be 
up to double the number of characters in that string.
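
To make the one-or-two-shorts rule concrete, here is a minimal sketch in 
plain C++ (no ICU). The helper names utf16Units and utf16Length are made 
up for illustration and are not part of ICU or SWORD:

```cpp
#include <cstddef>
#include <vector>

// Number of 16-bit shorts UTF-16 needs for one code point.
// Hypothetical helper: BMP code points take one short,
// anything above U+FFFF takes a surrogate pair (two shorts).
inline int utf16Units(char32_t cp) {
    return cp <= 0xFFFF ? 1 : 2;
}

// Total number of 16-bit shorts for a sequence of code points.
inline std::size_t utf16Length(const std::vector<char32_t> &text) {
    std::size_t n = 0;
    for (char32_t cp : text) {
        n += utf16Units(cp);
    }
    return n;
}
```

A string made up entirely of non-BMP characters is the worst case: its 
length in shorts is exactly twice its length in characters.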

Now that I think about it, the number of UTF-8 bytes necessary to 
represent a character is always greater than or equal to the number of 
UTF-16 shorts necessary to represent it. But this is exactly the sort 
of thinking I'd like to avoid by using your patch, assuming the C++ 
interface doesn't slow things down too badly in actual usage. 
Normalization itself could also grow the string in ways that I don't 
really want to think about.
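
As a rough sketch of the bytes-versus-shorts comparison (plain C++, no 
ICU): the function name utf16UnitsForUtf8 is made up for illustration, 
and the input is assumed to be well-formed UTF-8. It only covers the 
conversion step, not normalization:

```cpp
#include <cstddef>
#include <string>

// Count the 16-bit shorts UTF-16 would need for a UTF-8 string by
// looking at lead bytes only. Assumes the input is valid UTF-8.
inline std::size_t utf16UnitsForUtf8(const std::string &utf8) {
    std::size_t units = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80) {
            continue;  // continuation byte: belongs to a character already counted
        }
        // A 4-byte sequence (lead byte 0xF0 and up) encodes a non-BMP
        // character, which needs a surrogate pair; everything else is one short.
        units += (b >= 0xF0) ? 2 : 1;
    }
    return units;
}
```

Per character that works out to 1, 2, 3, or 4 bytes against 1, 1, 1, or 
2 shorts, so for valid input the result never exceeds utf8.size(). That 
bound says nothing about how much the string can grow under 
normalization.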

--Chris