[sword-devel] NFC Normalization and osis2mod

Sat Feb 23 06:53:49 MST 2008

On Feb 23, 2008, at 7:46 AM, Chris Little wrote:

>
>
> DM Smith wrote:
>> The thing I noticed in Sword's ICU filters is that it was not consistent
>> in how it set up the UChar array or converted that back to a SWBuf.
>
> Thanks for digging through everything. I will see if I can't make things
> a little more consistent once I get UTF8NFC debugged.
>
>> The setup may be wrong:
>>        int32_t len = text.length() * 2;
>>        source = new UChar[len + 1];
>>        len = ucnv_toUChars(conv, source, len, text.c_str(), -1, &err);
>
> Yes, that's where I'm focusing my attention.
>
>> Many of the filters just use text.length(), one uses text.length()*2+1,
>> another 5+text.length()*5 and only this one uses text.length()*2.
>
> Well, here are some guesses as to what these might have come from....
>
> X+1 is probably making room for a null termination (probably unnecessary
> since everything is null terminated to begin with).

The SWBuf is null terminated.

From what I can understand from the ICU docs, UChar buffers do not need
to be null terminated, but they can be. The +1 is necessary to ensure
space for a null terminator. If the UChar buffer is not null terminated,
then the actual length needs to be remembered at every stage.
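
To make the trade-off concrete, here is a minimal sketch in plain C++
with no ICU calls. It assumes char16_t (C++11) as a stand-in for UChar,
and the names u16Length and makeTerminated are made up for illustration;
they are not Sword or ICU functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Length of a NUL-terminated UTF-16 buffer, counted in 16-bit code units.
// This only works if the producer reserved a slot for the terminator.
std::size_t u16Length(const char16_t *s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}

// Copy len code units into a buffer that has room for a terminator
// (the "+1"), so later stages can recover the length from the data.
std::vector<char16_t> makeTerminated(const char16_t *src, std::size_t len) {
    std::vector<char16_t> buf(len + 1);  // elements start out as 0
    for (std::size_t i = 0; i < len; ++i) {
        buf[i] = src[i];
    }
    buf[len] = 0;                        // explicit terminator
    return buf;
}
```

Without the extra slot there is nowhere to put the terminator, and the
length returned by the conversion call has to be carried along instead.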

>
>
> X*2 could be either doubling the byte size to accommodate conversion
> from 8-bit chars to 16-bit chars OR could be acceptance of the fact that
> characters we encounter might actually be represented as surrogate pairs
> in UTF-16. (ICU uses UTF-16 internally.)

I don't think the former applies. SWBuf.length() will return the
number of bytes in the array, which will be equal to or greater than
the number of UTF-8 characters. I think that a UChar is the size of one
UTF-16 code unit (16 bits), so the receiving buffer, source, needs only
to be big enough for the maximal number of UTF-16 code units, and that
number is never more than the number of UTF-8 bytes.

There are comments that the *2 represents space for surrogate pairs.
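
As a rough check on the sizing, the sketch below counts how many UTF-16
code units a UTF-8 string would need. It is plain C++, not ICU; the name
utf16UnitsNeeded is hypothetical, and it assumes the input is well-formed
UTF-8 (it does no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count the UTF-16 code units needed for a well-formed UTF-8 string.
// 1-, 2- and 3-byte sequences become one code unit each; a 4-byte
// sequence becomes two (a surrogate pair).
std::size_t utf16UnitsNeeded(const std::string &utf8) {
    std::size_t units = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) == 0x80) {
            continue;                  // continuation byte adds nothing
        }
        units += (b >= 0xF0) ? 2 : 1;  // 4-byte lead byte needs a pair
    }
    // Holds for well-formed input only; malformed input may trip this.
    assert(units <= utf8.size());
    return units;
}
```

If that is right, a buffer of text.length() code units plus one for the
terminator would be enough for the conversion itself, even when
surrogate pairs occur.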

>
>
> X*5 is probably allowing for expansion from a character to its UTF-8
> representation, which is maximally 5-bytes long.

The *5 is only used in the NFKD filter, so it probably represents the
maximal size of a decomposition. I have no guess as to why +5.

I'm not familiar with surrogate pairs, but it appears that there is no  
accounting for them.
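
For reference, this is how I understand a surrogate pair to be built
from a code point above U+FFFF. It is a standalone sketch, not code from
Sword or ICU; encodeUtf16 is a made-up name and char16_t (C++11) stands
in for UChar.

```cpp
#include <cassert>
#include <cstdint>

// Encode one Unicode code point as UTF-16.
// Returns the number of 16-bit code units written to out (1 or 2).
int encodeUtf16(std::uint32_t cp, char16_t out[2]) {
    assert(cp <= 0x10FFFF);              // highest valid code point
    assert(cp < 0xD800 || cp > 0xDFFF);  // reserved surrogate range
    if (cp < 0x10000) {
        out[0] = static_cast<char16_t>(cp);
        return 1;                        // fits in one code unit
    }
    cp -= 0x10000;                       // 20 bits remain
    out[0] = static_cast<char16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) comes out as the two code
units 0xD834 0xDD1E, so one character occupies two UChar slots.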

>
>
> I'll get it all sorted out eventually, but those are what those numbers
> probably represent.

>
>
> I had a bit of difficulty getting BCB5 installed and working in Vista,
> but I think I've got everything running well enough for the moment so
> that I can get to work on this.

Many thanks!