[sword-devel] NFC Normalization and osis2mod
DM Smith
dmsmith555 at yahoo.com
Sat Feb 23 06:53:49 MST 2008
On Feb 23, 2008, at 7:46 AM, Chris Little wrote:
>
>
> DM Smith wrote:
>> The thing I noticed in Sword's ICU filters is that it was not consistent
>> in how it set up the UChar array or converted that back to a SWBuf.
>
> Thanks for digging through everything. I will see if I can't make things
> a little more consistent once I get UTF8NFC debugged.
>
>> The setup may be wrong:
>>     int32_t len = text.length() * 2;
>>     source = new UChar[len + 1];
>>     len = ucnv_toUChars(conv, source, len, text.c_str(), -1, &err);
>
> Yes, that's where I'm focusing my attention.
>
>> Many of the filters just use text.length(), one uses text.length()*2+1,
>> another 5+text.length()*5 and only this one uses text.length()*2.
>
> Well, here are some guesses as to what these might have come from....
>
> X+1 is probably making room for a null termination (probably unnecessary
> since everything is null terminated to begin with).
The SWBuf is null terminated.
From what I can understand from the ICU docs, UChar buffers do not need
to be null terminated, but they can be. The +1 is necessary to ensure
space for a null terminator. If the UChar buffer is not null terminated,
then its actual length needs to be remembered at every stage.
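
A rough illustration of the null-terminated option, using plain char16_t in place of ICU's UChar. The helper names are hypothetical and are not taken from Sword or ICU; this is only a sketch of why the extra slot matters:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Sword or ICU function): copy `count` UTF-16
// code units into a buffer with one extra slot for the terminating 0.
std::vector<char16_t> makeTerminated(const char16_t *units, std::size_t count) {
    assert(units != nullptr || count == 0);
    std::vector<char16_t> buf(count + 1);   // the "+1" discussed above
    for (std::size_t i = 0; i < count; ++i) {
        buf[i] = units[i];
    }
    buf[count] = 0;                          // explicit null terminator
    return buf;
}

// Hypothetical helper: recover the length by scanning for the terminator.
// This only works if the buffer really was terminated.
std::size_t terminatedLength(const char16_t *s) {
    std::size_t n = 0;
    while (s[n] != 0) {
        ++n;
    }
    return n;
}
```

Without the extra slot, the length would have to be passed along with the buffer at every step instead of being recomputed by a scan like terminatedLength().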
>
>
> X*2 could be either doubling the byte size to accommodate conversion
> from 8-bit chars to 16-bit chars OR could be acceptance of the fact that
> characters we encounter might actually be represented as surrogate pairs
> in UTF-16. (ICU uses UTF-16 internally.)
I don't think the former applies. SWBuf.length() will return the
number of bytes in the array, which will be equal to or greater than
the number of UTF-8 characters. I think that a UChar is the size of a
UTF-16 code unit (16 bits), so the receiving buffer, source, needs only
to be big enough for the maximal number of UTF-16 code units, not bytes.
There are comments that the *2 represents space for surrogate pairs.
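
To make the sizing argument concrete, here is a minimal sketch that counts how many UTF-16 code units a UTF-8 string needs. It assumes well-formed UTF-8 input, and utf16UnitsNeeded is a hypothetical name, not something from Sword or ICU. Under that assumption a 1-, 2- or 3-byte sequence becomes one code unit and a 4-byte sequence becomes two (a surrogate pair), so the count never exceeds the byte length:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Sword or ICU): count the UTF-16 code
// units needed for a string that is assumed to be well-formed UTF-8.
std::size_t utf16UnitsNeeded(const std::string &utf8) {
    std::size_t units = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) == 0x80) {
            continue;   // continuation byte: already counted via its lead byte
        }
        // A 4-byte sequence (lead byte 11110xxx) encodes a code point above
        // U+FFFF, which takes a surrogate pair; everything else takes one unit.
        units += ((byte & 0xF8) == 0xF0) ? 2 : 1;
    }
    return units;
}
```

If that reasoning holds, text.length() + 1 UChars would be enough room for the plain conversion of well-formed input, before any normalization expands the text.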
>
>
> X*5 is probably allowing for expansion from a character to its UTF-8
> representation, which is maximally 5-bytes long.
This is only used in the nfkd filter. So *5 probably represents the
maximal size of a decomposition. I have no guess as to why +5.
I'm not familiar with surrogate pairs, but it appears that there is no
accounting for them.
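
For reference, a surrogate pair is how UTF-16 represents a code point above U+FFFF as two 16-bit code units. A minimal sketch of the standard arithmetic follows; toSurrogatePair is a hypothetical name and is not a Sword or ICU function:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point in U+10000..U+10FFFF into its
// UTF-16 high (lead) and low (trail) surrogates.
std::pair<char16_t, char16_t> toSurrogatePair(std::uint32_t codePoint) {
    assert(codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    std::uint32_t offset = codePoint - 0x10000;   // leaves 20 significant bits
    char16_t high = static_cast<char16_t>(0xD800 + (offset >> 10));    // top 10 bits
    char16_t low  = static_cast<char16_t>(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
    return std::make_pair(high, low);
}
```

For example, toSurrogatePair(0x1D11E) yields the pair 0xD834, 0xDD1E, so a single such character occupies two UChar slots.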
>
>
> I'll get it all sorted out eventually, but those are what those numbers
> probably represent.
>
>
> I had a bit of difficulty getting BCB5 installed and working in Vista,
> but I think I've got everything running well enough for the moment so
> that I can get to work on this.
Many thanks!