[sword-devel] NFC Normalization and osis2mod

Sat Feb 23 11:33:05 MST 2008

Chris,

In running the KJV through the filter, the old buggy way and my new  
way, I did not see much difference. The OO way might actually be faster.

Using the filter rather than skipping it is obviously a time hit. That's
why I made it optional in osis2mod. The input might already be NFC (or
cp1252) and not need the filter.

None of the filters test for errors when marshalling to and from a  
swbuf. Perhaps they should.

In reading the documentation on UnicodeString, there may be a better  
way than what I coded to do the extraction:

Instead of:
         // potentially, it can grow to 2x the original size
         text.setSize(text.size()*2);
         int32_t len = target.extract(text.getRawData(), text.size(), conv, status);
         text.setSize(len);

Perhaps, do something reusable like this (I wrote this off the top of  
my head, so it may need help. I kept wanting to do it in Perl ;) ):
void fromUnicodeString(SWBuf &text, UnicodeString &target, UConverter *conv) {
	UErrorCode status = U_ZERO_ERROR;
	bool tryAgain = false;
	int32_t size = (int32_t)text.size();
	do {
		// extract() returns the length the output needs, even when it does not fit.
		int32_t len = target.extract(text.getRawData(), size, conv, status);

		tryAgain = (len > size);
		if (tryAgain) {
			// Our SWBuf was too small, so resize it and try again.
			// ICU calls do nothing while status holds a failure, so clear it first.
			text.setSize(len);
			size = len;
			status = U_ZERO_ERROR;
		}
		else if (len < size) {
			// Our SWBuf was too big, so shrink it to the length actually written.
			text.setSize(len);
		}
	} while (tryAgain);
}

This is slightly more reliable and perhaps a bit faster.
a) More reliable because it checks for an output buffer that's too
small, whereas the first does no checking and assumes that the output
buffer is bigger than it needs to be. The second does error checking
and recovery.
b) More reliable, because it is written once and used many times.
c) Potentially faster because the input is already UTF-8. Normalization
can increase the number of bytes, but I think that is highly unlikely.
It is more likely to shrink it. This defers the reallocation to the
failure condition, so it should normally go through the loop only once.
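
The "try with the current size, grow only on failure" idea can be tried without ICU or SWORD. This is only a sketch under assumptions: std::string stands in for SWBuf, and extractInto() is a hypothetical stand-in for UnicodeString::extract() that copies only when the data fits and always returns the length it needs. fillWithRetry() is likewise an illustrative name, not part of either library.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical stand-in for UnicodeString::extract(): copies src into dest
// only when it fits, and always returns the length the output needs.
static int extractInto(char *dest, int capacity, const std::string &src) {
    int needed = static_cast<int>(src.size());
    if (needed <= capacity) {
        std::memcpy(dest, src.data(), src.size());
    }
    return needed;
}

// Fill buf from src, growing it only if the first attempt did not fit.
// Returns the number of attempts, to show how often the retry path is taken.
static int fillWithRetry(std::string &buf, const std::string &src) {
    int attempts = 0;
    int size = static_cast<int>(buf.size());
    bool tryAgain = false;
    do {
        ++attempts;
        int len = extractInto(&buf[0], size, src);
        tryAgain = (len > size);   // too small: resize and go around again
        if (len != size) {
            buf.resize(len);       // grow for the retry, or trim the excess
        }
        size = len;
    } while (tryAgain);
    return attempts;
}
```

With a buffer that is already large enough this makes one attempt and trims the buffer; with one that is too small it makes exactly two.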

Something similar can be done to create a UnicodeString from a SWBuf,  
but it is not much of a gain.
UnicodeString toUnicodeString(SWBuf &text, UConverter *conv)
{
	UErrorCode status = U_ZERO_ERROR;
	// Returned by value; a reference to the local would dangle.
	UnicodeString source(text.getRawData(), text.length(), conv, status);
	return source;
}

DM

On Feb 23, 2008, at 9:29 AM, Chris Little wrote:

> On second thought... I think we'd better go with the C++ interface code
> you posted (and we should apply it across the other filters as
> appropriate). I still want to run some tests to see if it's really going
> to kill performance when we run it on large chunks of data, but I would
> guess we can commit your patch tomorrow.
>
> In order to do things right, using the C interface, we would basically
> have to re-implement all of the inefficient parts hidden behind the C++
> interface anyway. Specifically I'm thinking of the pre-flighting of
> conversion calls to determine string sizes when we're converting between
> UTF-8 & UTF-16 and also when we run various normalizations, converters,
> & transliterators on the UTF-16 itself.
>
> DM Smith wrote:
>> On Feb 23, 2008, at 7:46 AM, Chris Little wrote:
>>> X*2 could be either doubling the byte size to accommodate conversion
>>> from 8-bit chars to 16-bit chars OR could be acceptance of the fact that
>>> characters we encounter might actually be represented as surrogate pairs
>>> in UTF-16. (ICU uses UTF-16 internally.)
>>
>> I don't think the former applies. SWBuf.length() will return the
>> number of bytes in the array, which will be either equal or greater
>> than the number of UTF-8 characters. I think that a UChar is the size
>> of a UTF-16 character, so the receiving buffer, source, needs only to
>> be big enough for the maximal number of UTF-16 bytes.
>>
>> There are comments that the *2 represents space for surrogate pairs.
>
> ICU UChars are 16-bits long. A character in UTF-16 can be either one or
> two 16-bit shorts long. If the character is in Plane 0 (the BMP) then
> it's one short long. If it's outside Plane 0, then it will be
> represented by a surrogate pair (2 shorts). So the number of UChars in a
> string might be double the number of characters in that string.
>
> Now that I think about it, the number of UTF-8 bytes necessary to
> represent a character is always greater than or equal to the number of
> UTF-16 shorts necessary to represent it, but this is all the sort of
> thinking I'd like to avoid by using your patch, assuming the C++
> interface doesn't slow things down too badly in actual usage.
> Normalization itself could cause growth of the string size that I don't
> really want to think about.
>
> --Chris
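
The sizing argument in the quoted text can be checked per code point. The sketch below is standalone C++ and does not use ICU; utf8Length(), utf16Length() and toSurrogates() are illustrative names, and the code assumes valid code points in the range U+0000..U+10FFFF.

```cpp
#include <cassert>
#include <cstdint>

// Number of UTF-8 bytes needed for one code point.
static int utf8Length(std::uint32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Number of 16-bit units needed in UTF-16: one inside the BMP,
// two (a surrogate pair) outside it.
static int utf16Length(std::uint32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// Split a code point above U+FFFF into its high and low surrogates.
static void toSurrogates(std::uint32_t cp, std::uint16_t &high, std::uint16_t &low) {
    std::uint32_t v = cp - 0x10000;
    high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
}
```

For every code point, utf8Length() is at least utf16Length(), so a UChar buffer with as many units as the input has UTF-8 bytes is enough for the conversion step itself. This says nothing about how normalization changes the length afterwards.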