[sword-devel] NFC and osis2mod
DM Smith
dmsmith555 at yahoo.com
Mon Feb 4 09:15:01 MST 2008
Chris,
Thanks. I'll see what I can do, but my time is also limited.
DM
Chris Little wrote:
> I've been meaning to work on this, but I thought I'd try to point you
> in the right direction since I'm pretty sure I won't have time in the
> next couple days. (I'll be at the UTC meeting at Apple--thinking about
> all things Unicode, but not working on them.) The bugs are almost
> definitely within utf8nfc.cpp. It's never been employed, to my
> knowledge, so it's never been debugged. It's also fairly old and
> should probably be updated to a newer version of the ICU API. (The
> code could work with the existing functions, but it might be best to
> update using some of the copious examples at ICU.)
>
> Your code looks fine to me. My old utf8nfc.cpp code looks a mess.
>
> --Chris
>
>
> On Jan 31, 2008, at 5:19 PM, DM Smith wrote:
>
>
>> Can someone offer some pointers as to what I am doing wrong?
>>
>> I am trying to add the ability to osis2mod to optionally ensure that a
>> UTF-8 document is normalized to NFC.
>>
>> I added -n as a flag to indicate that normalization should occur and
>> set a global boolean variable "normalize" to true iff the flag is
>> present.
>>
>> Rather than reinventing the wheel, I figured Sword's UTF8NFC filter
>> would be the ticket.
>>
>> First I added the header with:
>>
>> #ifdef _ICU_
>> #include <utf8nfc.h>
>> #endif
>>
>> And I created a global variable:
>>
>> #ifdef _ICU_
>> UTF8NFC normalizer;
>> #endif
>>
>>
>> Then right before adding the entry I ran it through the filter:
>>
>> #ifdef _ICU_
>> if (normalize) {
>>     // Note the hack of 2 to mimic a real key. TODO: remove all hacks
>>     normalizer.processText(activeVerseText, (SWKey *)2);
>> }
>> #endif
>>
>> Now I ran the KJV.xml at www.crosswire.org/~dmsmith/kjv2006 through
>> osis2mod.
>>
>> Since I thought I had already normalized the text, I expected a diff
>> to show nothing.
>>
>> However I found corruption in Matthew 3:17 at the end of the raw text
>> in the module. (and many places later.)
>>
>> The corruption is always at the end of the line. Here is the raw text
>> for that verse:
>> <w lemma="strong:G3588" morph="robinson:T-NSM" src="13"></w><w
>> lemma="strong:G2532" morph="robinson:CONJ" src="1">And</w> <w
>> lemma="strong:G2400" morph="robinson:V-2AAM-2S" src="2">lo</w> <w
>> lemma="strong:G5456" morph="robinson:N-NSF" src="3">a voice</w> <w
>> lemma="strong:G1537" morph="robinson:PREP" src="4">from</w> <w
>> lemma="strong:G3588 strong:G3772" morph="robinson:T-GPM robinson:N-
>> GPM" src="5 6">heaven</w>, <w lemma="strong:G3004" morph="robinson:V-
>> PAP-NSF" src="7">saying</w>, <w lemma="strong:G3778"
>> morph="robinson:D-
>> NSM" src="8">This</w> <w lemma="strong:G2076" morph="robinson:V-
>> PXI-3S" src="9">is</w> <w lemma="strong:G3450" morph="robinson:P-1GS"
>> src="12">my</w> <w lemma="strong:G27" morph="robinson:A-NSM"
>> src="14">beloved</w> <w lemma="strong:G3588 strong:G5207"
>> morph="robinson:T-NSM robinson:N-NSM" src="10 11">Son</w>, <w
>> lemma="strong:G1722" morph="robinson:PREP" src="15">in</w> <w
>> lemma="strong:G3739" morph="robinson:R-DSM" src="16">whom</w> <w
>> lemma="strong:G2106" morph="robinson:V-AAI-1S" src="17">I am well
>> pleased</w>.<milestone resp="pdy 2003-12-14-08:48" type="x-
>> strongsMarkup"/>="22"꧁
>>
>>
>> Any help would be appreciated.
>>
>> Thanks!
>>
>> Working together,
>> DM Smith
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
>>
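Since the input was expected to be in NFC already, the filter should leave every verse unchanged, so one way to narrow the problem down might be to compare each entry before and after the filter runs and report where they first diverge. The following is only a minimal sketch of that idea: firstDifference is a hypothetical helper, not part of Sword or osis2mod, and it uses nothing beyond the C++ standard library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical debugging helper (not part of Sword or osis2mod).
// Returns the byte offset of the first difference between the text
// before and after filtering, or std::string::npos if they are identical.
std::size_t firstDifference(const std::string &before, const std::string &after) {
    const std::size_t common =
        before.size() < after.size() ? before.size() : after.size();

    // Compare the shared prefix byte by byte.
    for (std::size_t i = 0; i < common; ++i) {
        if (before[i] != after[i]) {
            return i;
        }
    }

    // Same prefix: the strings differ only if one is longer, in which
    // case the difference starts where the shorter one ends.
    return before.size() == after.size() ? std::string::npos : common;
}
```

In osis2mod this could be used by copying the verse text into a std::string before the processText call and another one after it (assuming the buffer exposes its contents, for example through c_str()), then logging the verse and the returned offset whenever the result is not std::string::npos. For corruption like the trailing garbage shown above, the reported offset would be expected to fall at the end of the original text.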