<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Feb 24, 2008, at 4:46 PM, Chris Little wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><br><br>DM Smith wrote:<br><blockquote type="cite">I have added a -n flag to osis2mod.<br></blockquote><br>I'm going to add it to the other major importers (osis2gbs &amp; imp2*) just <br>as soon as I get things into a fairly stable state.<br><br><blockquote type="cite">This flag, to be enabled, requires osis2mod to be compiled with ICU <br></blockquote><blockquote type="cite">support enabled.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">-n stands for normalized to NFC, the agreed upon UTF-8 encoding<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">When should this flag be used?<br></blockquote><blockquote type="cite">1) When the input is UTF-8<br></blockquote><blockquote type="cite">and<br></blockquote><blockquote type="cite">2) It is not known to be NFC<br></blockquote><br>First, I feel like there's really no reason NOT to perform <br>normalization, provided that the input is UTF-8. Even if the input is <br>already in NFC, it won't hurt anything to do it again. It will take <br>extra time to compile the module, but I feel like it's better to be safe <br>than sorry in this case.</blockquote><div><br class="webkit-block-placeholder"></div><div><br class="webkit-block-placeholder"></div>I mostly agree. But once I know that the module is NFC, I'd rather not take the hit. 
I must have built the KJV module 100 or more times before I got it right.</div><div><br class="webkit-block-placeholder"></div><div><br><blockquote type="cite"><br><br>Second, your comment about needing UTF-8 input makes me think we should <br>go ahead and add encoding conversion to the importers as well, possibly <br>with automatic charset detection.<br></blockquote></div><br><div>I'd like to see OSIS modules also be UTF-8.</div><div><br></div><div>What mechanism were you thinking of for automatic charset detection? I have a buggy routine that detects whether something is UTF-8, 7-bit ASCII, or something else. We could use that (once I fix it).</div><div><br class="webkit-block-placeholder"></div><div>As to automatic charset detection, could we require that every input to osis2mod begin with an XML declaration such as:</div><div>&lt;?xml version="1.0" encoding="UTF-8"?&gt;</div><div>or</div><div>&lt;?xml version="1.0" encoding="cp1252"?&gt;</div><div>and use whatever value the encoding attribute gives?</div><div><br class="webkit-block-placeholder"></div><div><br class="webkit-block-placeholder"></div><div>-- DM</div><div><br></div></body></html>