<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto">Re case sensitivity, it was just a very simple example of the principle.<div><br></div><div>If it doesn’t find all then the search request and the index were not normalized the same. NFC and NFD are different normalizations.</div><div><br></div><div>Note, stripping diacritics may be an appropriate normalization.</div><div><br></div><div>JSword doesn’t properly handle Unicode either. <br><div><br><div id="AppleMailSignature">— DM Smith<div>From my phone. Brief. Weird autocorrections. </div></div><div><br>On Mar 22, 2018, at 9:30 AM, David Haslam <<a href="mailto:dfhdfh@protonmail.com">dfhdfh@protonmail.com</a>> wrote:<br><br></div><blockquote type="cite"><div><div>Thanks, DM.<br></div><div><br></div><div>My question was not about case-sensitivity, but about <b>Unicode normalization</b>.<br></div><div>The main issue is <b>composition</b> vs <b>decomposition</b> and the <b>canonical ordering</b> of diacritics in each glyph.<br></div><div><br></div><div>e.g. Suppose the module contains 181 instances of the name "<b>Efraím</b>" which has 6 characters.<br></div><div>Suppose a user enters in the search box instead "<b>E f r a i ́ m</b>" - (NB, remove the spaces!)<br></div><div>That's 7 characters when normalized to <b>NFD</b>, the <u>acute accent</u> now being a separate character (U+0301 COMBINING ACUTE ACCENT).</div><div><br></div><div>In each front-end, will the search function find all the 181 instances (as <b>Eloquent</b> does) ?<br></div><div>Or (as with <b>Xiphos</b>) will it find none?<br></div><div><br></div><div>DM, what does <b>BibleDesktop</b> do here?<br></div><div><br></div><div class="protonmail_signature_block"><div class="protonmail_signature_block-user"><div>Best regards,<br></div><div><br></div><div>David<br></div><div><br></div><div>PS. 
ProtonMail converts automatically to NFC even though the text was keyed in as NFD, hence the above kludge with spaces.<br></div></div><div><br></div><div class="protonmail_signature_block-proton">Sent with <a href="https://protonmail.com" target="_blank">ProtonMail</a> Secure Email.<br></div></div><div><br></div><div>‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐<br></div><div> On 22 March 2018 1:14 PM, DM Smith <<a href="mailto:dmsmith@crosswire.org">dmsmith@crosswire.org</a>> wrote:<br></div><div> <br></div><blockquote class="protonmail_quote" type="cite"><div>It doesn’t matter that a search doesn’t use Lucene. The principle is the same. The search request has to be normalized to the same form as the searched text. For example, a case-insensitive search normalizes both to a single case. If that isn’t done, even on the fly, then search will fail at times. As they say, “even a blind squirrel gets a nut sometimes.”<br></div><div class=""><br></div><div class="">Regarding Lucene, there are multiple different analyzers (that’s what does the normalization in Lucene). Each normalizes differently. Each has its own documentation. The analyzer that SWORD uses is suited to and was developed for English texts. It is not appropriate for non-Latin texts. There is a multi-language analyzer that is much better, ICUAnalyzer, which follows UAX #29 for tokenization. 
For details see: <a class="" href="https://issues.apache.org/jira/browse/LUCENE-1488">https://issues.apache.org/jira/browse/LUCENE-1488</a> You’ll note that I participate in its development.<br></div><div class=""><br></div><div class="">The osis2mod proclivity for NFC is for display.<br></div><div class=""><br></div><div class=""><div>DM<br></div><div><div><br></div><blockquote class="" type="cite"><div class="">On Mar 22, 2018, at 8:19 AM, David Haslam <<a class="" href="mailto:dfhdfh@protonmail.com">dfhdfh@protonmail.com</a>> wrote:<br></div><div><br></div><div class=""><div class="">Thanks DM,<br></div><div class=""><br></div><div class="">Not all searches make use of the Lucene index !<br></div><div class=""><br></div><div class="">e.g. In <b class="">Xiphos</b>, the advanced search panel gives the user a choice of which type of search.<br></div><div class="">Lucene is only one of these mutually exclusive options.<br></div><div class=""><br></div><div class="">btw. Where is it documented that the creation of a Lucene search index normalizes the Unicode for the index?<br></div><div class="">Do we know for certain that this would occur irrespective of whether normalization was suppressed during module build?<br></div><div class="">i.e. 
With <b class="">osis2mod</b> option <b class=""> -N</b> do not convert UTF-8 or normalize UTF-8 to NFC<br></div><div class=""><br></div><div class=""><br></div><div class="protonmail_signature_block"><div class="protonmail_signature_block-user"><div class="">Best regards,<br></div><div class=""><br></div><div class="">David<br></div></div><div class=""><br></div><div class="protonmail_signature_block-proton">Sent with <a class="" target="_blank" href="https://protonmail.com/">ProtonMail</a> Secure Email.<br></div></div><div class=""><br></div><div class="">‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐<br></div><div class="">On 22 March 2018 10:20 AM, DM Smith <<a class="" href="mailto:dmsmith@crosswire.org">dmsmith@crosswire.org</a>> wrote:<br></div><div class=""><br></div><blockquote type="cite" class="protonmail_quote"><div class="">The requirement is not that the search is normalized to NFC but rather that it is normalized the same way as the index. This should not be a front-end issue.<br></div><div class=""><br></div><div class=""><div class="">Btw it doesn’t matter how Hebrew is stored in the module. Indexing should normalize it to a form that is internal to the engine. <br></div><div class=""><br></div><div class=""><div class="">— DM Smith<br></div><div class="">From my phone. Brief. Weird autocorrections. 
<br></div></div><div class=""><div class=""><br></div><div class="">On Mar 22, 2018, at 5:22 AM, David Haslam <<a class="" href="mailto:dfhdfh@protonmail.com">dfhdfh@protonmail.com</a>> wrote:<br></div></div><blockquote class="" type="cite"><div class=""><div class="">Dear all,<br></div><div class=""><br></div><div class="">Not all front-ends automatically <b class="">normalize the search string</b> to Unicode <b class="">NFC</b>.<br></div><div class="">e.g.<br></div><ul class=""><li class=""><b class="">Eloquent</b> does<br></li><li class=""><b class="">Xiphos</b> does not<br></li></ul><div class="">The data is incomplete for this feature in the table in our wiki page.<br></div><div class=""><a class="" href="https://wiki.crosswire.org/Choosing_a_SWORD_program#Search_and_Dictionary">https://wiki.crosswire.org/Choosing_a_SWORD_program#Search_and_Dictionary</a><br></div><div class=""><br></div><div class=""><span class="size" style="font-size:16px">Please would other front-end app developers supply the missing information</span>. <i class="">Thanks</i>.<br></div><div class=""><br></div><div class=""><u class="">Further thought</u>:<br></div><div class="">For front-ends that also have an <b class="">Advanced search</b> feature, would it not be a useful enhancement to have a <u class="">tick box option</u> for <b class="">Search string normalization</b>?<br></div><div class="">Then if we do make any <u class="">Biblical Hebrew</u> modules with <i class=""><b class="">custom normalization</b></i>, search could at least still work for the "corner cases" in Hebrew, providing the user gave the proper input in the search box.<br></div><div class=""><br></div><div class=""><div class="">cf. The source text for the <b class="">WLC</b> at <a class="" href="http://tanach.us/">tanach.us</a> is <u class="">not</u> normalized to NFC, but our module is.<br></div><div class=""><i class="">I'll refrain from going into a lot more detail here. 
There's an issue in our tracker that covers this.</i><br></div></div><div class=""><br></div><div class="protonmail_signature_block"><div class="protonmail_signature_block-user"><div class="">Best regards,<br></div><div class=""><br></div><div class="">David<br></div></div><div class=""><br></div><div class="protonmail_signature_block-proton">Sent with <a class="" href="https://protonmail.com/" target="_blank">ProtonMail</a> Secure Email.<br></div></div><div class=""><br></div></div></blockquote><blockquote class="" type="cite"><div class=""><div class=""><span class="">_______________________________________________</span><br></div><div class=""><span class="">sword-devel mailing list: <a class="" href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a></span><br></div><div class=""><span class=""><a class="" href="http://www.crosswire.org/mailman/listinfo/sword-devel">http://www.crosswire.org/mailman/listinfo/sword-devel</a></span><br></div><div class=""><span class="">Instructions to unsubscribe/change your settings at above page</span><br></div></div></blockquote></div></blockquote><div class=""><br></div></div></blockquote></div></div></blockquote><div><br></div></div></blockquote></div></div></body></html>