[jsword-devel] Hamza - was strongs search
DM Smith
dmsmith555 at yahoo.com
Sun May 18 14:34:23 MST 2008
On May 18, 2008, at 4:59 PM, Peter von Kaehne wrote:
>> This is not merely an issue with arabic/farsi but with every
>> "accented" language, e.g. French, Hebrew, German, Greek, ...
>
> But it is of varying criticality in the various languages.
I was pointing out not the usefulness of accents or how one would
enter them on the computer, but the byte sequence of the text and of
the search request. These can be different and yet carry exactly the
same meaning.
The problem in BD today is that we look for code point matches: the
user's entry has to match the text's content exactly. With software,
we can minimize the mismatch.
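To make that concrete, here is a rough sketch using plain
java.text.Normalizer (JSword may well route this through ICU4J or
Lucene analyzers instead; the class name and strings are just
illustrations). Two canonically equivalent strings do not compare
equal until both sides go through the same normalization:

import java.text.Normalizer;

public class CodePointMatch {
    public static void main(String[] args) {
        String precomposed = "f\u00FCr";   // "fuer" with U+00FC LATIN SMALL LETTER U WITH DIAERESIS
        String decomposed  = "fu\u0308r";  // same word as u + U+0308 COMBINING DIAERESIS

        // Comparing code points directly fails, even though the two
        // strings are canonically equivalent and look identical on screen.
        System.out.println(precomposed.equals(decomposed)); // false

        // Normalizing both sides to the same form makes them match.
        System.out.println(
            Normalizer.normalize(precomposed, Normalizer.Form.NFC).equals(
            Normalizer.normalize(decomposed, Normalizer.Form.NFC)));  // true
    }
}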
>
>
> In German an umlaut is so much part of the language that no one would
> ever even consider searching without umlaut.
Yes. But the Unicode text could have the umlaut encoded as a single
precomposed code point, which takes 2 bytes in UTF-8. Or it could be
decomposed into the base letter followed by a combining umlaut: 2 code
points, totaling 3 bytes. If it is stored as 3 bytes in the index,
then it takes the same 3 bytes in the search to find it, not 2.
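For example (a throwaway sketch assuming UTF-8 storage; the class name
is made up):

import java.io.UnsupportedEncodingException;

public class UmlautBytes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String composed   = "\u00F6";   // o-umlaut as one precomposed code point (U+00F6)
        String decomposed = "o\u0308";  // o followed by U+0308 COMBINING DIAERESIS

        System.out.println(composed.getBytes("UTF-8").length);   // 2
        System.out.println(decomposed.getBytes("UTF-8").length); // 3
    }
}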
Alternatively, umlauts have traditionally been written by following
the letter that should take the umlaut with an 'e', as in 'oe' for
'ö'. It would be reasonable to allow the user to enter this and find
the word with the umlaut. Likewise, it would be reasonable for a user
to enter a real umlaut and find this substitution in the text.
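A folding step along these lines, applied to both the indexed text and
the search request, would let either spelling find the other. This is
only a sketch with a made-up class name, not an existing JSword or
Lucene filter:

import java.text.Normalizer;

public final class UmlautFolder {
    // Fold a string so that an umlaut and its traditional 'e' spelling
    // index identically.
    public static String fold(String s) {
        // Compose first, so a decomposed o + U+0308 becomes U+00F6
        // before the substitution is applied.
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.replace("\u00E4", "ae")
                  .replace("\u00F6", "oe")
                  .replace("\u00FC", "ue")
                  .replace("\u00C4", "Ae")
                  .replace("\u00D6", "Oe")
                  .replace("\u00DC", "Ue")
                  .replace("\u00DF", "ss");
    }

    public static void main(String[] args) {
        // Either spelling of the query matches text stored either way.
        System.out.println(fold("sch\u00F6n").equals(fold("schoen"))); // true
    }
}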
>
>
> In Farsi, due to years of bad computer/software localisation and also
> varying forms of orthography (the hamza can be replaced by a single
> "yeh" or by an alef + yeh sequence, depending which school of thought
> you subscribe to), people use all kinds of forms.
Perhaps this could be part of normalization? I'm not sure that
visually equivalent orthographic forms are valid substitutions, but
people make them all the time. An ideal goal would be that all input
forms are valid, but that during a search, behind the scenes, both the
search request and the text are normalized according to the same
rules. (Actually, we would normalize the text as we invert it into the
index.)
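In code, the principle is simply that one normalize routine sits in
front of both the indexer and the searcher. A toy sketch follows
(hypothetical class, not the Lucene-based index JSword actually uses,
but it shows the shape):

import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NormalizedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // One set of rules, used for both the text and the search request.
    private static String normalize(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFC).toLowerCase();
    }

    // Normalize as the text is inverted into the index.
    public void index(int verseId, String text) {
        for (String token : text.split("\\s+")) {
            if (token.length() == 0) {
                continue;
            }
            String key = normalize(token);
            Set<Integer> hits = postings.get(key);
            if (hits == null) {
                hits = new TreeSet<Integer>();
                postings.put(key, hits);
            }
            hits.add(verseId);
        }
    }

    // Normalize the request with the same rules at search time.
    public Set<Integer> search(String request) {
        Set<Integer> hits = postings.get(normalize(request));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }
}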
What we saw before (and it is still a problem) was that one could type
a Farsi search request and get no hits, then copy and paste the same
word from the displayed text and get hits. This is the problem I'm
talking about. Am I missing something?
>
>
>> Supposedly osis2mod as it is in svn will normalize to NFC. But I am
>> wondering whether it actually is doing that. NFC composes the
>> letters, that is, the letter with the accents is the norm.
>>
>
> Osis2mod does not seem to do this for Farsi - all modules were
> created and recreated at the same time.
I'm wondering whether icu4c has appropriate NFC normalization for Farsi.
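It is easy to test at least the Java side. With java.text.Normalizer
(which implements the same Unicode algorithm that icu4c does), a yeh
followed by a combining hamza above should compose to U+0626 under
NFC. The orthographic alternatives Peter mentions (plain yeh vs. alef
+ yeh), on the other hand, are not canonically equivalent, so NFC
alone will never unify those. A quick check (made-up class name):

import java.text.Normalizer;

public class FarsiNfcCheck {
    public static void main(String[] args) {
        // U+064A ARABIC LETTER YEH followed by U+0654 ARABIC HAMZA ABOVE.
        String decomposed = "\u064A\u0654";

        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        // Expect the pair to compose to U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE.
        System.out.println(Integer.toHexString(nfc.codePointAt(0))); // 626
    }
}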
In Him,
DM