[jsword-devel] Hamza - was strongs search
DM Smith
dmsmith555 at yahoo.com
Sun May 18 14:34:23 MST 2008
On May 18, 2008, at 4:59 PM, Peter von Kaehne wrote:
>> This is not merely an issue with arabic/farsi but with every
>> "accented" language, e.g. French, Hebrew, German, Greek, ...
>
> But it is of varying criticality in the various languages.
I was pointing out not the usefulness of accents or how one would
enter them on the computer, but the byte sequence of the text and of
the search request. These can be different and yet carry exactly the
same meaning.
The problem in BD today is that we look for code point matches: the
user's entry has to match the text's content exactly. With software,
we can minimize the mismatch.
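To make that concrete, here is a rough sketch using plain
java.text.Normalizer (JSword may well route this through ICU4J or
Lucene analyzers instead; the class name and strings are just
illustrations). Two canonically equivalent strings do not compare
equal until both sides go through the same normalization:

import java.text.Normalizer;

public class CodePointMatch {
    public static void main(String[] args) {
        String precomposed = "f\u00FCr";   // "fuer" with U+00FC LATIN SMALL LETTER U WITH DIAERESIS
        String decomposed  = "fu\u0308r";  // same word as u + U+0308 COMBINING DIAERESIS

        // Comparing code points directly fails, even though the two
        // strings are canonically equivalent and look identical on screen.
        System.out.println(precomposed.equals(decomposed)); // false

        // Normalizing both sides to the same form makes them match.
        System.out.println(
            Normalizer.normalize(precomposed, Normalizer.Form.NFC).equals(
            Normalizer.normalize(decomposed, Normalizer.Form.NFC)));  // true
    }
}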
>
>
> In German an umlaut is so much part of the language that no one would
> ever even consider searching without umlaut.
Yes. But the Unicode text could have the umlaut encoded as a single
precomposed code point, which takes 2 bytes in UTF-8. Or it could be
decomposed into the base letter followed by a combining umlaut: 2 code
points, totaling 3 bytes. If it is stored as 3 bytes in the index,
then it takes the same 3 bytes in the search to find it, not 2.
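For example (a throwaway sketch assuming UTF-8 storage; the class name
is made up):

import java.io.UnsupportedEncodingException;

public class UmlautBytes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String composed   = "\u00F6";   // o-umlaut as one precomposed code point (U+00F6)
        String decomposed = "o\u0308";  // o followed by U+0308 COMBINING DIAERESIS

        System.out.println(composed.getBytes("UTF-8").length);   // 2
        System.out.println(decomposed.getBytes("UTF-8").length); // 3
    }
}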
Alternatively, umlauts have traditionally been written by following
the letter that should take the umlaut with an 'e', as in 'oe' for
'ö'. It would be reasonable to allow the user to enter this and find
the word with the umlaut. Likewise, it would be reasonable for a user
to enter a real umlaut and find this substitution in the text.
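A folding step along these lines, applied to both the indexed text and
the search request, would let either spelling find the other. This is
only a sketch with a made-up class name, not an existing JSword or
Lucene filter:

import java.text.Normalizer;

public final class UmlautFolder {
    // Fold a string so that an umlaut and its traditional 'e' spelling
    // index identically.
    public static String fold(String s) {
        // Compose first, so a decomposed o + U+0308 becomes U+00F6
        // before the substitution is applied.
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.replace("\u00E4", "ae")
                  .replace("\u00F6", "oe")
                  .replace("\u00FC", "ue")
                  .replace("\u00C4", "Ae")
                  .replace("\u00D6", "Oe")
                  .replace("\u00DC", "Ue")
                  .replace("\u00DF", "ss");
    }

    public static void main(String[] args) {
        // Either spelling of the query matches text stored either way.
        System.out.println(fold("sch\u00F6n").equals(fold("schoen"))); // true
    }
}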
>
>
> In Farsi, due to years of bad computer/software localisation and also
> varying forms of orthography (the hamza can be replaced by a single
> "yeh" or by an alef + yeh sequence, depending which school of thought
> you subscribe to), people use all kinds of forms.
Perhaps this could be part of normalization? I'm not sure that
visually equivalent orthographic forms are valid substitutions, but
people make them all the time. An ideal goal would be that all input
forms are valid, but that during a search, behind the scenes, both the
search request and the text are normalized according to the same
rules. (Actually, we would normalize the text as we invert it into the
index.)
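In code, the principle is simply that one normalize routine sits in
front of both the indexer and the searcher. A toy sketch follows
(hypothetical class, not the Lucene-based index JSword actually uses,
but it shows the shape):

import java.text.Normalizer;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class NormalizedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // One set of rules, used for both the text and the search request.
    private static String normalize(String term) {
        return Normalizer.normalize(term, Normalizer.Form.NFC).toLowerCase();
    }

    // Normalize as the text is inverted into the index.
    public void index(int verseId, String text) {
        for (String token : text.split("\\s+")) {
            if (token.length() == 0) {
                continue;
            }
            String key = normalize(token);
            Set<Integer> hits = postings.get(key);
            if (hits == null) {
                hits = new TreeSet<Integer>();
                postings.put(key, hits);
            }
            hits.add(verseId);
        }
    }

    // Normalize the request with the same rules at search time.
    public Set<Integer> search(String request) {
        Set<Integer> hits = postings.get(normalize(request));
        return hits == null ? Collections.<Integer>emptySet() : hits;
    }
}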
What we saw before (and it is still a problem) was that one could type
a Farsi search request and get no hits, then copy and paste the same
word from the displayed text and get hits. This is the problem I'm
talking about. Am I missing something?
>
>
>> Supposedly osis2mod as it is in svn will normalize to NFC. But I am
>> wondering whether it actually is doing that. NFC composes the
>> letters, that is, the letter with the accents is the norm.
>>
>
> Osis2mod does not seem to do this for Farsi - all modules were
> created and recreated at the same time.
I'm wondering whether icu4c has appropriate NFC normalization for Farsi.
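It is easy to test at least the Java side. With java.text.Normalizer
(which implements the same Unicode algorithm that icu4c does), a yeh
followed by a combining hamza above should compose to U+0626 under
NFC. The orthographic alternatives Peter mentions (plain yeh vs. alef
+ yeh), on the other hand, are not canonically equivalent, so NFC
alone will never unify those. A quick check (made-up class name):

import java.text.Normalizer;

public class FarsiNfcCheck {
    public static void main(String[] args) {
        // U+064A ARABIC LETTER YEH followed by U+0654 ARABIC HAMZA ABOVE.
        String decomposed = "\u064A\u0654";

        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        // Expect the pair to compose to U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE.
        System.out.println(Integer.toHexString(nfc.codePointAt(0))); // 626
    }
}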
In Him,
DM