[jsword-devel] Hamza - was strongs search

Sun May 18 13:45:27 MST 2008

On May 18, 2008, at 4:07 PM, Peter von Kaehne wrote:

> Sorry, long story, but, as you will see, related to this thread:
>
> One of my Farsi Bible modules has a problem in BD - it shows boxes
> wherever a hamza diacritic is used. A hamza is a small sign in
> Farsi, Arabic and related scripts, which can be used as an individual
> letter or be added to some letters as a diacritic.
>
> It often means a glottal stop or - and that is the critical use for me
> in Farsi - it is used to indicate a genitive "-ye" when attached to a
> final "h".
>
> As with many diacritics there are two ways of encoding it in Unicode:
> as two code points, an "h" followed by a "hamza", which the font
> rendering engine then draws jointly as an "h" with a hamza above it, or
> as a single code point for an "h with hamza".
>
> Two Farsi modules use the latter option, and BD displays them fine. A
> third module uses the first option and BD produces squares. (GnomeSword
> displays all three fine.)
>
> I was therefore thinking of running a search and replace to replace all
> occurrences of an "h" + "hamza" sequence with a single "h with hamza",
> but then stumbled when I wondered whether this would have implications
> for search. And then I came home from a long weekend and found this
> thread, which mentioned in passing searches with diacritics.
>
> It seems that my options are
>
> 1) to have graphically correct text, but not fully searchable or
> 2) poorly rendered text, but fully and correctly searchable.
>
> Can you confirm whether I understood this correctly?
>
> What are your suggestions?

I suggest that it look good. We can add search later.

Search will be a problem until we do Unicode normalization. That will
require adding icu4j, normalizing the text as we build the index, and
normalizing the search requests. Unless both the search and the index
use exactly the same form, a search won't hit. This is not merely an
issue with Arabic/Farsi but with every "accented" language, e.g. French,
Hebrew, German, Greek, ...
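
To illustrate the point about normalizing both sides, here is a minimal
sketch. It assumes the JDK's own java.text.Normalizer (Java 6 and later)
in place of icu4j, and the class and method names are made up for the
example; they are not JSword code. The sample uses alef with hamza above,
which has a canonical composed form. Whether the particular "h" + "hamza"
sequence in the module composes the same way depends on which code points
it uses, and that is not checked here.

```java
import java.text.Normalizer;

// Hypothetical sketch, not JSword code: normalize both the indexed text
// and the query to NFC before comparing them.
class NfcSearchSketch {
    // Bring text into NFC (composed) form.
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Stand-in for an index lookup: a plain substring test on NFC text.
    static boolean matches(String indexedText, String query) {
        return toNfc(indexedText).contains(toNfc(query));
    }

    public static void main(String[] args) {
        String composed = "\u0623";          // ALEF WITH HAMZA ABOVE, one code point
        String decomposed = "\u0627\u0654";  // ALEF followed by combining HAMZA ABOVE

        // Without normalization the two spellings are different strings.
        System.out.println(composed.equals(decomposed));
        // With both sides normalized, the query finds the text.
        System.out.println(matches(composed, decomposed));
    }
}
```

Run as written, this should print false and then true: the raw strings
differ, but they match once both are in the same form.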

Supposedly osis2mod, as it is in svn, will normalize to NFC, but I
wonder whether it actually does. NFC composes the letters; that is, the
single precomposed letter with its accent is the normal form.
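
One way to check whether a module's text really is in NFC would be
something like the sketch below. It again assumes java.text.Normalizer
from the JDK, and the class and method names are invented for
illustration. It only tests a string that has already been read from the
module; how to get at that text is left out.

```java
import java.text.Normalizer;

// Hypothetical sketch: report whether a piece of text is already in NFC.
class NfcCheckSketch {
    static boolean isNfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(isNfc("\u00e9"));   // precomposed e with acute
        System.out.println(isNfc("e\u0301"));  // e followed by combining acute
    }
}
```

The precomposed letter should pass the check and the letter plus
combining accent should not.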

In Him,
	DM