[sword-devel] Normalization?

DM Smith dmsmith at crosswire.org
Wed Aug 31 04:58:27 MST 2011

On Aug 31, 2011, at 4:01 AM, David Haslam wrote:

> Thanks for detailed comments on rendering.
> Are there any implications for the search feature of SWORD/JSword when using
> combining characters?

The simple rule is that if a search request and the indexed text are not normalized to the same form, there will be no hit.
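To make that concrete, here is a minimal sketch in Python. It uses the standard `unicodedata` module as a stand-in for ICU, and the helper name `same_form_match` is made up for illustration; it is not how SWORD or JSword compare strings today.

```python
import unicodedata

# The same visible character, encoded two ways.
composed = "\u00e9"      # e-acute as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def same_form_match(query: str, indexed: str, form: str = "NFC") -> bool:
    """Hypothetical helper: compare only after normalizing both sides to one form."""
    return unicodedata.normalize(form, query) == unicodedata.normalize(form, indexed)

# A raw comparison sees two different code point sequences, so there is no hit.
raw_match = composed == decomposed

# Normalizing both sides to the same form makes them compare equal.
normalized_match = same_form_match(composed, decomposed)
```

Which form is used matters less than using the same one on both the index and the request.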

Today, our frontends do not normalize the text into a particular normalization form when building the search index, nor do they normalize the search request. They leave it up to the module builder and the end user to agree by accident, which works really well for English but fails miserably with decorated characters.

It'd be best for SWORD/JSword to use ICU to normalize to a known form for search. Note that this could be NFKD, with the decorations then stripped. Since it would be an internal form, it doesn't matter that it would look ugly to the end user.
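A rough sketch of such an internal search form, again in Python with `unicodedata` rather than ICU. The function name `search_key` is hypothetical, and treating "decorations" as nonspacing marks (general category Mn) is my assumption; a real implementation would need to decide per script which marks are safe to drop.

```python
import unicodedata

def search_key(text: str) -> str:
    """Hypothetical internal form: NFKD, then drop nonspacing marks.

    Assumption: "decorations" means characters in general category Mn.
    The result is meant for matching only, never for display.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Applied to both the indexed text and the search request, this makes "é" (precomposed) and "e" plus a combining accent both reduce to a plain "e", and the compatibility decomposition also expands things like the "ﬁ" ligature into "fi".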

Regarding rendering, no frontend should assume that the module is encoded in a way that works for it. When we did experiments, NFC was the best across the widest variety of frontends. But no one way was best for every script, font or display engine. It'd be best for each frontend to normalize the text before display. This probably would be different from the normalization for search.
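For display, the idea is something like the following sketch, where the form is a per-frontend choice. The name `for_display` and the NFC default are assumptions for illustration (NFC only because it did best in our experiments), and Python's `unicodedata` again stands in for ICU.

```python
import unicodedata

def for_display(text: str, form: str = "NFC") -> str:
    """Hypothetical display hook: normalize module text to the form this frontend renders best.

    Marks are kept, unlike in a stripped search form; only the encoding
    of the characters changes.
    """
    return unicodedata.normalize(form, text)

# Module text that arrives decomposed is composed before rendering.
shown = for_display("e\u0301")
```

A frontend whose font or display engine copes better with decomposed text could pass "NFD" instead.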

In Him,

> David
> --
> View this message in context: http://sword-dev.350566.n4.nabble.com/Normalization-tp3779484p3780433.html
