[sword-devel] Unicode support in lexicons

Wed Jun 22 15:54:35 MST 2005

FWIW, here's my opinion on what the standards should be. I definitely
agree that there should be standards.

On 22/06/05, Joachim Ansorg <nospam+sword-devel at joachim-ansorg.de> wrote:
> Hi,
> I'm struggling with the Unicode stuff of lexicons and lexicons in general.
>
> Currently a frontend doesn't know whether to expect keys as UTF-8 or as
> something else, because there's no standard defined. The same is true of
> GenBooks.

It seems reasonable to me that all text, keys, everything in all types
of modules should be in UTF-8.
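
As a rough sketch of what enforcing that could look like at module
creation time: the check below only tests whether a raw key decodes as
UTF-8. The function name is made up for illustration and is not part of
the SWORD API.

```python
def is_valid_utf8(raw_key: bytes) -> bool:
    """Return True if the raw key bytes are well-formed UTF-8."""
    try:
        raw_key.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return False
    return True


# "\u00c4pfel" is "Apfel" with an a-umlaut capital as its first letter.
utf8_key = "\u00c4pfel".encode("utf-8")
latin1_key = "\u00c4pfel".encode("latin-1")  # legacy single-byte encoding
```

A key stored in a legacy 8-bit encoding such as Latin-1 usually fails
this check as soon as it contains a non-ASCII character, so a module
creation tool could reject it or convert it before writing the module.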

> Secondly, the sort order is not valid for Unicode if Unicode characters are
> used in the entry names.
> That way Unicode strings like the German "a umlaut" appear at the end, but
> they should be among the first entries of the list. Sorting in the frontend
> moves the lexicon intro somewhere into the middle of the list and is
> slow(er).

Unicode defines collation (sorting) in the Unicode Collation Algorithm:
http://www.unicode.org/reports/tr10/

The module creation app should sort the entries using something that
implements that algorithm. ICU should do the job, and it doesn't have to
be linked into the runtime lib for this; it only needs to be linked into
the module creation app. Collation is language specific, so it should
get German right.

I think Perl and Python should also be able to do collation, so they
are another option.
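
To illustrate the difference between code point order and a
collation-aware order, here is a small Python sketch using only the
standard library. It is NOT the Unicode Collation Algorithm: it just
strips combining marks to approximate a primary-level comparison, and
the function name is made up for the example. A real module creation
tool would more likely use ICU, for instance through the PyICU bindings
(Collator.createInstance with a German locale and getSortKey as the
sort key).

```python
import unicodedata


def approx_collation_key(entry: str):
    """Rough stand-in for a real collation sort key (not the full UCA)."""
    decomposed = unicodedata.normalize("NFD", entry)
    # Primary level: base letters only, case-folded, combining marks dropped.
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    # Tie-breaker: the decomposed form, so "Apfel" sorts before its umlaut twin.
    return (primary, decomposed)


# \u00c4 is A-umlaut, \u00d6 is O-umlaut (precomposed forms).
entries = ["Zebra", "\u00c4pfel", "Apfel", "\u00d6l", "Ohr"]

by_codepoint = sorted(entries)                        # umlauts land at the end
by_approx = sorted(entries, key=approx_collation_key)  # umlauts next to base letter
```

With plain code point sorting the umlaut entries end up after "Zebra",
which is the problem described above. With the approximate key they sort
next to their base letters. Since the sorting happens once at module
creation time, the frontend can use the stored order as-is.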

> Thirdly, the lexicon intro is a hack, it uses a lot of prepended spaces to be
> in the first place of the list.
> We need to find a better solution for that.

Agreed (sorry, I don't have one offhand)

> I'm missing defined standards for the API and the modules. That would make
> frontend development a lot easier.

Agreed,
Daniel