[sword-devel] language/locale codes

Wed Nov 11 20:05:35 MST 2009

DM Smith wrote:
> Chris, Thanks very much for this.
> 
> I'm wondering about a few things I have seen. In some languages there
> are ASCII equivalents for accented forms. I'm thinking we probably
> shouldn't use the ASCII forms. Example: Bokmål is sometimes Bokmaal
> and even Bokmal (The SIL files consistently use the accented form)

My plan is to generate data using the versions of characters with
diacritics. So long as there remain problems with particular
implementations, we can do substitutions at a later stage in the script
chain. As I recall, Xiphos' data had a few small corrections like this,
so for them we can do a quick s/Bokmål/Bokmaal/ in the script to
generate in their format.
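
A minimal sketch of such a late-stage substitution, written in Python rather than in the actual script chain. The override table and the function name are hypothetical; only the Bokmål/Bokmaal pair comes from this thread.

```python
# Hypothetical per-front-end override table. Only the Bokmål entry is taken
# from the discussion; further entries would be added as needed.
ASCII_OVERRIDES = {
    "Bokmål": "Bokmaal",
}


def apply_overrides(name, overrides=ASCII_OVERRIDES):
    """Replace each known accented form in `name` with its ASCII spelling."""
    for accented, plain in overrides.items():
        name = name.replace(accented, plain)
    return name


print(apply_overrides("Norwegian Bokmål"))
```

Keeping the substitutions in one table at the end of the chain means the generated master data stays in the accented form and only the output for a particular front end is altered.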

Later on, when we really want to put some code in the engine, we'll
need to work out why this problem occurs in various implementations and
fix it properly.

> The iso-639-3_Name_Index.tab gives an inverted form. If localized.txt
> is giving the English form for a particular code and not the native,
> I think, from an English speaker's perspective, these would be a
> preferred form to use in localized.txt as alphabetizing the names
> will put language families together. For example, we have quite a few
> Zapotec modules with different language codes, whose language is of
> the form XYZ Zapotec. The inverted form is Zapotec, XYZ.

Yes, I really liked your idea of using the inverted form, so I plan to
use that. You probably made this comment based on the big long version
of localized.txt that came (almost directly) from Xiphos' code.
Yesterday I eliminated over 95% of the content from that file by
removing those entries that were identical to the ISO standard name.
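
The pruning step can be sketched roughly as follows. This is an illustration in Python, not the script that was actually run; the function name and the sample entries are made up, and both inputs are assumed to be simple code-to-name mappings.

```python
def prune_identical(localized, iso_names):
    """Keep only the entries whose name differs from the ISO name for that code."""
    return {
        code: name
        for code, name in localized.items()
        if iso_names.get(code) != name
    }


# Illustrative data only, not the real file contents.
iso_names = {"de": "German", "fr": "French", "nb": "Norwegian Bokmål"}
localized = {"de": "Deutsch", "fr": "French", "nb": "Norwegian Bokmål"}

print(prune_identical(localized, iso_names))
```

Entries for codes that are absent from the ISO table are kept, since there is no standard name for them to duplicate.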

The unique data from localized.txt is now at
http://www.crosswire.org/wiki/Localized_Language_Names, and I'll add
some more code to my scripts to grab this data from the Wiki so that it
can do automated updates, as with all of the other data.

> In comparing names between these files and the
> locales.d/xxx-utf8.conf, I think there may be some corruption in the
> locales.d utf8 confs. For example, nb-utf8.conf has [Meta] Name=nb 
> Description=Bokm√•l (Unicode) (I noticed yesterday, while working on
> something else, that perl may write UTF8 in this form when the
> "<:utf8", or its equivalent, is not used on creating a file handle
> for write.)
> 
> The reason I'm looking at Bokmål is that I am fixing a problem in
> JSword regarding it being wrongly encoded as Bokm√•l.

I just checked nb-utf8.conf and can't see any problem. We don't use the 
BOM in locale .confs, so my guess is that something is misinterpreting 
the encoding as something other than UTF-8.

I'm using BabelPad to check the encoding and get the same (correct) text 
whether I let it autodetect the encoding or specify to open as UTF-8.
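
One way to reproduce the garbled form quoted above is to decode the UTF-8 bytes of the name as Mac OS Roman. Whether that is what actually happened in any particular tool is only an assumption; the Python sketch below just shows that a correctly encoded UTF-8 file can display this way when a reader guesses the wrong encoding.

```python
original = "Bokmål"
utf8_bytes = original.encode("utf-8")  # the "å" becomes the two bytes C3 A5

# Decoding those bytes with the wrong codec turns each byte of the "å"
# into a separate, unrelated character.
garbled = utf8_bytes.decode("mac_roman")
print(garbled)

# Decoding with the right codec round-trips cleanly, so the bytes
# themselves are not corrupt.
assert utf8_bytes.decode("utf-8") == original
```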

> My preference is to use a hierarchical approach. For example, when
> looking up a code in a given locale xx-YY, first look for it in a
> file localized-xx-YY.txt, where xx is the language code and YY is the
> country code. If the file does not exist or the code is not in that
> file, look for it in localized-xx.txt. Failing that look in
> localized.txt. Failing that do something graceful.
> 
> This is how JSword does it using Java's built in localization
> mechanism. For performance, the locale specific files are pruned to
> the set of codes that are in SWORD modules. The default file has all
> the languages. That way, if a new language code is used and it is not
> in the localized file, but is in the default, we don't have to hurry
> a release.
> 
> I'm glad to have the native form in the base/default file.

I think the hierarchical approach is good. At the moment, all of the 
ISO 639-1, -2, and -5 data, the ISO 15924 data, and the ISO 3166-1 data 
are available in French; the names in localized.txt are localized to 
themselves (each in its own language); and everything is available in 
English.
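
The lookup order described in the quoted proposal can be sketched as below. This is a Python illustration under stated assumptions, not SWORD or JSword code: the tables are in-memory stand-ins for localized-xx-YY.txt, localized-xx.txt and localized.txt, the sample entries are invented, and returning the bare code is just one possible reading of "do something graceful".

```python
def lookup_language_name(code, locale, tables):
    """Look up `code` for `locale` (e.g. "fr-CA"), most specific table first.

    `tables` maps a table key ("fr-CA", "fr", or "" for the default
    localized.txt) to a dict of language code -> name. A missing table and
    a missing code are handled the same way: fall through to the next level.
    """
    language = locale.split("-")[0]
    for key in (locale, language, ""):
        name = tables.get(key, {}).get(code)
        if name is not None:
            return name
    return code  # assumed graceful fallback: show the bare code


# Hypothetical sample data; there is deliberately no "fr-CA" table.
tables = {
    "fr": {"de": "allemand"},
    "": {"de": "Deutsch", "nb": "Bokmål"},
}

print(lookup_language_name("de", "fr-CA", tables))
print(lookup_language_name("nb", "fr-CA", tables))
print(lookup_language_name("zap", "fr-CA", tables))
```

Because the default table is consulted last, a code that is missing from a pruned locale-specific table still resolves to a name without a new release of that table.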

It probably makes sense to store localized locale names within our 
existing locale files. But for our current data set, that just means 
adding a bunch of data to the fr.conf file, since English data isn't 
stored in a .conf and the content of localized.txt doesn't correspond to 
a single locale.

> I think Troy was wanting 1.6.1 to be built from trunk and for it to
> be ABI compatible with 1.6.0.
> 
> In SVN, branching is cheap. The difficulty is doing the merge. You
> could create a branch for this new feature and work ahead of 1.6.1.
> Then when 1.6.1 is released, merge your work into trunk.

If we want to start a non-ABI compatible branch, I'd certainly be happy 
to use it, but I don't even have the data all sorted out yet, so I'm not 
too concerned with committing code yet.

> On another note, do you envision having a distribution mechanism for
> these files apart from front-end releases, such as putting them in a
> known place for download?

That would be great. Putting the data in a stable location is certainly 
something we can do with ease. Dealing with updates isn't something I've 
considered. Perhaps that should be left up to individual front ends. 
Perhaps we can piggyback off the installmgr facilities.

--Chris