[osis-core] Schema: type on language
Chris Little
osis-core@bibletechnologieswg.org
Sun, 19 Oct 2003 09:58:32 -0700 (MST)
Todd,
On Sun, 19 Oct 2003, Todd Tillinghast wrote:
> Chris,
>
> Are you saying that you will not be able to sort out which of the many
> forms allowed in IETF/xml:lang has been stated and that you would like
> to use <language type="...">language code</language> to help sort out
> which case has been encoded, but that the values for <language> and
> xml:lang would be identical?
>
> That seems reasonable.
Almost. I'm saying it would be reasonable for an organization like SIL to
encode:
<language use="base" type="IETF">sq</language>
<language use="base" type="SIL">ALS</language>
That is, they should be able to identify the language according to a
common form, to be used by all documents & organizations, identical to the
form used for xml:lang (the IETF form). But they should also be able to
use a form of their own for in-house categorization.
Using values like "x-ISO-639-1-sq" might be valid, but to be of any use,
it would have to be parsed as a string and cut into chunks. I say, why
not just use type and be more explicit.
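To illustrate the parsing burden, here is a minimal Python sketch of what a consumer of the "x-" forms would need. The helper name split_private_tag and the KNOWN_SCHEMES list are hypothetical, taken only from the example values in this thread. Because the scheme names themselves contain hyphens, a plain split on "-" is not enough, and the reader needs a list of known prefixes:

```python
# Hypothetical helper: split an "x-" tag such as "x-ISO-639-1-sq" into
# (scheme, code). The scheme names are assumptions drawn from the examples
# in this thread, not from any published registry.
KNOWN_SCHEMES = ("ISO-639-2-T", "ISO-639-2-B", "ISO-639-1", "SIL")


def split_private_tag(tag):
    """Return (scheme, code) for an x- tag, or ("IETF", tag) otherwise."""
    if not tag.lower().startswith("x-"):
        # Not a private-use tag: treat the whole value as the plain IETF form.
        return ("IETF", tag)
    rest = tag[2:]
    for scheme in KNOWN_SCHEMES:
        prefix = scheme + "-"
        if rest.startswith(prefix):
            return (scheme, rest[len(prefix):])
    # An x- tag whose scheme is not in the list cannot be interpreted.
    return ("unknown", rest)


if __name__ == "__main__":
    for value in ("sq", "x-ISO-639-1-sq", "x-ISO-639-2-T-sqi", "x-SIL-ALS"):
        print(value, "->", split_private_tag(value))
```

With an explicit type attribute, none of this string handling is needed: the scheme is read from the attribute and the code from the element content.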
> It also seems unfortunate that the XML/ISO standards bodies have made it
> difficult for it to be obvious which standard is being used. (I am sure
> with an enumeration of all possible values you can derive which standard
> a value comes from.)
The only real ambiguity comes with discerning between ISO 639-2/T and /B.
Besides that, 2-letter elements are ISO 639-1, 3-letter are one of the -2
standards, those starting with i- are IANA, and everything starting with
x- is officially unknown.
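The rules of thumb above can be sketched in Python roughly as follows. The function name guess_code_standard and the returned labels are hypothetical, and looking only at the first hyphen-separated part of the value is an added assumption, not something stated in this thread:

```python
def guess_code_standard(code):
    """Guess which standard a language code comes from, by its shape only."""
    lowered = code.lower()
    if lowered.startswith("i-"):
        return "IANA"
    if lowered.startswith("x-"):
        return "private (x-)"
    # Assumption: only the part before the first hyphen is inspected.
    primary = lowered.split("-")[0]
    if len(primary) == 2 and primary.isalpha():
        return "ISO 639-1"
    if len(primary) == 3 and primary.isalpha():
        # The shape alone cannot distinguish ISO 639-2/T from ISO 639-2/B.
        return "ISO 639-2 (T or B, ambiguous)"
    return "unrecognized"


if __name__ == "__main__":
    for value in ("sq", "sqi", "alb", "i-klingon", "x-SIL-ALN"):
        print(value, "->", guess_code_standard(value))
```

Note that "sqi" (639-2/T) and "alb" (639-2/B) get the same answer, which is exactly the ambiguity described above.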
> I am not sure why you want to add "French", "English", and "native"?
> This would seem to further confuse the situation. Maybe I don't
> understand how you would use them.
My thought was to add it as a convenience to those who might wish to use
it. Rather than forcing lookups from a table that maps codes to language
names, the name would be held in the document. The reason for choosing
English & French is that they are the international languages used by ISO
& SIL for their code databases.
If you think it would be better to leave this out, I'm okay with that.
> Relative to people using codes like "Austronesian (Other)", I think the
> documentation should recommend a "concrete" language for xml:lang and
> that a <language> entry for "Austronesian (Other)" would be fine to use
> within <work> in addition to the "concrete" language code.
I'm in agreement here. I think the value for xml:lang should match that
chosen for the IETF type, and should identify the most specific language
code that makes the encoder happy.
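A processor could check that rule mechanically. The following Python sketch assumes the proposed type attribute on <language> and, for brevity, a sample document without the OSIS namespace; the function name ietf_matches_xml_lang and the sample markup are hypothetical:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified sample: real OSIS documents declare a namespace, omitted here.
SAMPLE = """<osisText xml:lang="sq">
  <header>
    <work>
      <language type="IETF">sq</language>
      <language type="SIL">ALS</language>
    </work>
  </header>
</osisText>"""


def ietf_matches_xml_lang(xml_text):
    """True if the root xml:lang equals some <language type="IETF"> value."""
    root = ET.fromstring(xml_text)
    declared = root.get(XML_LANG)
    ietf_values = [
        el.text.strip()
        for el in root.iter("language")
        if el.get("type") == "IETF" and el.text
    ]
    return declared is not None and declared in ietf_values


if __name__ == "__main__":
    print(ietf_matches_xml_lang(SAMPLE))
```

Entries of other types, such as the SIL code, are ignored by the check, so they can be added freely for in-house use.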
Going back to Albanian... Ethnologue lists 4 dialects of Albanian, all of
which would be identified with ISO 639-1 code 'sq', but different SIL
codes. Dialects of a single language can often have a common written
form. If that is the case with Albanian and I have a Bible in the
common written form, I might (if I were SIL and wanted to identify SIL
codes in my work) encode:
<osisText xml:lang="sq">
...
<language type="IETF">sq</language>
<language type="SIL">AAH</language>
<language type="SIL">AAE</language>
<language type="SIL">ALS</language>
<language type="SIL">ALN</language>
However, if they were not all the same written language and I had a Bible
written specifically in Tosk Albanian, I would encode:
<osisText xml:lang="x-SIL-ALN">
...
<language type="IETF">x-SIL-ALN</language>
<language type="ISO-639-1">sq</language>
Does that seem sensible?
--Chris
>
> Todd
>
> > -----Original Message-----
> > From: osis-core-admin@bibletechnologieswg.org
> > [mailto:osis-core-admin@bibletechnologieswg.org] On Behalf Of
> > Chris Little
> > Sent: Sunday, October 19, 2003 2:25 AM
> > To: osis-core@bibletechnologieswg.org
> > Subject: RE: [osis-core] Schema: type on language
> >
> >
> >
> > Todd,
> >
> > For one, it's questionable whether we can really say any language can
> > be unambiguously identified. But let's suppose we really know what
> > English is and we really know that 'en' identifies it. ISO 639 does a
> > better job of unambiguously identifying some languages than it does for
> > others. There are a bunch of codes that describe groups of languages,
> > such as "North American Indian (Other)" and "Austronesian (Other)".
> >
> > So, it's not quite true that Javanese has no ISO code; it's just a
> > very, very ambiguous code shared with hundreds of other languages.
> > (The code would be 'map' -- "Austronesian (Other)".)
> >
> > I think it is valuable to keep type="...", since some organizations
> > use those codes themselves for various sorting purposes (e.g. the
> > Library of Congress uses ISO 639-2/B and SIL uses Ethnologue codes).
> > If they need to use such data, I think we should provide a place to
> > hold it.
> >
> > But for interoperability, IETF/xml:lang is probably best.
> >
> > What are your thoughts on also adding "English", "French", &
> > "native" to the types enumeration? Is that unnecessary/inappropriate?
> >
> >
> > --Chris
> >
> >
> > On Fri, 17 Oct 2003, Todd Tillinghast wrote:
> >
> > > Chris,
> > >
> > > If there is a way to unambiguously express ALL of the various
> > > language values using xml:lang in an IETF-compliant string then it
> > > would seem to make sense to use that same structure for the value of
> > > <language> and for xml:lang AND not have a type="..." set of
> > > enumerated types.
> > >
> > > Ex:
> > > Javanese for which there is no ISO code:
> > > <osisText xml:lang="x-SIL-JVN">
> > > and
> > > <work>
> > > <language>x-SIL-JVN</language>
> > > </work>
> > >
> > > Albanian:
> > > <osisText xml:lang="sq">
> > > and
> > > <work>
> > > <language>sq</language>
> > > <language>x-ISO-639-1-sq</language>
> > > <language>x-ISO-639-2-T-sqi</language>
> > > <language>x-ISO-639-2-B-alb</language>
> > > <language>x-SIL-ALS</language>
> > > </work>
> > >
> > > This would keep the xml:lang and <language> values consistent. It
> > > would seem that we will have to enumerate the "x-" alternatives for
> > > xml:lang in the documentation so we might as well use the same
> > > structure both places.
> > >
> > > I believe that "x-" is allowed in the w3c's xml.xsd schema so the
> > > above options should work. (Naturally if there is already an
> > > established syntax for ISO values within xml:lang we should use it
> > > rather than my x- values above.)
> >
> >
> >
> >
> > _______________________________________________
> > osis-core mailing list
> > osis-core@bibletechnologieswg.org
> > http://www.bibletechnologieswg.org/mailman/listinfo/osis-core
> >
>