[osis-core] Linguistic Annotation Module Design Document -- linguistic issues
Chris Little
osis-core@bibletechnologieswg.org
Fri, 07 Nov 2003 15:50:44 -0600
Kirk,
To some degree, some of these issues will be answered along with the
question from my previous reply regarding whether the LA module should
be able to handle all languages (needing just a language declaration
document).
The elements listed include just <w> and <morpheme>, with the only
change to <w> being the inclusion of <morpheme>. I would further
suggest adding most of the attributes assigned to <morpheme> to <w>.
Many existing texts only describe features down to the word level. If
we encountered a word like "walked" and lacked the data to identify
features of specific morphemes, we might still wish to add parsing info,
such as <w tense="past">walked</w>. All of CrossWire's Greek texts
marked with morphological data are in this position: they could not use
the LA module unless data like this were allowed at the word level. We
would otherwise be forced to put a <morpheme> element inside every <w>
element purely to hang attributes on, even though the contents of the
<morpheme> element would not actually be morphemes.
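A minimal sketch of what a consumer would have to do under each encoding, assuming the word-level "tense" attribute proposed above (it is not part of any published schema) and a hypothetical helper named word_features. With word-level attributes the features are read directly; with the wrapper workaround the reader has to guess that a lone <morpheme> is only an attribute hanger.

```python
import xml.etree.ElementTree as ET

# Hypothetical encodings of the same word. The "tense" attribute on <w>
# is the proposal from the text, not an existing schema feature.
WORD_LEVEL = '<w tense="past">walked</w>'
WRAPPED = '<w><morpheme tense="past">walked</morpheme></w>'


def word_features(xml_text):
    """Return (surface text, features) for a single <w> element."""
    w = ET.fromstring(xml_text)
    features = dict(w.attrib)
    morphemes = w.findall("morpheme")
    # If <w> carries no attributes and holds exactly one <morpheme>,
    # treat that <morpheme> as a mere attribute hanger for the word.
    if not features and len(morphemes) == 1:
        features = dict(morphemes[0].attrib)
    return "".join(w.itertext()), features


print(word_features(WORD_LEVEL))
print(word_features(WRAPPED))
```

Both calls yield the same word-level reading, but only the first encoding says so explicitly.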
If a <morpheme> is marked with a number attribute, there are two
different ways I can think of that it could be interpreted. It might
seem that they should be obvious from context, but I still think a
method of disambiguation would be valuable. The number attribute could
indicate either feature embodiment/assignment or agreement. E.g., the
sentence "He walks." would probably be marked as

  <p><w><morpheme number="singular">He</morpheme></w>
  <w><morpheme>walk</morpheme><morpheme number="singular">s</morpheme></w></p>

where number on "He" assigns a feature to the pronoun itself, while
number on "s" records agreement with the subject.
This seems to raise a number of issues. Since this verb happens to be
intransitive and English only has subject person-number agreement, it's
obvious what 's' agrees with in number. Plenty of languages would need
a facility for distinguishing between subject and object agreement. Is
"subject agreement" a possible value of the "pos" attribute? (I'm
generally unclear on the function of "pos" on a morpheme, since part of
speech is a feature of words in every grammatical framework with which I
am familiar, and most deny you the right to look back into a word,
separate the affixes, and identify them with parts of speech.)
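One possible way to disambiguate, sketched under a loud assumption: the "agr" attribute below is invented for illustration and does not appear in the draft. A morpheme with number but no agr is read as carrying the feature itself; one with agr is read as agreeing with the named argument.

```python
import xml.etree.ElementTree as ET

# "agr" is a hypothetical attribute invented for this sketch; the draft
# under discussion defines nothing like it.
SENTENCE = (
    '<p><w><morpheme number="singular">He</morpheme></w> '
    '<w><morpheme>walk</morpheme>'
    '<morpheme number="singular" agr="subject">s</morpheme></w></p>'
)


def number_readings(xml_text):
    """List (text, number, interpretation) for every number-marked morpheme."""
    readings = []
    for m in ET.fromstring(xml_text).iter("morpheme"):
        if "number" not in m.attrib:
            continue
        target = m.get("agr")
        kind = "agreement with " + target if target else "inherent"
        readings.append((m.text, m.get("number"), kind))
    return readings


for reading in number_readings(SENTENCE):
    print(reading)
```

A language with object agreement could then use agr="object" on the relevant affix without overloading "pos".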
Regarding non-linear affixation, I would suggest providing a facility
like we have for quotation in the core schema: allow a splitID on
<morpheme> and allow recursive embedding of <morpheme>. For example, in
German, you've got singular Apfel 'apple', plural Äpfel. Pluralization
occurs by non-linear affixation, namely umlaut, identified graphically
by the diaeresis. I would encode this as (roughly)

  <w><morpheme>A<morpheme number="plural">¨</morpheme>pfel</morpheme></w>

I don't have any idea how you would mark morphemes that are not
graphically represented, such as the intonational difference that
derives the noun -'produce- from the verb -pro'duce-; those might just
have to be assumed to be suppletive.
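To illustrate that the nested encoding loses nothing, here is a rough sketch that recovers both the base form and the surface form from it. It assumes the umlaut is written as the spacing diaeresis (U+00A8) as in the example above, and that a processor would map it to the combining diaeresis (U+0308) before Unicode composition; the function names are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The nested encoding from the text; \u00a8 is the spacing diaeresis.
ENCODED = (
    '<w><morpheme>A<morpheme number="plural">\u00a8</morpheme>'
    'pfel</morpheme></w>'
)


def base_form(xml_text):
    """Text of the outer morpheme with embedded affix morphemes removed."""
    outer = ET.fromstring(xml_text).find("morpheme")
    parts = [outer.text or ""]
    for inner in outer:
        parts.append(inner.tail or "")
    return "".join(parts)


def surface_form(xml_text):
    """Full text, with the diaeresis composed onto the preceding vowel."""
    text = "".join(ET.fromstring(xml_text).itertext())
    # Assumption: swap the spacing diaeresis for the combining one so
    # that NFC normalization can compose it with the vowel before it.
    text = text.replace("\u00a8", "\u0308")
    return unicodedata.normalize("NFC", text)


print(base_form(ENCODED), surface_form(ENCODED))
```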
There are a number of Hebrew-specific attributes, which seem to be all
of those marked by a star. I think (and I assume everyone would agree
with me on this--and hope everyone can be convinced of this, if not)
that a person doing linguistic annotation of a text should have the
ability to use the terms that are standard to work in that language.
E.g., if I'm working on German, I would want to be able to mark noun
genders and in Hebrew I would want to be able to identify stem types.
That said, I think the Hebrew linguistic vocabulary might be so
distinctive as to deserve being removed to another module (one derived
from the LA module). Alternatively, maybe they could be indicated by a
prefix like "heStem" instead of simply "stem", if Hebrew is deemed to be
too central to OSIS LA to be removed like that. (Side note: isn't there
a histpael stem? I seem to remember losing some points on a quiz for
marking a verb as hitpael--stupid me.)
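A small sketch of how language-specific vocabularies might be kept apart, assuming the prefixed-name idea above. The table, the "heStem" attribute name, and the function are hypothetical, and the stem list is illustrative rather than complete.

```python
# Hypothetical per-language vocabularies. Attribute names such as
# "heStem" follow the prefix idea in the text and come from no schema.
VOCABULARIES = {
    "German": {"gender": {"masculine", "feminine", "neuter"}},
    "Hebrew": {"heStem": {"qal", "niphal", "piel", "pual",
                          "hiphil", "hophal", "hitpael"}},
}


def invalid_attributes(language, attributes):
    """Return the (name, value) pairs the language's vocabulary rejects."""
    vocab = VOCABULARIES.get(language, {})
    return sorted(
        (name, value)
        for name, value in attributes.items()
        if name not in vocab or value not in vocab[name]
    )


print(invalid_attributes("German", {"gender": "neuter"}))
print(invalid_attributes("German", {"heStem": "qal"}))
```

Whether such tables live in a derived module or in a language declaration document is exactly the open design question.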
I would recommend that "kqtype" be removed from the LA module
entirely, since it's not linguistic in nature. We should probably add
<seg type="ketiv"> and <seg type="qere"> to the next release of OSIS
Core--or else a more permanent solution.
More generally.... Verbs typically can have (in addition to tense,
which is already accounted for): aspect, voice, mood, & modality. These
all probably deserve attributes on the morpheme. Case is also a notable
omission from the attributes that would apply to nouns, and I would
further suggest adding semanticRole or something equivalent. I strongly
recommend a gloss attribute as well on the morpheme. If anyone uses the
LA module to generate an interlinear, it will be necessary. Gender,
cross-linguistically, does not have a range of values that can be
enumerated. Masculine/feminine/neuter are good standard values for many
languages, but, e.g., Dyirbal would use numerals 1-4 (men, women, edible
plants, & other), and Korean might use myriad values like "paper",
"stick", "color", etc.
Inflectional morphology is pretty well handled, but most derivational
morphology isn't, in the proposed system. There's no means for
signaling that a morpheme derives a causative, applicative, passive,
antipassive, reflexive, reciprocal, etc. form of a verb. Nor is there a
means for indicating nominalizers, adjectivizers, etc. (That is, unless
this is the function of the pos attribute, or something else that I'm
not noticing.)
I think we should also adopt a set of values (like 'past', 'preterite',
'noun', 'subjunctive', 'passive', etc.) defined by some outside
linguistic authority. I checked EAGLES for a list, but theirs is both
very incomplete and frequently mis-classifies attributes in a way that
suggests to me that they shouldn't be trusted as an authority. Perhaps
the LSA or some international body has compiled a usable authority list.
That's all for now, at least.
--Chris
Kirk Lowery wrote:
> Friends,
>
> For your amusement -- but more especially for your expert comment -- I
> attach a first draft of a schema design document for OSIS linguistic
> annotation; more precisely, for morphologic annotation. We'll get to
> syntactic annotation after this. This is the concrete outcome of the
> intensive three days of face to face work Steve and I did last week.
>
> --
> Kirk E. Lowery, Ph.D.
> Director, Westminster Hebrew Institute
> Adjunct Professor of Old Testament
> Westminster Theological Seminary, Philadelphia
>
> Theory is when you know everything and nothing works.
> Practice is when everything works and nobody knows why.
> Here, theory and practice are united:
> nothing works and nobody knows why!