[sword-devel] seeking consensus on OSIS lemma best practice
Chris Little
chrislit at crosswire.org
Fri Oct 12 23:43:20 MST 2012
On 10/12/2012 1:40 PM, Daniel Owens wrote:
> The markup would look like this:
>
> Hebrew (from Deuteronomy): <w lemma="whmlemma:Hאבד"
> morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>
>
> Aramaic (from Jeremiah): <w lemma="whmlemma:Aאבד"
> morph="whmmorph:some_value">יֵאבַ֧דוּ</w>
>
> The main problem I see is that other front-ends may not follow the
> process of looking for G or H and then stripping the character before
> looking up the entry.
>
> Could we come to a consensus on this?
I would recommend taking a look at the markup used in the MorphGNT
module, which employs real lemmata in addition to lemmata coded as
Strong's numbers:
<w morph="robinson:N-NSF" lemma="lemma.Strong:βίβλος
strong:G0976">Βίβλος</w>
You should begin the workID for real lemmata with "lemma.", and follow
this with some identifier indicating the lemmatization scheme. We have
some code in Sword that looks for "lemma." and will treat the value as a
real word rather than a Strong's number or something else. I think OSIS
validation may complain about the workIDs of the form "lemma.system",
but that's a schema bug and you should ignore it.
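To make the convention concrete, here is a rough Python sketch of how a front-end might split a lemma attribute and apply the "lemma." test. This is not the actual Sword code; the function names split_lemma_attr and is_real_lemma are made up for illustration, and it assumes the attribute is a whitespace-separated list of workID:value tokens, as in the MorphGNT example.

```python
def split_lemma_attr(attr):
    """Split an OSIS lemma attribute into (work_id, value) pairs.

    Assumes whitespace-separated tokens of the form workID:value.
    """
    pairs = []
    for token in attr.split():
        work_id, sep, value = token.partition(":")
        if not sep:
            # No workID prefix: keep the bare value with an empty workID.
            work_id, value = "", token
        pairs.append((work_id, value))
    return pairs


def is_real_lemma(work_id):
    """True if the workID marks a real word rather than a coded key."""
    return work_id.startswith("lemma.")


attr = "lemma.Strong:βίβλος strong:G0976"
for work_id, value in split_lemma_attr(attr):
    kind = "real lemma" if is_real_lemma(work_id) else "coded key"
    print(work_id, value, kind)
```

With the MorphGNT value above, the first token would be treated as a real word and the second as a Strong's number.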
As for the value of the lemma itself ([HA]אבד in your example above),
you choose the form specified in the system you are employing. So, if
MORPH employs its own lemmatization system and that takes the form
@<word> for Hebrew and %<word> for Aramaic, then use those forms, e.g.:
<w lemma="lemma.whm:@אבד" morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>
The alternative is to distinguish the languages via the workID:
<w lemma="lemma.whm.he:אבד" morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>
If you aren't creating a lexical resource that indexes based on @- and
%- prefixed lemmata, then I don't see how the former option is useful
and would recommend the latter. The latter option will allow lookups in
word-indexed lexica.
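The difference between the two options can be sketched in Python as follows. This is illustrative only: lookup_key and PREFIX_LANG are hypothetical names, the @ and % prefixes are the supposed forms from the example above, and the "arc" code for Aramaic is my assumption (only "he" appears in the markup above).

```python
# Hypothetical mapping from a lemma prefix to a language code.
PREFIX_LANG = {"@": "he", "%": "arc"}


def lookup_key(token):
    """Return (language, headword) for one workID:value lemma token."""
    work_id, _, value = token.partition(":")
    parts = work_id.split(".")
    if len(parts) >= 3:
        # Language carried in the workID, e.g. lemma.whm.he:
        # the value is already a plain headword.
        return parts[2], value
    if value[:1] in PREFIX_LANG:
        # Language carried in the value: the prefix must be stripped
        # before the headword can be used in a word-indexed lexicon.
        return PREFIX_LANG[value[0]], value[1:]
    return None, value


print(lookup_key("lemma.whm.he:אבד"))
print(lookup_key("lemma.whm:@אבד"))
```

Both calls yield the same language and headword, but only the workID form gives a value that can be looked up without the front-end knowing about the prefix scheme.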
--Chris