[sword-devel] seeking consensus on OSIS lemma best practice

Fri Oct 12 23:43:20 MST 2012

On 10/12/2012 1:40 PM, Daniel Owens wrote:
> The markup would look like this:
>
> Hebrew (from Deuteronomy): <w lemma="whmlemma:Hאבד"
> morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>
>
> Aramaic (from Jeremiah): <w lemma="whmlemma:Aאבד"
> morph="whmmorph:some_value">יֵאבַ֧דוּ</w>
>
> The main problem I see is that other front-ends may not follow the
> process of looking for G or H and then stripping the character before
> looking up the entry.
>
> Could we come to a consensus on this?

I would recommend taking a look at the markup used in the MorphGNT 
module, which employs real lemmata in addition to lemmata coded as 
Strong's numbers:

<w morph="robinson:N-NSF" lemma="lemma.Strong:βίβλος strong:G0976">Βίβλος</w>

You should begin the workID for real lemmata with "lemma.", and follow 
this with some identifier indicating the lemmatization scheme. We have 
some code in Sword that looks for "lemma." and will treat the value as a 
real word rather than a Strong's number or something else. I think OSIS 
validation may complain about the workIDs of the form "lemma.system", 
but that's a schema bug and you should ignore it.
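
To illustrate the convention, here is a minimal Python sketch of how a 
front-end might split a lemma attribute and flag real lemmata. This is 
not Sword's actual code; the function name parse_lemma_attr is 
hypothetical, and I am assuming a case-sensitive match on the "lemma." 
prefix and whitespace-separated values within the attribute.

```python
def parse_lemma_attr(value):
    """Split an OSIS lemma attribute into (work_id, lemma, is_real_word) tuples."""
    entries = []
    # Multiple lemmata in one attribute are separated by whitespace.
    for token in value.split():
        work_id, sep, lemma = token.partition(":")
        if not sep:
            # No workID prefix at all: keep the whole token as the value.
            work_id, lemma = "", token
        # Assumption: a workID beginning with "lemma." marks a real word
        # rather than a Strong's number or some other code.
        entries.append((work_id, lemma, work_id.startswith("lemma.")))
    return entries


parsed = parse_lemma_attr("lemma.Strong:βίβλος strong:G0976")
```

With the MorphGNT value above, the first entry is flagged as a real word 
and the second (the Strong's number) is not.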

As for the value of the lemma itself ([HA]אבד in your example above), 
you choose the form specified in the system you are employing. So, if 
MORPH employs its own lemmatization system and that takes the form 
@<word> for Hebrew and %<word> for Aramaic, then use those forms, e.g.:

<w lemma="lemma.whm:@אבד" morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>

The alternative is to distinguish the languages via the workID:

<w lemma="lemma.whm.he:אבד" morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>

If you aren't creating a lexical resource that indexes based on @- and 
%- prefixed lemmata, then I don't see how the former option is useful 
and would recommend the latter. The latter option will allow lookups in 
word-indexed lexica.
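
A small Python sketch of why the two options above behave differently 
against a lexicon keyed by bare headwords. The lexicon contents and the 
lookup function are hypothetical placeholders, and I am assuming the 
front-end uses the value after the workID as the key without any 
further processing.

```python
# Hypothetical word-indexed lexicon: keys are bare headwords.
lexicon = {"אבד": "placeholder entry for אבד"}


def lookup(lemma_value, lexicon):
    """Look up the part after the workID, exactly as written in the markup."""
    _work_id, _sep, key = lemma_value.partition(":")
    return lexicon.get(key)


# Language carried in the workID: the value is the bare word, so it matches.
by_work_id = lookup("lemma.whm.he:אבד", lexicon)

# Language carried as a prefix on the value: "@אבד" is not a key, so the
# lookup fails unless the front-end knows to strip the "@".
by_prefix = lookup("lemma.whm:@אבד", lexicon)
```

Here by_work_id finds the entry and by_prefix is None, which is the 
same front-end stripping problem Daniel raised.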

--Chris