[sword-devel] seeking consensus on OSIS lemma best practice

Sat Oct 13 06:12:28 MST 2012

On 10/13/2012 02:43 AM, Chris Little wrote:
> On 10/12/2012 1:40 PM, Daniel Owens wrote:
>> The markup would look like this:
>>
>> Hebrew (from Deuteronomy): <w lemma="whmlemma:Hאבד"
>> morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>
>>
>> Aramaic (from Jeremiah): <w lemma="whmlemma:Aאבד"
>> morph="whmmorph:some_value">יֵאבַ֧דוּ</w>
>>
>> The main problem I see is that other front-ends may not follow the
>> process of looking for G or H and then stripping the character before
>> looking up the entry.
>>
>> Could we come to a consensus on this?
>
> I would recommend taking a look at the markup used in the MorphGNT 
> module, which employs real lemmata in addition to lemmata coded as 
> Strong's numbers:
>
> <w morph="robinson:N-NSF" lemma="lemma.Strong:βίβλος 
> strong:G0976">Βίβλος</w>
>
> You should begin the workID for real lemmata with "lemma.", and follow 
> this with some identifier indicating the lemmatization scheme. We have 
> some code in Sword that looks for "lemma." and will treat the value as 
> a real word rather than a Strong's number or something else. I think 
> OSIS validation may complain about the workIDs of the form 
> "lemma.system", but that's a schema bug and you should ignore it.
>
> As for the value of the lemma itself ([HA]אבד in your example above), 
> you choose the form specified in the system you are employing. So, if 
> MORPH employs its own lemmatization system and that takes the form 
> @<word> for Hebrew and %<word> for Aramaic, then use those forms, e.g.:
>
> <w lemma="lemma.whm:@אבד" morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>
>
> The alternative is to distinguish the languages via the workID:
>
> <w lemma="lemma.whm.he:אבד" morph="whmmorph:some_value">תֹּאבֵדוּן֮</w>
>
> If you aren't creating a lexical resource that indexes based on @- and 
> %- prefixed lemmata, then I don't see how the former option is useful 
> and would recommend the latter. The latter option will allow lookups 
> in word-indexed lexica.
>
> --Chris
>
Thanks, Chris. I had not thought of the latter solution, but that is 
what we need. This raises a fundamental question: how will front-ends 
find the right lexical entry?

Currently, according to my understanding, a conf file may include 
Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest the 
following value also be allowed: Feature=AramaicDef. Then front-ends 
will be able to find entries in the correct language.
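
As a rough illustration of how a front-end might do this, here is a 
sketch in Python (not engine code). It splits a lemma attribute into 
workID/value pairs and picks a Feature from the language suffix of the 
workID. The ".he" suffix comes from Chris's example; the ".arc" suffix, 
the AramaicDef value, and the function names are only my assumptions.

```python
# Hypothetical sketch: choose a lexicon Feature from a lemma workID.
# ".he" follows the example above; ".arc" and "AramaicDef" are proposals,
# not existing conventions.
FEATURE_BY_LANG = {
    "he": "HebrewDef",
    "arc": "AramaicDef",  # proposed value, not currently defined
}


def parse_lemma_attr(attr):
    """Split an OSIS lemma attribute into (workID, value) pairs."""
    pairs = []
    for token in attr.split():
        work_id, sep, value = token.partition(":")
        if sep:  # skip tokens that carry no workID prefix
            pairs.append((work_id, value))
    return pairs


def feature_for(work_id):
    """Return the Feature to search for, or None if none can be inferred."""
    parts = work_id.split(".")
    # Expect the shape lemma.<system>.<language>
    if parts[0] != "lemma" or len(parts) < 3:
        return None
    return FEATURE_BY_LANG.get(parts[-1])
```

A front-end would then look for an installed lexicon whose conf file 
declares the returned Feature, and fall back to its current behaviour 
when None comes back.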

But lemmatization can vary somewhat in its details within a language. 
How could we include mappings between lemmatizations? With such 
mappings, a text using Strong's numbers could look up words in a lexicon 
keyed to Greek, Hebrew or Aramaic lemmata, and vice versa. Perhaps a 
simple mapping format could be the following:

The file StrongsGreek2AbbottSmith.map could contain:
G1=α
G2=Ἀαρών
G3=Ἀβαδδών
etc.

Front-ends could use these mappings to find the correct lexical entry. 
A lookup from KJV could then find the relevant entry in AbbottSmith. And 
with a similar mapping, MorphGNT2StrongsGreek.map, a lookup from MorphGNT 
could find the correct entry in Strongs, if that is the default Greek 
lexicon for the front-end.
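
To show how little is needed to consume such a file, here is a minimal 
sketch in Python. It assumes one key=value pair per line, as in the 
sample above; the function names are hypothetical, and nothing like this 
exists in the engine today.

```python
def load_map(lines):
    """Read key=value lines (e.g. 'G2=Ἀαρών') into a dict."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        mapping[key.strip()] = value.strip()
    return mapping


def invert(mapping):
    """Build the reverse lookup; the first key wins if values repeat."""
    inverse = {}
    for key, value in mapping.items():
        inverse.setdefault(value, key)
    return inverse


# The three sample entries from the proposed StrongsGreek2AbbottSmith.map
sample = ["G1=α", "G2=Ἀαρών", "G3=Ἀβαδδών"]
strongs_to_lemma = load_map(sample)
lemma_to_strongs = invert(strongs_to_lemma)
```

With real data the keys on both sides would probably need the same 
Unicode normalization before comparison, and one lemma may correspond to 
several Strong's numbers, so the reverse direction may need a list 
rather than a single value.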

I use Greek because I have the data ready at hand, but this method would 
be even more important for Hebrew. I was testing with BibleTime and 
found that only some of the lemmata in WHM would find their way to the 
correct BDB entry. This is because their lemmatizations are different. 
Providing for a mapping would allow us to resolve those conflicts for 
the user. Also, the OSMHB module could find entries in BDB keyed to 
Hebrew, and the WHM could find entries in BDB or Strongs. I expect this 
mapping would need to happen at the engine level.

Is that a reasonable solution? Or does someone have a better idea?

Daniel