[sword-devel] seeking consensus on OSIS lemma best practice

Daniel Owens dcowens76 at gmail.com
Sat Oct 13 15:15:31 MST 2012


On 10/13/2012 05:23 PM, Chris Little wrote:
> On 10/13/2012 6:12 AM, Daniel Owens wrote:
>> Thanks, Chris. I had not thought of the latter solution, but that is
>> what we need. This raises a fundamental question: how will front-ends
>> find the right lexical entry?
>>
>> Currently, according to my understanding, a conf file may include
>> Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest the
>> following value also be allowed: Feature=AramaicDef. Then front-ends
>> will be able to find entries in the correct language.
>
> HebrewDef indicates that a lexicon module is indexed by Strong's 
> numbers. Everything you've said so far indicates to me that you aren't 
> using Strong's numbers at all, so do not use Feature=HebrewDef. Also, 
> there should not ever be a Feature=AramaicDef since Aramaic Strong's 
> numbers are not distinguished from Hebrew.
>
Yes, I am not using Strong's numbers at all. I am hoping to help SWORD 
move away from its dependence upon Strong's, both the module and the 
numbers. It never occurred to me when someone told me to use 
Feature=HebrewDef that it was reserved only for Strong's numbers. But if 
that is what it does, then I understand why my suggestion to add 
AramaicDef should be discarded. No problem, though in my defense the 
nomenclature is misleading (perhaps it should be called StrongsHebrewDef?).

> I think it would probably be helpful if you could enumerate the set of 
> modules you propose to create:
>
> a Bible (just one? more than one?)
> a lexicon? separate Hebrew & Aramaic lexica?
> a morphology database? separate Hebrew & Aramaic databases?
>
I am trying to make sure there are respectable free or low-cost options for 
studying the Bible in Greek, Hebrew, and Aramaic. I am trying to envision the 
big picture, some of which is already filled in, and then work toward filling 
in the rest. In the end I would like to see the following modules:

For Greek:
- Bible Texts: MorphGNT (Greek lemma, not Strong's numbers); other 
future texts with Greek lemma; other current and future texts with 
Strong's numbers (Tischendorf, WH, KJV, etc.)
- Lexica: Strong's Greek; Abbott-Smith (Greek lemma)

For Hebrew:
- Bible Texts: WHM (Hebrew lemma); OSMHB (currently has Strong's 
numbers, but I hope it will eventually have some other, more up-to-date 
lemmatization)
- Lexica: Strong's Hebrew; BDB Hebrew (Hebrew lemma); BDB Aramaic 
(Aramaic lemma)

> My guess is that you are advocating a Feature value that indicates 
> "this lexicon module contains words in language X, indexed by 
> lemma/word". I would absolutely be supportive of adding this, but we 
> currently have nothing comparable in use. I would advocate 
> (Greek|Hebrew|Aramaic|...)WordDef for the value.
>
That makes sense to me. That's what I thought I was advocating. :) Just 
to make sure we are communicating, though, you mean 
Feature=GreekWordDef, etc., right?
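
If so, then for example (assuming the value is adopted in that form) the 
Abbott-Smith conf would simply gain a line like:

Feature=GreekWordDef

and BDB Hebrew and BDB Aramaic would carry Feature=HebrewWordDef and 
Feature=AramaicWordDef respectively, so a front-end could pick the right 
lexicon by language.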

>> But lemmatization can vary somewhat in the details within a language.
>> How could we include mappings between lemmatization? That way we could
>> map between lemmatizations so a text using Strong's numbers could look
>> up words in a lexicon keyed to Greek, Hebrew or Aramaic and vice versa.
>> Perhaps a simple mapping format could be the following:
>>
>> The file StrongsGreek2AbbottSmith.map could contain:
>> G1=α
>> G2=Ἀαρών
>> G3=Ἀβαδδών
>> etc.
>>
>> Frontends could use these mappings to find the correct lexical entry. So
>> A lookup from KJV could then find the relevant entry in AbbottSmith. And
>> with a similar mapping MorphGNT2StrongsGreek.map a lookup from MorphGNT
>> could find the correct entry in Strongs, if that is the default Greek
>> Lexicon for the front-end.
>>
>> I use Greek because I have the data ready at hand, but this method would
>> be even more important for Hebrew. I was testing with BibleTime and
>> found that only some of the lemma in WHM would find their way to the
>> correct BDB entry. This is because their lemmatizations are different.
>> Providing for a mapping would allow us to resolve those conflicts for
>> the user. Also, the OSMHB module could find entries in BDB keyed to
>> Hebrew, and the WHM could find entries in BDB or Strongs. I expect this
>> mapping would need to happen at the engine level.
>>
>> Is that a reasonable solution? Or does someone have a better idea?
>
> I believe that mapping to/from Strong's numbers is not one-to-one, but 
> many-to-many. We currently allow lookups based on lemmata by keying 
> lexica to lemmata. A lexicon can have multiple keys point to a single 
> entry.
>
Yes, mapping between them is complicated, and not every case will work 
exactly right. And yes, multiple lexical keys *sort of* point to a single 
entry. In practice they point to text that says "@LINK" plus the other 
key but do not link to the actual entry. For example, I created a 
lexicon with Hebrew and Strong's keys, and the result for H1 was:

H0001 @LINK אָב

Lookup *should* be seamless; that is, the user should not have to find 
the entry manually. Maybe in some odd cases the user would need to 
scroll up or down an entry or two, but the example above would require 
scrolling roughly 8,600 entries away. And there certainly should not be 
empty entries like the one above.
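
Just to illustrate what I mean by seamless: the front-end could follow those 
placeholders itself. Here is a rough sketch in Python (get_entry() is a 
hypothetical stand-in for however the engine or front-end actually fetches an 
entry's text by key, not real SWORD API):

def resolve_entry(get_entry, key, max_hops=5):
    # Follow "@LINK <other key>" placeholder entries until a real entry
    # (or nothing) is found.
    for _ in range(max_hops):
        text = get_entry(key)
        if not text or not text.startswith("@LINK"):
            return key, text
        key = text[len("@LINK"):].strip()
    return key, None

With something like that in place, a lookup for H0001 would land on the אָב 
entry instead of displaying the placeholder.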

I am simply advocating a solution that will hide some of the guts of the 
data and just work for the user. Let Strong's and the KJV be keyed to 
Strong's numbers, and MorphGNT, WHM, Abbott-Smith, BDB, etc. be keyed to 
natural-language lemmata, but find a way to connect them seamlessly.
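
And to make the mapping idea concrete, here is a rough sketch of how a 
front-end might read one of the proposed .map files (plain Python, purely 
illustrative, and it glosses over the many-to-many cases Chris mentions above):

def load_map(path):
    # Parse lines like "G1=α" into forward (Strong's -> lemma) and
    # reverse (lemma -> Strong's) dictionaries. Keeping only one reverse
    # key per lemma is a simplification.
    forward, reverse = {}, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or "=" not in line:
                continue
            number, lemma = line.split("=", 1)
            forward[number] = lemma
            reverse.setdefault(lemma, number)
    return forward, reverse

strongs_to_lemma, lemma_to_strongs = load_map("StrongsGreek2AbbottSmith.map")
# A KJV lookup on G2 could then be redirected to the Abbott-Smith key:
print(strongs_to_lemma.get("G2"))  # Ἀαρών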

> Ultimately, it would be very nice to write a stemmer for each of the 
> relevant languages, index lexica by stem (or facilitate searches by 
> stem), and thus do away with some of the need to pre-lemmatize texts. 
> I don't know whether stemming algorithms exist for Greek & Hebrew or 
> necessarily how reliable they would be, but it's an area worth some 
> research.
>
> --Chris
That task is beyond me, and as far as I know it is standard practice to 
pre-lemmatize texts; we already have the texts pre-lemmatized. The real 
challenge at the moment is getting from those texts to the proper lexical 
entry. Currently, to do this reliably in SWORD you have to stay within a 
lemmatization silo: working with Strong's texts you can get to a Strong's 
lexical entry very reliably, but move outside of that and it becomes 
inconsistent. I am just trying to find some solution. It does not need to 
be mine, but it needs to work. My proposal may not be the best one, but it 
would avoid having to add foreign lexical keys (i.e. Strong's numbers) to 
lexica like Abbott-Smith or BDB.

Daniel


