[sword-devel] seeking consensus on OSIS lemma best practice
Daniel Owens
dcowens76 at gmail.com
Sat Oct 13 15:15:31 MST 2012
On 10/13/2012 05:23 PM, Chris Little wrote:
> On 10/13/2012 6:12 AM, Daniel Owens wrote:
>> Thanks, Chris. I had not thought of the latter solution, but that is
>> what we need. This raises a fundamental question: how will front-ends
>> find the right lexical entry?
>>
>> Currently, according to my understanding, a conf file may include
>> Feature=HebrewDef. To distinguish Hebrew from Aramaic, I suggest the
>> following value also be allowed: Feature=AramaicDef. Then front-ends
>> will be able to find entries in the correct language.
>
> HebrewDef indicates that a lexicon module is indexed by Strong's
> numbers. Everything you've said so far indicates to me that you aren't
> using Strong's numbers at all, so do not use Feature=HebrewDef. Also,
> there should not ever be a Feature=AramaicDef since Aramaic Strong's
> numbers are not distinguished from Hebrew.
>
Yes, I am not using Strong's numbers at all. I am hoping to help SWORD
move away from its dependence upon Strong's, both the module and the
numbers. It never occurred to me when someone told me to use
Feature=HebrewDef that it was reserved only for Strong's numbers. But if
that is what it does, then I understand why my suggestion to add
AramaicDef should be discarded. No problem, though in my defense the
nomenclature is misleading (perhaps it should be called StrongsHebrewDef?).
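For anyone following along, my understanding of the current convention
is that a Strong's-keyed lexicon declares the feature in its conf
roughly like this (module name and DataPath here are illustrative):

[StrongsHebrew]
DataPath=./modules/lexdict/rawld/strongshebrew/
ModDrv=RawLD
Feature=HebrewDef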
> I think it would probably be helpful if you could enumerate the set of
> modules you propose to create:
>
> a Bible (just one? more than one?)
> a lexicon? separate Hebrew & Aramaic lexica?
> a morphology database? separate Hebrew & Aramaic databases?
>
I am trying to see that there are respectable free or low-cost options
for studying the Bible in Greek, Hebrew, and Aramaic. I am trying to
envision the big picture, some of which is already filled in, and then
work toward filling in the rest. In the end I would like to see the
following modules:
For Greek:
- Bible Texts: MorphGNT (Greek lemmata, not Strong's numbers); other
future texts with Greek lemmata; other current and future texts with
Strong's numbers (Tischendorf, WH, KJV, etc.)
- Lexica: Strong's Greek; Abbott-Smith (Greek lemma)
For Hebrew:
- Bible Texts: WHM (Hebrew lemmata); OSMHB (currently keyed with
Strong's numbers, though I hope it will eventually have a more
up-to-date lemmatization)
- Lexica: Strong's Hebrew; BDB Hebrew (Hebrew lemmata); BDB Aramaic
(Aramaic lemmata)
> My guess is that you are advocating a Feature value that indicates
> "this lexicon module contains words in language X, indexed by
> lemma/word". I would absolutely be supportive of adding this, but we
> currently have nothing comparable in use. I would advocate
> (Greek|Hebrew|Aramaic|...)WordDef for the value.
>
That makes sense to me. That's what I thought I was advocating. :) Just
to make sure we are communicating, though, you mean
Feature=GreekWordDef, etc., right?
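To spell it out, under that proposal I would expect an Abbott-Smith
conf to carry something like this (DataPath illustrative):

[AbbottSmith]
DataPath=./modules/lexdict/rawld/abbottsmith/
ModDrv=RawLD
Feature=GreekWordDef

so a front-end asked to look up a Greek lemma could prefer modules
declaring GreekWordDef over Strong's-keyed ones.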
>> But lemmatization can vary somewhat in the details within a language.
>> How could we include mappings between lemmatizations? That way a text
>> using Strong's numbers could look up words in a lexicon keyed to
>> Greek, Hebrew, or Aramaic and vice versa. Perhaps a simple mapping
>> format could be the following:
>>
>> The file StrongsGreek2AbbottSmith.map could contain:
>> G1=α
>> G2=Ἀαρών
>> G3=Ἀβαδδών
>> etc.
>>
>> Frontends could use these mappings to find the correct lexical entry.
>> A lookup from the KJV could then find the relevant entry in
>> Abbott-Smith. And with a similar mapping, MorphGNT2StrongsGreek.map, a
>> lookup from MorphGNT could find the correct entry in Strong's, if that
>> is the default Greek lexicon for the front-end.
>>
>> I use Greek because I have the data ready at hand, but this method
>> would be even more important for Hebrew. I was testing with BibleTime
>> and found that only some of the lemmata in WHM would find their way to
>> the correct BDB entry. This is because their lemmatizations differ.
>> Providing a mapping would allow us to resolve those conflicts for the
>> user. Also, the OSMHB module could find entries in BDB keyed to
>> Hebrew, and the WHM could find entries in BDB or Strong's. I expect
>> this mapping would need to happen at the engine level.
>>
>> Is that a reasonable solution? Or does someone have a better idea?
>
> I believe that mapping to/from Strong's numbers is not one-to-one, but
> many-to-many. We currently allow lookups based on lemmata by keying
> lexica to lemmata. A lexicon can have multiple keys point to a single
> entry.
>
Yes, mapping between them is complicated, and not all cases will work
exactly right. And yes, multiple lexical keys *sort of* point to a
single entry. In practice they point to text that says "@LINK" followed
by the other key, but they do not link to the actual entry. For example,
I created a lexicon with Hebrew and Strong's keys, and the result for H1
was:
H0001 @LINK אָב
Lookup *should* be seamless; that is, the user should not have to find
the entry manually. Maybe in some odd cases the user would need to
scroll up or down an entry or two, but the above example would require
scrolling ~8600 entries away. And certainly there should not be empty
entries like the one above.
I am simply advocating a solution that will hide some of the guts of the
data and just work for the user. Let Strong's and the KJV be keyed to
Strong's numbers, and MorphGNT, WHM, Abbott-Smith, BDB, etc. be keyed to
natural-language lemmata. But find a way to connect them seamlessly.
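To sketch the kind of resolution logic I have in mind (in Python,
purely illustrative; the map format is the KEY=lemma format from my
example above, and the @LINK handling reflects the stub entries I
described):

def load_map(path):
    # Parse a mapping file of KEY=lemma lines, e.g. "G1=α".
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and "=" in line:
                key, _, lemma = line.partition("=")
                mapping[key] = lemma
    return mapping

def resolve(lexicon, key, mapping):
    # Translate the key if it is mapped (e.g. G1 -> α), then follow
    # any "@LINK <target>" stub entries through to the real entry.
    key = mapping.get(key, key)
    entry = lexicon.get(key)
    while entry is not None and entry.startswith("@LINK"):
        key = entry.split(None, 1)[1]
        entry = lexicon.get(key)
    return entry

# Toy usage: a lookup from a Strong's-keyed text lands on the real
# lemma-keyed entry without the user ever seeing a stub.
strongs_to_as = load_map("StrongsGreek2AbbottSmith.map")
abbott_smith = {"α": "alpha, the first letter of the Greek alphabet..."}
print(resolve(abbott_smith, "G1", strongs_to_as))

Something like that, done once at the engine level, would give every
front-end this behavior for free.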
> Ultimately, it would be very nice to write a stemmer for each of the
> relevant languages, index lexica by stem (or facilitate searches by
> stem), and thus do away with some of the need to pre-lemmatize texts.
> I don't know whether stemming algorithms exist for Greek & Hebrew or
> necessarily how reliable they would be, but it's an area worth some
> research.
>
> --Chris
That task is beyond me, and as far as I know it is standard practice to
pre-lemmatize texts. And we have the texts pre-lemmatized already. The
real practical challenge at the moment is getting from those texts to
the proper lexical entry. Currently, to do this reliably in SWORD, you
have to stay within a lemmatization silo. In other words, working with
Strong's texts you can get to a Strong's lexical entry very reliably,
but move outside of that and it is inconsistent. I am just trying to
find some solution. It does not need to be mine, but it needs to work.
My proposal may not be the best solution, but it would save having to
add foreign lexical keys (i.e., Strong's numbers) to lexica like
Abbott-Smith or BDB.
Daniel