[jsword-devel] Lucene Indexes

Mullins, Steven Steven.Mullins at dmme.virginia.gov
Thu May 22 12:12:09 MST 2008


DM,

OK, here is my patch to enable searching lexical (base) forms in
the MorphGNT module.
Usage example: [john 1-4] lex:??????? && lex:??????

You can also search by the morph field.
Usage example: [john 1-4] lex:?????? && morph:n-nsm

However, the lex and morph field searches are bound to the verse they
fall in, not to a particular word.  I have no solution to this
problem yet.  I think it would be best, as you have suggested, to index
each word, perhaps as book.chapter.verse.wordnumber.  For example,
<verse osisID='John.1.10'> is the key for the index now.  Should
the key be deepened to, say, John.1.10.1, John.1.10.2, John.1.10.3, etc.
and be stored in a Lucene field?  This would seem to break the search
paradigm in jsword.  What do you think?
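
To make the question concrete, here is a minimal sketch of the deepened key in Python (not jsword code). The helper names word_keys and verse_of are hypothetical. It only shows that a word-level key of the form book.chapter.verse.wordnumber can be folded back to the verse-level key, so verse-oriented results could still be reported.

```python
def word_keys(osis_id, words):
    """Pair each word of a verse with a key like 'John.1.10.1'.

    Word numbers start at 1, following the John.1.10.1, John.1.10.2
    numbering suggested above.
    """
    return [(f"{osis_id}.{n}", word) for n, word in enumerate(words, start=1)]


def verse_of(word_key):
    """Drop the trailing word number to recover the verse-level key."""
    return word_key.rsplit(".", 1)[0]


# Placeholder words; a real index would use the verse text.
keys = word_keys("John.1.10", ["w1", "w2", "w3"])
verses_hit = {verse_of(key) for key, _ in keys}
```

Whether jsword should store such keys in a Lucene field is still the open question; the sketch only illustrates the key scheme.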

This is not the only hurdle to accurate morphologic searching in jsword.
James Tauber (http://morphgnt.org/projects/ccat-morphgnt) has done
a wonderful job of formatting the CCAT MorphGNT text for textual analysis.
Much of the formatting is lost in the Crosswire.org modules because the
lexical data is formatted in a specific way.  The "raw" format is much
more amenable to direct analysis.  For example, here is the text
from the raw MorphGNT text for John 1:1, the first five words:

040101 P- -------- ἐν ἐν
040101 N- ----DSF- ἀρχῇ ἀρχή
040101 V- 3IAI-S-- ἦν εἰμί
040101 RA ----NSM- ὁ ὁ
040101 N- ----NSM- λόγος λόγος

This is the description of the format from James' site:

First column is the book/chapter/verse. (Note that shorter ending of
Mark appears after longer)

Second column is the part of speech:

    * A- adjective
    * C- conjunction
    * D- adverb
    * I- interjection
    * N- noun
    * P- preposition
    * RA article
    * RD demonstrative
    * RI interrogative/indefinite pronoun
    * RP personal/possessive pronoun
    * RR relative pronoun
    * V- verb
    * X- particle 

Third column has eight slots for parse codes:

    * Person: 1, 2, 3
    * Tense: Aorist, Future, Imperfect, Present, X-perfect, Y-pluperfect
    * Voice: Active, Middle, Passive
    * Mood: D-imperative, Indicative, N-infinitive, O-optative,
      P-participle, S-subjunctive
    * Case: Accusative, Dative, Genitive, Nominative, Vocative
    * Number: Plural, Singular
    * Gender: Feminine, Masculine, Neuter
    * Degree: Comparative, Superlative 

Fourth column is the form that appears in the UBS3/NA26 text.
Fifth column is the lemma or dictionary form.
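
As an illustration of how little work the raw format needs, here is a minimal Python sketch that splits one line into the five columns described above. The names MorphWord and parse_line are hypothetical, and the split of the first column into two digits each for book, chapter and verse is an assumption read off the 040101 example for John 1:1.

```python
from typing import NamedTuple


class MorphWord(NamedTuple):
    book: int      # assumed: first two digits of column 1
    chapter: int   # assumed: middle two digits of column 1
    verse: int     # assumed: last two digits of column 1
    pos: str       # column 2, e.g. "V-"
    parse: str     # column 3, eight parse slots, e.g. "3IAI-S--"
    form: str      # column 4, the form as it appears in the text
    lemma: str     # column 5, the dictionary form


def parse_line(line):
    """Split one whitespace-separated raw line into its five columns."""
    ref, pos, parse, form, lemma = line.split()
    if len(ref) != 6 or len(pos) != 2 or len(parse) != 8:
        raise ValueError(f"unexpected column widths in: {line!r}")
    return MorphWord(int(ref[0:2]), int(ref[2:4]), int(ref[4:6]),
                     pos, parse, form, lemma)


# Third word of John 1:1 (the verb).
word = parse_line("040101 V- 3IAI-S-- ἦν εἰμί")
```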

Now, suppose we could place these codes into an OSIS document like this:

<w lemma="lex:ἀρχή" morph="ccat:N-----DSF-">ἀρχῇ</w>

It is far easier and less error-prone to parse a tag in a
fixed-field format like "V-3IAI-S--" than the current tags such as
"V-IAI-3S".  Variables for voice, tense, mood, person, case, and
number could be populated from a search dialog or query, and
matches returned by simply checking for a value in the appropriate
position of the ccat tag.  The MorphGNT module would first need to be
updated to include the raw morphological data.
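
A minimal sketch of that positional check, in Python rather than jsword's Java. The function name matches and the SLOTS tuple are hypothetical. The tag layout (two characters for part of speech followed by the eight parse slots in the order listed above) is taken from the "V-3IAI-S--" example.

```python
# Order of the eight parse slots, as listed in the format description.
SLOTS = ("person", "tense", "voice", "mood",
         "case", "number", "gender", "degree")


def matches(ccat_tag, pos=None, **wanted):
    """Check a fixed-field tag such as 'V-3IAI-S--' against wanted values.

    Each keyword names a slot (e.g. tense="I") and is compared with the
    single character at that slot's fixed position.
    """
    if len(ccat_tag) != 10:
        raise ValueError(f"expected a 10-character tag, got {ccat_tag!r}")
    if pos is not None and ccat_tag[:2] != pos:
        return False
    parse = ccat_tag[2:]
    return all(parse[SLOTS.index(name)] == value
               for name, value in wanted.items())


# Third person imperfect indicative verb?
is_3rd_imperfect = matches("V-3IAI-S--", pos="V-", person="3", tense="I", mood="I")
```

No parsing of variable-length codes is needed; a search dialog would only have to map each selected option to a slot name and a one-letter value.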

The text format is so simple, you can do analysis with tools
like grep and sed!  Surely we can get jsword up to par with
grep and sed.  What are your thoughts on this?

Blessings,

Steve



-----Original Message-----
From: DM Smith [mailto:dmsmith555 at yahoo.com]
Sent: Wed, May 21, 2008 1:33 PM
To: J-Sword Developers Mailing List
Subject: Re: [jsword-devel] Lucene Indexes


Mullins, Steven wrote:
> I have jsword indexing the lexical forms and the robinson codes
> for the MorphGNT module.  The syntax is:
>
> rob:n-nsm && lex:?????
>   
Hmm. I thought that robinson morphology was already handled by JSword by 
stuffing it in morph:

> This will search for all verses with the lexical form "?????"
> and the robinson morphological code n-nsm.  However, "?????"
> can be anywhere in the verse and the "n-nsm" tag can apply to
> any word in the verse.  I'd like to restrict the search so that
> the robinsons search applies only to a particular word.  For
> example:
>
> (lex:????? WITH rob:n-nsm)
> Translation: Search all words with lexical form "?????", which 
> also has a robinson's code of "n-nsm".  
>
> I don't know how or if Lucene establishes the relationship between
> fields.  Is there a way to establish a link between the <content>
> field and <lex> and <rob> field? 
>   
This would be a great question for the lucene-users mailing list.

As far as I know, this has not been done. But, it appears that there is 
enough information held in the index to perform such a search.

That is, each term (token?) in the index is tied to its offset and 
length in the text, and each is given a position. For each field, the first 
term would be 1, the next 2, .... Thus, two fields can be parallel arrays.

Also, it is possible to fudge the position increment so that, when a 
<w> element is being processed, every word from that element that is 
stuffed into the content field gets the same position.

This would provide morph:, lex:, content:, .... a way to be connected in 
parallel.
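
A toy model of the parallel-array idea in plain Python. This is not Lucene API; it only assumes that each field holds one token per word and that equal positions refer to the same word. The name words_matching is hypothetical, positions here are 0-based rather than starting at 1, and the morph values are the fixed-field ccat tags from Steve's mail.

```python
def words_matching(fields, **wanted):
    """Return positions where every requested field holds the wanted value.

    `fields` maps a field name to its token list; index i in every list
    describes the same word of the verse.
    """
    lengths = {len(tokens) for tokens in fields.values()}
    if len(lengths) != 1:
        raise ValueError("fields must be parallel (same number of tokens)")
    size = lengths.pop()
    return [i for i in range(size)
            if all(fields[name][i] == value for name, value in wanted.items())]


# First five words of John 1:1, one list per field.
john_1_1 = {
    "content": ["ἐν", "ἀρχῇ", "ἦν", "ὁ", "λόγος"],
    "lex":     ["ἐν", "ἀρχή", "εἰμί", "ὁ", "λόγος"],
    "morph":   ["P---------", "N-----DSF-", "V-3IAI-S--",
                "RA----NSM-", "N-----NSM-"],
}

# Same word must carry both the lemma and the tag.
same_word = words_matching(john_1_1, lex=john_1_1["lex"][4], morph="N-----NSM-")
# Both values occur in the verse, but on different words, so no hit.
different_words = words_matching(john_1_1, lex=john_1_1["lex"][1], morph="N-----NSM-")
```

A verse-level search would report a hit in the second case; the position check is what restricts the match to a single word.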

The other way would be to think of each field as a table in a database, 
indexed by document number, and ignore the whole notion of position.

Then, one would create fields for relationships, so the morph <-> lex 
relationship would be held in a morph_lex: field and searched as such.

Then one could search:
morph_lex:("xxx n-nsm")

The obvious problem with this is one can only exploit relationships that 
are explicitly defined, while the first solution is more general.

The trick would be to synthesize combo search expressions on the fly.
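
One possible shape for that synthesis, sketched in Python under loud assumptions: the WITH operator is only Steve's proposed syntax, the morph_lex field does not exist yet, and the function name rewrite is hypothetical. The sketch rewrites a "(lex:X WITH rob:Y)" clause into the morph_lex:("X Y") form shown above and leaves the rest of the query alone.

```python
import re

# Matches "(lex:X WITH rob:Y)" or "(lex:X WITH morph:Y)".
WITH_CLAUSE = re.compile(
    r"\(\s*lex:(\S+)\s+WITH\s+(?:rob|morph):(\S+?)\s*\)"
)


def rewrite(query):
    """Replace each WITH clause by a phrase search on the combo field."""
    return WITH_CLAUSE.sub(
        lambda m: f'morph_lex:("{m.group(1)} {m.group(2)}")',
        query,
    )


rewritten = rewrite("[john 1-4] (lex:xxx WITH rob:n-nsm)")
```

This only covers the one relationship that has a combo field, which is the limitation noted above.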
> Perhaps this is already done, but if so, I do not know the syntax
> to employ it.
>
> Thanks,
>
> Steve
>
> -----Original Message-----
> From: DM Smith [mailto:dmsmith555 at yahoo.com]
> Sent: Mon, May 19, 2008 12:00 PM
> To: J-Sword Developers Mailing List
> Subject: Re: [jsword-devel] Lucene Indexes
>
>
> Mullins, Steven wrote:
>   
>> The beauty of the MorphGNT module is that the analysis is already done!
>> For every inflected word, you have tagged the lexical form (to search by)
>> and the morphological tag (to narrow a search).  So, for example, if I wanted
>> to search for all verses with "believe" in the first person active
>> indicative with "Jesus" as a direct object, I could, if only I had the 
>> lexical form and morph tags in lucene working.
>>
>> Just my 2-cents.
>>
>> Steve
>>   
>>     
>
> I think it's worth more than a couple of cents!
>
> One of the things that we have on our todo list (Jira's down 
> indefinitely, so don't bother looking) is to create a Strong's index 
> that could be used for any module. So if anyone ever had strong:(H3068 
> AND H3069) in a search, they would find the verse in their favorite text.
>
> We could do something similar with the analysis of MorphGNT.
>
> BTW, I welcome contributions as I'll be focusing on RTL issues, 
> translations into other languages and Bookmarks.
>
> In Him,
>     DM
>
>
> _______________________________________________
> jsword-devel mailing list
> jsword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/jsword-devel
>
>
>   



-------------- next part --------------
A non-text attachment was scrubbed...
Name: jsword.zip
Type: application/x-zip-compressed
Size: 1498 bytes
Desc: jsword.zip
Url : http://www.crosswire.org/pipermail/jsword-devel/attachments/20080522/5641b7a2/attachment.bin 

