V11n was Re: [sword-devel] Jonah 1.17 / 2.1

DM Smith dmsmith555 at yahoo.com
Thu Mar 23 11:11:35 MST 2006


DavidTroidl at aol.com wrote:
> Hi,
>  
> I also have several issues with osis2mod, and I was getting ready to 
> post.  The fact is that there are several versification schemes for 
> both Old and New Testaments.  I was having a similar problem with 
> re-versification in Tischendorf's Greek New Testament.  It has John 
> 1:52, because an earlier verse is sub-divided.  But it also has 3John 
> 15 and Rev 12:18, which agrees with UBS 4.
>  
> How can we get osis2mod to recognize true variations in versification, 
> and not "standardize" everything?
A SWORD module consists of text (possibly compressed) and an index into 
that text. (Compressed modules will have additional tables marking the 
start and end of the compression unit. But I am ignoring them in the 
discussion below.)

In a nutshell, the code needs to be changed in two places: the part that 
creates the index and the part that reads it.

Here is an overview of how it all hangs together. This may be a bit 
imprecise because the JSword implementation, which I work on and am 
familiar with, may be slightly different from the actual SWORD API 
implementation.

The index is a big fixed size array with each entry giving the start and 
length of each verse. There are slots for "introductions" to chapters 
and books, e.g. Gen.0 would give the intro to Genesis and Gen.1.0 would 
give an introduction to Genesis Chapter 1.
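
As a rough sketch of that layout (not the actual SWORD or JSword code), the 
slots can be enumerated from a table of verse counts. The names V11N, 
build_slots and SLOT_OF are made up for illustration, and the table only 
holds two chapters of Genesis and the book of Jude.

```python
# Hypothetical, tiny versification table: book -> verses per chapter.
# Only Genesis 1-2 and Jude are listed (with their KJV verse counts).
V11N = {"Gen": [31, 25], "Jude": [25]}


def build_slots(v11n):
    """Return the slot keys in index order, including the intro slots."""
    slots = []
    for book, chapters in v11n.items():
        slots.append((book, 0, 0))            # book intro, e.g. Gen.0
        for chapter, verse_count in enumerate(chapters, start=1):
            slots.append((book, chapter, 0))  # chapter intro, e.g. Gen.1.0
            for verse in range(1, verse_count + 1):
                slots.append((book, chapter, verse))
    return slots


SLOTS = build_slots(V11N)
# Map each (book, chapter, verse) key to its position in the fixed array.
SLOT_OF = {key: position for position, key in enumerate(SLOTS)}
# One (start, length) entry per slot; (0, 0) stands for "no text here".
index = [(0, 0)] * len(SLOTS)
```

Because the array size is fixed by the table, a verse that the table does 
not list has no slot at all, which is the root of the problem discussed here.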

Lookup happens in this fashion: the verse reference is first normalized 
(e.g. Matthew 1:5 might become Matt.1.5), and then this is re-normalized 
into 40.1.5. That normalization is then converted into an index into the 
fixed size array via a lookup table.
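
The three steps might look something like the sketch below. All of the 
tables (OSIS_NAME, BOOK_NUMBER, CHAPTER_BASE) and function names are 
assumptions for illustration, and the values in CHAPTER_BASE are 
illustrative offsets, not the real SWORD ones.

```python
import re

# Hypothetical alias and numbering tables; only two books are filled in.
OSIS_NAME = {"matthew": "Matt", "matt": "Matt", "mark": "Mark"}
BOOK_NUMBER = {"Matt": 40, "Mark": 41}
# Illustrative slot where each chapter's intro (verse 0) sits in the array.
CHAPTER_BASE = {(40, 1): 0, (40, 2): 26}


def to_osis_id(reference):
    """Step 1: 'Matthew 1:5' -> 'Matt.1.5'."""
    match = re.fullmatch(r"\s*(\w+)\s+(\d+):(\d+)\s*", reference)
    if match is None:
        raise ValueError(f"cannot parse reference: {reference!r}")
    name, chapter, verse = match.groups()
    return f"{OSIS_NAME[name.lower()]}.{int(chapter)}.{int(verse)}"


def to_numeric(osis_id):
    """Step 2: 'Matt.1.5' -> (40, 1, 5)."""
    book, chapter, verse = osis_id.split(".")
    return (BOOK_NUMBER[book], int(chapter), int(verse))


def to_slot(numeric):
    """Step 3: (40, 1, 5) -> position in the fixed size array."""
    book, chapter, verse = numeric
    return CHAPTER_BASE[(book, chapter)] + verse
```

For example, to_slot(to_numeric(to_osis_id("Matthew 1:5"))) walks through 
all three steps with these made-up tables.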

The index is created in the same fashion. As the input is parsed, the 
verse body is extracted, and titles which are immediately before the 
verse are marked as pre-verse and prepended to the verse. The verse 
reference is converted into the array index. The verse is written to the 
output file and the start of that verse in the output file is recorded 
in the index along with its length.

You will note that the verses are laid down in the output file in the 
order that they are in the input file. If a verse exists more than once 
in the input, I think both get written to the output file, but the last 
one over-writes the first in the index. If a verse pertains to more than 
one KJV verse (e.g. <verse osisID="Gen.1.1 Gen.1.2"> text of Genesis 1.1 
and Genesis 1.2</verse>) then this is recorded in two index slots that 
point to the same place in the output file. It is possible to feed a 
correction to a module of just the changed verses. This will then be 
appended to the output file and the index will be updated to reflect the 
new material. The old material still remains.
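
A minimal sketch of that write path, assuming an in-memory buffer for the 
output file and a dict keyed by osisID standing in for the fixed array 
(write_verses and read_verse are hypothetical names):

```python
import io


def write_verses(out, index, verses):
    """Append each (osis_ids, text) pair to `out` and record it in `index`.

    `osis_ids` is a whitespace-separated osisID value such as
    "Gen.1.1 Gen.1.2"; every ID listed gets the same (start, length).
    """
    out.seek(0, io.SEEK_END)  # always append; never rewrite old text
    for osis_ids, text in verses:
        data = text.encode("utf-8")
        start = out.tell()
        out.write(data)
        for osis_id in osis_ids.split():
            # A later entry for the same ID replaces the earlier one.
            index[osis_id] = (start, len(data))


def read_verse(out, index, osis_id):
    """Fetch the text that the index currently points at."""
    start, length = index[osis_id]
    out.seek(start)
    return out.read(length).decode("utf-8")


out, index = io.BytesIO(), {}
write_verses(out, index, [("Gen.1.1 Gen.1.2", "AB"), ("Gen.1.3", "C")])
# A "correction" run: only the changed verse is fed in again.
write_verses(out, index, [("Gen.1.3", "C2")])
```

After the second call the index points Gen.1.3 at the new text, while the 
bytes of the old "C" are still sitting in the output.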

When a verse reference is outside of the KJV v11n, it is recognized as a 
problem. Now there are only so many ways that the program can handle it. 
It could reject it. Or, in the case of JSword, if the "book" and 
"chapter" are in the KJV v11n, then it figures out which verse is really 
meant by counting from the start of the chapter. So Matt 1:27 would 
silently become Matt 2:2, because Matthew 1 has only 25 verses in the 
KJV. Later, when Matt 2:2 is seen, it would overwrite the earlier entry 
in the index and Matt 1:27 would be lost. There may be other strategies, 
but none of them will produce the desired results.
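
The arithmetic behind that roll-over can be sketched as follows. This is 
only an illustration of the behaviour described above, not the JSword 
code; roll_over and MATT_VERSES are made-up names.

```python
MATT_VERSES = [25, 23]  # KJV verse counts for Matthew 1 and 2 only


def roll_over(chapter, verse, counts):
    """Carry a too-large verse number forward into the following chapters."""
    while True:
        if chapter > len(counts):
            raise ValueError("reference runs past the end of the table")
        if verse <= counts[chapter - 1]:
            return chapter, verse
        verse -= counts[chapter - 1]
        chapter += 1


index = {}
for ref, text in [((1, 27), "text of Matt 1:27"), ((2, 2), "text of Matt 2:2")]:
    # Both references land on the same key, so the second one wins.
    index[roll_over(*ref, MATT_VERSES)] = text
```

Here the index ends up with a single entry for (2, 2), and the text that 
was labelled Matt 1:27 is no longer reachable.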

Here is how I would suggest implementing a solution to this problem: use 
OSIS documents and use Lucene with osisIDs as the keys.

I have found that Lucene is very fast. Input references would be 
normalized to osisIDs and these would be used for lookup. Rather than storing 
the document in this index, the original would be left on disk as is 
(perhaps compressed by verse, chapter or book as we do today). The index 
would store start offset and end offset for each and every osisID in the 
document. The start offset would be to the beginning of the element and 
the end offset would be to the end of the element. In the case of 
milestoned elements, it would be from the start of the sID element to 
the end of the corresponding eID element. It could also handle multiple 
documents by storing the document names as well.
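
To make the offsets concrete, here is a small sketch in which a plain dict 
stands in for the Lucene index and regular expressions stand in for a real 
XML parser. Both are simplifying assumptions, and DOC is a made-up 
fragment, not a complete OSIS document.

```python
import re

# Made-up fragment with one container verse and one milestoned verse.
DOC = (
    '<div type="book" osisID="Gen">'
    '<verse osisID="Gen.1.1">In the beginning</verse>'
    '<verse osisID="Gen.1.2" sID="Gen.1.2"/>And the earth'
    '<verse eID="Gen.1.2"/>'
    '</div>'
)

# Container form: <verse osisID="X"> ... </verse>
CONTAINER = re.compile(r'<verse osisID="([^"]+)">.*?</verse>', re.S)
# Milestone start: an empty <verse .../> element carrying an sID attribute.
MILESTONE_START = re.compile(r'<verse\b[^>]*\ssID="([^"]+)"[^>]*/>')


def build_offsets(doc):
    """Map each verse ID to (start, end) offsets; `end` is exclusive."""
    offsets = {}
    for match in CONTAINER.finditer(doc):
        offsets[match.group(1)] = (match.start(), match.end())
    for match in MILESTONE_START.finditer(doc):
        end_tag = f'<verse eID="{match.group(1)}"/>'
        # Span runs from the start of the sID element to the end of the
        # corresponding eID element.
        end = doc.index(end_tag, match.end()) + len(end_tag)
        offsets[match.group(1)] = (match.start(), end)
    return offsets


offsets = build_offsets(DOC)
start, end = offsets["Gen.1.2"]
fragment = DOC[start:end]
```

The document itself is never copied into the index; only the ID and two 
numbers per element are kept.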

A "passage", say Gen 50:1 - Ex 2, would become an osisRef of 
Gen.50.1-Exod.2. This in turn would indicate the start and end of the 
fragment in the document as the start offset of Gen.50.1 and the end 
offset of Exod.2.
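
Resolving such a range is then just two lookups. In this sketch the 
offsets are invented numbers and passage_span is a hypothetical helper; it 
assumes the index holds entries for chapters as well as verses.

```python
# Invented (start, end) offsets for a few osisIDs in one document.
OFFSETS = {
    "Gen.50": (1000, 3000),
    "Gen.50.1": (1000, 1080),
    "Exod.1": (3100, 4500),
    "Exod.2": (4500, 6000),
}


def passage_span(osis_ref, offsets):
    """Return (start, end) for 'A-B': start of A through end of B."""
    first, _, last = osis_ref.partition("-")
    last = last or first  # a single osisID is a range of one
    return offsets[first][0], offsets[last][1]
```

With these numbers, passage_span("Gen.50.1-Exod.2", OFFSETS) gives 
(1000, 6000).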

This solution allows:
    for books of the bible to be in any order as required for a 
particular work.
    for there to be any number of chapters in a book,
    for there to be any number of verses in a chapter
    for there to be prefaces, introductions, titles, colophons, 
appendices  and any other elements allowed by OSIS.
    for the apocrypha to be before or after the NT or in a separate file.
    for each book or a set of books to be in separate files (in fact, 
one could go to the absurd level of doing it by paragraph).
    for any other book (e.g. dictionary, Koran, ...) with a well defined 
hierarchical system of reference to be indexed or stored.
    for the OSIS documents to be used for any other purpose by any other 
system that can handle OSIS docs (ignoring compression and encryption;)
       (Maybe we don't want this last one;)

I would also advocate storing two other contexts: one for a minimal 
well-formed xml fragment and one for a minimum display context (which 
would also be a well-formed xml fragment). The reason for these is that 
OSIS does not require that a verse, chapter or any other division be 
well formed. It only requires that the divs that are children of the 
osisText element be well formed.

Well-formedness is a requirement for using xml processors (which JSword 
uses). So having a minimal xml context will solve that.

The display context is needed to provide enough information to render 
the verse correctly. Two examples: First, in poetry (e.g. a Psalm), a 
verse may be wholly contained in a line of a "poem" and thus be well 
formed, but unless it is seen as part of the whole, it cannot be 
correctly rendered. Second, consider the words of Jesus (always a good 
idea:). It may be that a much earlier verse marks where the words of 
Jesus begin and a much later verse marks where his speech ends. Looking 
at the verse in isolation, it is impossible to know that the verse 
contains Jesus' words. So trying to apply red-letter text to his words 
would fail when looking at the verse alone. 
The trick would be deciding what constitutes a display context. It 
should at least encompass the larger of the paragraphs, quotes, speeches 
or line groups in which the verse appears/intersects, if any.
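
One possible reading of that rule, as a sketch: take the verse's own span 
and widen it to cover every paragraph, quote, speech or line group that 
overlaps it. The function name and the (start, end) tuples are 
assumptions; ends are exclusive.

```python
def display_context(verse, structures):
    """Widen `verse` to cover every structure span that overlaps it."""
    verse_start, verse_end = verse
    start, end = verse
    for s_start, s_end in structures:
        # Only structures that intersect the verse itself are considered.
        if s_start < verse_end and verse_start < s_end:
            start = min(start, s_start)
            end = max(end, s_end)
    return start, end


# Invented offsets: a paragraph, a long speech, and an unrelated line group.
structures = [(100, 200), (50, 400), (500, 600)]
context = display_context((120, 160), structures)
```

Here the speech is the largest structure containing the verse, so the 
context grows to (50, 400) and the unrelated line group is ignored.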

The other advantage to using Lucene is that the indexes can be changed 
to add more information at a later time and existing processes would not 
need to be changed unless they were to take advantage of the additions. 
A given application, say BibleTime, could augment the index with further 
information (e.g. notes, internal processing info, ...) and BibleDesktop 
could use that index without needing to handle that additional info.

Of course, the above does not solve the mapping of one v11n scheme to 
another.


