[sword-devel] verse parsing

Wed Mar 29 02:37:11 MST 2006

First, on the topic of OSIS book abbreviations:
Almost everything you should ever need for Bibles is at 
http://www.crosswire.org/~chrislit/osis/BibleBookNames.html

There are also the following, less up-to-date xml files, which add more 
non-canonical materials. These were the source materials for the 
above, but I haven't maintained them since creating the above list of 
Bible books.
Bible: http://www.crosswire.org/~chrislit/osis/bible.xml
OT Pseudepigrapha: http://www.crosswire.org/~chrislit/osis/otp.xml
NT Apocrypha: http://www.crosswire.org/~chrislit/osis/nta.xml
Nag Hammadi codices: http://www.crosswire.org/~chrislit/osis/naghammadi.xml
(named) Dead Sea Scrolls: http://www.crosswire.org/~chrislit/osis/qumran.xml
Mormon texts: http://www.crosswire.org/~chrislit/osis/lds.xml
Classical sources (but actually just Josephus, currently): 
http://www.crosswire.org/~chrislit/osis/classical.xml

Now, looking at the list of files at the LXXM source site 
(http://ccat.sas.upenn.edu/gopher/text/religion/biblical/lxxmorph/), 
there are four categories of problems with mapping files onto OSIS IDs:

1) Books with <number>.<abbrev>.<number>.mlxx style filenames, e.g. 
01.Gen.1.mlxx & 02.Gen.2.mlxx. These are just single books divided into 
two files and should be concatenated.

2) Apocryphal books. These should all appear in the list cited at 
the top. E.g. Judith = Jdt, Tobit = Tob, Odes = Odes, Psalms of Solomon 
= PssSol.

3) Ezras. The Ezras are just absurdly icky. For the LXX, I recommend NOT 
just mapping 1Esdras to Ezra and 2Esdras to Nehemiah. They don't actually 
line up correctly like this. Whole volumes could and probably have been 
written about the Ezras, and I would strongly recommend just tagging 
them 1Esd and 2Esd, respectively.

[Specifically:
Hebrew Ezra = Vulgate 1Esd = KJV Ezra
Hebrew Neh = Vulgate 2Esd = KJV Neh
LXX 1Esd = Vulgate 3Esd = KJV 1Esd = 2Chr 35-36 paraphrased + Ezra + Neh 
7:38-8:12 + other material
LXX 2Esd = Hebrew Ezra+Neh = Vulgate 1Esd+2Esd = KJV Ezra+Neh
And 4Esd(=4Ezra+5Ezra+6Ezra) makes things even more complicated--but 
luckily isn't of import since it isn't in the LXX.]

4) Variant books, namely (Josh|Judges)(B|A), Tobit(BA|S), 
(Daniel|Bel|Sus)(OG|Th)--6 books with 2 variants each. I would strongly 
recommend treating each of these 12 books as individual books. Give them 
unique osisIDs, present them to the user as unique books, etc. This is 
how Logos does it. This is how BibleWorks does it. And I believe STEP 
even incorporated a separate book ID to account for the 6 additional 
books in Rahlfs. Rahlfs is a sufficiently important source text that you 
really ought to do whatever you need to do to accommodate it in its 
native form. You should NOT wedge it into another versification system (e.g. 
one with only one book each of Joshua, Judges, Tobit, Daniel, Bel & the 
Dragon, and Susanna).

I don't have my Rahlfs with me, but I really don't think presenting it 
in a tabular view with both traditions on a single screen is the right 
way to go. If we're working within the KJV versification, that's a 
suitable compromise. But if we're permitted to make changes to the 
underlying versification system in Sword and present Rahlfs in its OWN 
versification system, the books should be separated.

Towards that end, I would recommend adding 6 books to the 
BibleBookNames.html file cited at the top, to accommodate the 6 variant 
books in Rahlfs: JoshA, JudgA, TobS, DanTh, BelTh, & SusTh. Under this 
system, JoshB = osisID Josh, JudgesB = osisID Judg, TobBA = osisID Tob, 
and the OG Daniel texts = osisIDs Dan, Bel, and Sus. Does that seem 
agreeable?
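
For illustration only, the proposed assignment could be written down as 
a lookup table. This is a Python sketch: the left-hand labels are 
hypothetical names built from the book/variant pairs in point 4 (they 
are not the actual LXXM filenames), and it assumes the B text of Judges 
takes the plain ID, in parallel with Joshua.

```python
# Hypothetical labels for the 12 variant books, mapped to osisIDs.
# The six IDs ending in A, S, or Th are the proposed additions; the other
# six reuse the existing IDs. Labels are illustrative, not real filenames.
VARIANT_TO_OSIS = {
    "JoshB": "Josh",    "JoshA": "JoshA",
    "JudgesB": "Judg",  "JudgesA": "JudgA",
    "TobitBA": "Tob",   "TobitS": "TobS",
    "DanielOG": "Dan",  "DanielTh": "DanTh",
    "BelOG": "Bel",     "BelTh": "BelTh",
    "SusOG": "Sus",     "SusTh": "SusTh",
}


def osis_id(variant_label):
    """Return the osisID for a variant label; raises KeyError if unknown."""
    return VARIANT_TO_OSIS[variant_label]
```

Every variant ends up with its own osisID, so no two texts collide 
within the single Rahlfs work.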

The only other way to deal with them is to call them part of a separate 
work and use the standard book IDs for both, but put the variants in the 
second work. I don't like that idea since they're part of the same print 
volume, a volume which is generally considered a single work.

A few more comments below...

Troy A. Griffitts wrote:
>     Obviously, my goal was to save everyone as much modification as 
> possible, but there just doesn't seem like there is a good fit for 
> modules like these.

I think DM, Martin, and I agree on this point: make it work correctly, 
regardless of how badly it breaks existing frontends. We can make 
modules requiring a new driver invisible to existing frontends and 
future frontends can support new features when they are ready to do so.

> The next thing I began to realize is that this module uses a,b,c type 
> suffixes on verses (click on the first link in this email again and 
> scroll to the bottom of the page).  This does not fit nicely into our 
> integer concept for verses.  I considered adding a 5th level: 
> Testament/Book/Chapter/Verse/Sub.  But this really breaks the whole 
> paradigm anyway, as sub will mostly be blank except when there might be 
> a letter tacked to the end.  It really doesn't solve any problems, e.g. 
> key.Verse(key.Verse()+1) still will break.  key++ would work, I guess, 
> but you'd have to always check if Sub was set to anything.  And who 
> knows what Sub really means.  Is it a replacement?  Is it really a 
> subdivision of the verse?  It just doesn't seem like it solves any 
> problems nicely.  It seems like the LXX really is sequentially 31, 31a, 
> 32, 33, 33a, 33b.  When I know that other Bibles and commentaries mean 
> the first part of 33 when they say 33a.  So adding Sub doesn't seem like 
> it gives us much except keeping Verse an integer.

We need to deal with non-integers for chapters in Greek Esther as in the 
NRSV also. In addition, those chapters aren't in sequential numerical or 
alpha-numeric order. So we'll have to deal with out-of-order chapters 
and, probably, verses. GenBooks handle that fine. Translation to 
VerseKeys is going to be a challenge.
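
One way to picture keys that are not integers and not in numeric order 
is to store the labels as an ordered list and navigate by position 
rather than by arithmetic. The Python sketch below uses the verse 
sequence quoted above (31, 31a, 32, 33, 33a, 33b); the class and method 
names are made up for illustration and are not part of the SWORD API.

```python
class OrderedKey:
    """Navigate a list of string labels by position instead of arithmetic."""

    def __init__(self, labels, position=0):
        self.labels = list(labels)   # labels in the order the text gives them
        self.position = position     # index into self.labels

    @property
    def label(self):
        return self.labels[self.position]

    def step(self, n=1):
        """Move n entries forward (or backward if n is negative)."""
        new_position = self.position + n
        if not 0 <= new_position < len(self.labels):
            raise IndexError("stepped outside the available labels")
        self.position = new_position
        return self.label


key = OrderedKey(["31", "31a", "32", "33", "33a", "33b"])
after_31 = key.step()    # the entry following "31" in this text
```

Here "next" means the next entry in the module's own sequence, so 
a frontend never has to guess whether a suffix is set.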

> The 'reference' is display like:
> 
> /JoshB/24/1
> 
> We could add a flag which says to display using a BK CH:VS format.  I 
> was thinking about adding a pattern, like letting the modules.conf file 
> specify something like:
> KeyDisplay=%1 %2:%3
> but I think this is more work for everyone than it benefits.  Besides, 
> other languages probably prefer other formats (BK CH.VS).  So I think 
> we'd like to just say something like KeyFormat=BCV

That looks like a great idea. Other LANGUAGES shouldn't be allowed to 
modify the formatting of a text. On the other hand, giving other TEXTS 
the ability to have customized presentation would be a great benefit, 
and this accommodates that very well. For example, the print NRSV Oxford 
Study Bible that I have uses BK CH.VS.
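
To make the per-text idea concrete, here is a small Python sketch of how 
a pattern like the KeyDisplay value quoted above might be applied to a 
/JoshB/24/1 style key. The function name is hypothetical, and nothing 
here is an existing SWORD setting.

```python
def format_key(path, pattern):
    """Fill %1, %2, %3, ... in pattern with the segments of a /a/b/c key."""
    parts = path.strip("/").split("/")
    out = pattern
    for i, part in enumerate(parts, start=1):
        out = out.replace(f"%{i}", part)
    return out


colon_style = format_key("/JoshB/24/1", "%1 %2:%3")   # BK CH:VS
dot_style = format_key("/JoshB/24/1", "%1 %2.%3")     # BK CH.VS
```

The pattern would belong to the text (the module), not to the user's 
locale, which is the distinction drawn above.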

> The other problem is parsing...
> Currently VerseKey provides all the nice parsing functionality that 
> figures out:
> 
> Ijn2-3:12
> 
> It can do this because it has a set of books that it know about, along 
> with all kinds of abbreviations and translated into a number of 
> languages.  Our current parser also drops suffixed letters.

I think part of the solution is to make the parser more generalized and 
to force the module to give it some parameters for parsing. Each module 
needs to tell the parser something like 1) the format and 2) valid 
books. The format might be something like a PERL regular expression: 
"($book) ([0-9]+):([0-9]+)([a-c])", where the parser then picks out the 
book, chapter, verse, and sub-verse. I have no recommendations for 
implementation and don't even know whether it is feasible.
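
Purely as an illustration of the idea, and not a proposal for the actual 
SWORD API, a module-supplied book list plus a format like the one above 
could drive a parser roughly as follows in Python. The function names 
are made up, and the sub-verse group is made optional here (an 
assumption) so that references without a letter also match.

```python
import re


def build_reference_regex(books):
    """Compile a 'BK CH:VS[sub]' regex limited to the given book names."""
    # Longest names first, so "JoshA" is tried before "Josh".
    ordered = sorted(books, key=len, reverse=True)
    alternation = "|".join(re.escape(b) for b in ordered)
    return re.compile(r"(" + alternation + r") ([0-9]+):([0-9]+)([a-c]?)$")


def parse_reference(regex, text):
    """Return (book, chapter, verse, sub) or None if text does not match."""
    m = regex.match(text)
    if m is None:
        return None
    book, chapter, verse, sub = m.groups()
    return book, int(chapter), int(verse), sub or None


regex = build_reference_regex(["Josh", "JoshA"])
parsed = parse_reference(regex, "JoshA 24:33b")
```

A book the module does not list (say, "Jude 1:1" against this regex) 
simply fails to match, which leads into the next point.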

The list of valid books is simpler. Every module should simply provide 
an ordered list of its contents (in osisID form, naturally). The parser 
then constructs a list of possible book abbreviations to use in parsing, 
excluding those books not present. For example, the LXXM is going to 
include Judith, but not Jude. So the parser would include all the 
abbreviations for Judith, but not those for Jude, and a reference to "Ju 
1:1" should parse as Jdt.1.1.
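
A rough Python sketch of that filtering step follows. The abbreviation 
table is invented for the example (it is not SWORD's real abbreviation 
data), the function names are hypothetical, and input is assumed to be 
a well-formed "BK CH:VS" string.

```python
# Invented global table: abbreviation -> candidate osisIDs in priority order.
ABBREVIATIONS = {
    "ju": ["Jude", "Jdt"],
    "jude": ["Jude"],
    "jdt": ["Jdt"],
    "judith": ["Jdt"],
    "tob": ["Tob"],
}


def abbreviations_for(module_books):
    """Keep only abbreviations that resolve to a book the module contains."""
    present = set(module_books)
    table = {}
    for abbrev, candidates in ABBREVIATIONS.items():
        kept = [b for b in candidates if b in present]
        if kept:
            table[abbrev] = kept[0]   # first surviving candidate wins
    return table


def to_osis_ref(table, text):
    """Turn 'BK CH:VS' into 'Book.CH.VS', or None if the book is unknown."""
    book, _, rest = text.partition(" ")
    chapter, _, verse = rest.partition(":")
    osis_book = table.get(book.lower())
    if osis_book is None:
        return None
    return f"{osis_book}.{int(chapter)}.{int(verse)}"


lxx_table = abbreviations_for(["Tob", "Jdt"])   # a module without Jude
ref = to_osis_ref(lxx_table, "Ju 1:1")
```

With a module that lists Jude instead of Judith, the same "Ju" would 
resolve to Jude, without any change to the parser itself.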

> Finally, if we solve these problems, and place an entry in LXXM: 
> Category=Biblical Texts, it will probably break most frontends which 
> expect all Biblical Texts to use a VerseKey.  I don't know how to solve 
> this problem.

I would just give it a different Category.

> I also considered a major change to VerseKey which would make all levels 
> strings and not integers.  I realize many frontends use integer spin 
> controls to increase/decrease chapter and verse.  There may also be 
> linear logic regarding these things.

Unfortunately I can't think of a better solution to handling the array 
of versification systems that exist. I think that's why we went with 
strings in OSIS.

--Chris