[jsword-devel] Versification representations...

Mon Nov 9 05:47:13 MST 2009

On Nov 8, 2009, at 8:35 AM, Chris Burrell wrote:

> Hi all
>
> I'm on a project using JSword and we're linking Scripture to various  
> reference materials. For various reasons, we will need to store  
> references to our material in the database and not in a Sword  
> module. Does anyone have any views as to the best way of storing the  
> verse/passage reference?

JSword has the notion of a BookDriver being the storage representation
of a Book. If you haven't found it yet, look in the JSword "limbo
project", which holds nascent and obsolete code
(http://crosswire.org/svn/jsword/trunk/jsword-limbo). There you will
see various starts of BookDrivers. For example, under src/main/java,
the package o.c.j.book.jdbc has a nascent implementation of a JDBC
driver.

It should give you an idea of where to start.

>
> I've come up with a few solutions, but not sure which would be the  
> most efficient, speed, space, etc.-wise... It will be used mainly to  
> serve the questions:
> "Does the bible reference provided by the user match any of our  
> material?", or more simply, "Does passage A overlap with passage B"?
>
> Our material will be keyed by passage as opposed to per verse, ie.  
> ranges of verses...
>
> 1- Storing the reference as is in the database, say 1Kings. 
> 2:10-15;3:2-3 would mean we'd have to do lots of string  
> manipulations to figure out overlap of 2 references

In JSword a Verse is a single point in a document, a VerseRange is a
collection of adjacent Verses and a Passage is a collection of
VerseRanges that may or may not be adjacent. The degenerate case of
a VerseRange is a single Verse and the degenerate case of a Passage
is a single VerseRange (and thus possibly a single Verse).

JSword at this point has a notion of a Verse being a specialized Key  
and there is no notion of a KeyRange or of a Passage being a  
collection of Keys. At some point we need to unify this. Then  
everything will be a Key.

Regarding Passage, there are several different types, each of which
has its own performance characteristics. Two to look at are
BitwisePassage and PassageTally.

BitwisePassage is essentially a bit map where each Verse is represented
by a bit. Obviously there needs to be a two-way mapping between the
bits and the Verse reference. This is the most frequently used Passage
in JSword and is the best representation for canonically ordered verses.
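
To make the bit map idea concrete, here is a minimal sketch using
java.util.BitSet. It is not JSword's actual Passage code; the class
and method names are made up for illustration, and the verse ordinals
are arbitrary. It assumes each verse has already been mapped to an
ordinal, so overlap becomes a test for a shared bit.

```java
import java.util.BitSet;

// Hypothetical sketch, not a JSword class: one bit per verse ordinal.
class PassageBitsSketch {
    /** Mark the inclusive ordinal range [start, end] as part of the passage. */
    static void addRange(BitSet passage, int start, int end) {
        passage.set(start, end + 1); // BitSet.set(from, to) excludes 'to'
    }

    /** Two passages overlap if they share at least one verse bit. */
    static boolean overlaps(BitSet a, BitSet b) {
        return a.intersects(b);
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        addRange(a, 30, 140);
        addRange(a, 1500, 1512);

        BitSet b = new BitSet();
        addRange(b, 141, 150);

        BitSet c = new BitSet();
        addRange(c, 140, 145);

        System.out.println(overlaps(a, b)); // false: no shared ordinal
        System.out.println(overlaps(a, c)); // true: both contain 140
    }
}
```

A bit map covering roughly 30000 verses is only a few KB, so keeping
one per passage in memory is cheap.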

PassageTally is a list of weighted Verses. This is used for a best- 
match, prioritized search result.

Internally, SWORD has the notion of linking. This is not yet exposed
in Key or Verse, but it is present in JSword's drivers for SWORD
modules. We'll be exposing this before too long. JSword will have
routines in Key and Verse to determine whether a Verse is linked to
another Verse and whether a Verse is part of a set of linked verses.
In JSword lingo, a link is an Alias.

Currently, a SWORD commentary is laid out as a Bible, where an entry  
may be a range of verses. JSword hasn't exposed the notion of  
aliasing. You can see this in BibleDesktop by setting the Option/ 
Preference to show Commentaries as Bibles and pick a commentary in the  
list of Bibles. You'll see that the content repeats. That is, if Gen  
1:1-3 is shown and this is stored in the SWORD module as a linked set,  
then it is shown 3 times, once for each of Gen 1:1, Gen 1:2 and Gen  
1:3. It should not do this, but there are no methods in Key that will
help determine that Gen 1:1 is part of a link set that encompasses
verses 2 and 3. With such methods, in parallel view we would set the
colspan attribute to 3 for Gen 1:1 and when showing the verse number
we'd show 1-3. Then when processing 2 and 3 we'd skip them as they
link back to 1.

So to answer your bullet, JSword at a minimum needs (the names and
signatures can change):
/**
 * Determine whether this and another reference are aliases of each other.
 * That is, do they have the same getRawText()?
 * @param that the Key to compare.
 * @return true if this is an alias of that
 */
abstract boolean isAliased(Key that);
/**
 * Get the Key representing a range of adjacent, aliased keys for a
 * single Key.
 * This operation is potentially very expensive.
 * @return the aliased set.
 * @throws IllegalArgumentException if this Key has a cardinality > 1.
 */
abstract Key getAliases();

Then with these primitives, one can determine overlap.

>
> 2- Storing the beginning and end of each section say as a group of  
> sub references { (book, start_chapter, start_verse, end_chapter,  
> end_verse)+ }.
> In this case a reference would be many of the previous definition. I  
> can see how we could work it out here, but i can see also having to  
> do lots of index range scans on our database

Working backwards, there needs to be a table to hold referenced pieces  
of text. Something like:
create table BiblicalText (
    location NUMBER NOT NULL,
    text        CLOB         NOT NULL,
    CONSTRAINT BT_pk PRIMARY KEY (location)
);

Then there needs to be a way to get a BiblicalText given a Verse.
Something like:
create table VerseReference (
     -- The canonical order of the verse in a Bible,
     -- where 1 is the first verse, 2 is the next and so on.
     ordinal NUMBER NOT NULL,
     -- A normalized reference for the Verse,
     -- e.g. an osisID or a tuple(t,b,c,v)
     reference STRING NOT NULL,
     location NUMBER NOT NULL,
     CONSTRAINT bt_fk FOREIGN KEY (location) REFERENCES BiblicalText (location)
);
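
Once every verse has an ordinal like this, a passage can be held as a
list of inclusive ordinal ranges, and the overlap question reduces to
integer comparisons. The following is only a sketch under that
assumption; OrdinalRange and the other names are hypothetical and are
not part of JSword.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: passages as lists of inclusive ordinal ranges.
class RangeOverlapSketch {
    /** An inclusive range of verse ordinals. */
    static final class OrdinalRange {
        final int start;
        final int end;

        OrdinalRange(int start, int end) {
            if (start > end) {
                throw new IllegalArgumentException("start > end");
            }
            this.start = start;
            this.end = end;
        }

        /** Ranges overlap when each one starts no later than the other ends. */
        boolean overlaps(OrdinalRange that) {
            return this.start <= that.end && that.start <= this.end;
        }
    }

    /** Passages overlap if any pair of their ranges overlaps. */
    static boolean overlaps(List<OrdinalRange> a, List<OrdinalRange> b) {
        for (OrdinalRange x : a) {
            for (OrdinalRange y : b) {
                if (x.overlaps(y)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<OrdinalRange> material = Arrays.asList(
                new OrdinalRange(30, 140), new OrdinalRange(1500, 1512));
        List<OrdinalRange> query = Arrays.asList(new OrdinalRange(1512, 1520));
        System.out.println(overlaps(material, query)); // true: both contain 1512
    }
}
```

The same pair of comparisons can be used as a WHERE clause if the
ranges are stored as start and end columns in a table.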

While JSword has not exposed it yet, you might want to represent  
testament, book and chapter introductions as specially named  
references. SWORD uses the tuple approach with something like:
(Testament, Book, Chapter, Verse)
(0,x,x,x) -- OT Introduction
(1,0,x,x) -- NT Intro
(1,1,0,x) -- Book intro to Matt
(1,1,1,0) -- Chapter intro to Matt 1
Note: the x values should be 0, but really are "don't care" values.
JSword will be following this notion, which will squeeze more bits
into the bit maps.

I think versification needs to be held separately from storage and  
essentially is an optimization of it. If you look at  
o.c.j.book.versification.BookInfo you'll see the fundamental way  
JSword maps a number to a verse reference.
This will be extended to other fixed canons. Essentially one needs to  
know the order and names of books, the number of chapters per book and  
the number of verses per chapter. Not as actuals but as potentials.  
That is, a Book which is a translation of an old, damaged scroll might  
not have all verses in a chapter or even all chapters. The map should  
represent the undamaged scroll.
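
As an illustration of mapping a reference to a number from nothing
but the verse counts, here is a small sketch. It is not the BookInfo
implementation; the class name is hypothetical, the canon in main()
is a made-up toy (two books) rather than real KJV data, and
introductions (verse 0) are ignored.

```java
// Hypothetical sketch of a versification table; not JSword's BookInfo.
class VersificationSketch {
    // verseCounts[book][chapter] = potential number of verses (0-based indexes)
    private final int[][] verseCounts;
    // chapterStart[book][chapter] = number of verses before that chapter
    private final int[][] chapterStart;

    VersificationSketch(int[][] verseCounts) {
        this.verseCounts = verseCounts;
        this.chapterStart = new int[verseCounts.length][];
        int running = 0;
        for (int b = 0; b < verseCounts.length; b++) {
            chapterStart[b] = new int[verseCounts[b].length];
            for (int c = 0; c < verseCounts[b].length; c++) {
                chapterStart[b][c] = running;
                running += verseCounts[b][c];
            }
        }
    }

    /** Book, chapter and verse are 1-based; the result is a 1-based ordinal. */
    int ordinal(int book, int chapter, int verse) {
        int count = verseCounts[book - 1][chapter - 1];
        if (verse < 1 || verse > count) {
            throw new IllegalArgumentException("no such verse: " + verse);
        }
        return chapterStart[book - 1][chapter - 1] + verse;
    }

    public static void main(String[] args) {
        // Toy canon: book 1 has chapters of 3 and 2 verses, book 2 has one of 4.
        VersificationSketch v = new VersificationSketch(new int[][] {{3, 2}, {4}});
        System.out.println(v.ordinal(1, 1, 1)); // 1
        System.out.println(v.ordinal(1, 2, 1)); // 4
        System.out.println(v.ordinal(2, 1, 4)); // 9
    }
}
```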

This argues for an ordered list of Books and the number of chapters in  
each book.
create table BookNames (
     bookNumber NUMBER NOT NULL,
     bookName    STRING NOT NULL,
     chapterCount NUMBER NOT NULL,
     CONSTRAINT bn_pk PRIMARY KEY (bookNumber)
);

And the number of verses in each chapter:
create table Chapter (
    bookNumber       NUMBER NOT NULL,
    chapterNumber  NUMBER NOT NULL,
    chapterSize         NUMBER NOT NULL,
    CONSTRAINT ch_bn_fk FOREIGN KEY (bookNumber) REFERENCES BookNames (bookNumber)
);

At this time JSword is stuck on the KJV versification, so going beyond  
that will be a re-implementation of BookInfo. This is planned, but has  
not happened yet. Basically, we'll have a file representation that
JSword can read in to answer the questions. In essence, we'll use a
file as the database. The form probably will be a serialization of a  
populated BookInfo object.

>
> 3- Number each verse of the bible from 1 to 30000 or something like  
> that, and then workout and store each verse that is included in the  
> reference in a table somewhere in the database
> The benefit here seems to be that we would get lots of unique index  
> lookups, but maybe the number of lookups would actually be better  
> off doing range scans... Also, we would have quite a bit more disk  
> space overhead, if we're storing a row for each verse.
>
> 4- Number each verse of the bible from 1 to 30000 or something like  
> that, but only store the ranges say verses 30-140 + verses 1500-1512
>
> 5- Numbering each verse, but keying the numbers by book, say Exodus  
> verse 750, 751, 752, etc.
>
> 6- We have the benefit of working in Java with a Java database, and  
> so could write a Java stored procedure to parse whatever solution  
> comes out of here... It would have to be fast though, given the end  
> product is the web, and the main activity will be searching across
> scripture references...

I'm a bit confused here. Your discussion is that of a relational  
database, but you refer to a Java database. By a Java database, do you  
mean a relational DBMS that is addressed via JDBC or do you mean an  
OODBMS, such as Objectivity or ObjectStore? These days, next to no one  
uses an OODBMS.

>
> We'll be doing many searches on our data and updating it only very  
> rarely. I'm a bit at a loss as to best way of doing this... How does  
> JSword cope with this? or does it uniquely do scripture lookups and  
> not scripture overlap? (ie. working out whether two portions of  
> scripture overlap with each other).

Having said all that, I have another idea. It is not well thought out.
How about creating a JDBC driver for a SWORD module? Still use Lucene  
to create a secondary index for searching. The trick would be to map  
prepared statements to JSword calls.

SWORD is a static database of all kinds of books. It is highly  
optimized for lookups. It consists of index and data files per  
testament.

The index can be thought of as an array indexed by verse number (this
is the same as VerseReference.ordinal in the above table), holding a
pointer (offset and size) into the data file. If two entries in the
index point to the same location in the data file, then they are
aliases/links.
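
A rough in-memory sketch of that idea follows. It does not read the
real SWORD file format; IndexEntry and the other names are
hypothetical, and the offsets and sizes are invented. It only shows
how an alias test falls out of comparing index entries.

```java
// Hypothetical sketch of a verse index; not the actual SWORD file layout.
class IndexAliasSketch {
    /** Pointer into the data file: where the text starts and how long it is. */
    static final class IndexEntry {
        final long offset;
        final int size;

        IndexEntry(long offset, int size) {
            this.offset = offset;
            this.size = size;
        }
    }

    private final IndexEntry[] index; // array position = verse ordinal

    IndexAliasSketch(IndexEntry[] index) {
        this.index = index;
    }

    /** Two verses are aliases if their non-empty entries point at the same data. */
    boolean isAliased(int ordinalA, int ordinalB) {
        IndexEntry a = index[ordinalA];
        IndexEntry b = index[ordinalB];
        return a.size > 0 && a.offset == b.offset && a.size == b.size;
    }

    public static void main(String[] args) {
        IndexAliasSketch idx = new IndexAliasSketch(new IndexEntry[] {
            new IndexEntry(0, 0),    // ordinal 0: empty placeholder
            new IndexEntry(0, 120),  // ordinals 1-3 share one entry's data,
            new IndexEntry(0, 120),  // e.g. a commentary entry covering
            new IndexEntry(0, 120),  // three verses
            new IndexEntry(120, 80)  // ordinal 4 has its own text
        });
        System.out.println(idx.isAliased(1, 3)); // true
        System.out.println(idx.isAliased(3, 4)); // false
    }
}
```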

Since updates are infrequent, this works well. When a verse is  
updated, it is appended to the data file and all the references to it  
are updated to point to the new location. Over time, the data file  
will grow and should be rebuilt from scratch when it gets "too big".

>
> Any ideas anyone?
> Chris