[jsword-devel] Versification representations...
DM Smith
dmsmith at crosswire.org
Mon Nov 9 05:47:13 MST 2009
On Nov 8, 2009, at 8:35 AM, Chris Burrell wrote:
> Hi all
>
> I'm on a project using JSword and we're linking Scripture to various
> reference materials. For various reasons, we will need to store
> references to our material in the database and not in a Sword
> module. Does anyone have any views as to the best way of storing the
> verse/passage reference?
JSword has the notion of a BookDriver, which is the storage
representation of a Book. If you haven't found it yet, look in the
JSword "limbo project", which holds nascent and obsolete code
(http://crosswire.org/svn/jsword/trunk/jsword-limbo). There you will
see several starts at BookDrivers. For example, under src/main/java,
the package o.c.j.book.jdbc has a nascent implementation of a JDBC
driver.
It should give you an idea of where to start.
>
> I've come up with a few solutions, but not sure which would be the
> most efficient, speed, space, etc.-wise... It will be used mainly to
> serve the questions:
> "Does the bible reference provided by the user match any of our
> material?", or more simply, "Does passage A overlap with passage B"?
>
> Our material will be keyed by passage as opposed to per verse, ie.
> ranges of verses...
>
> 1- Storing the reference as is in the database, say 1Kings.
> 2:10-15;3:2-3 would mean we'd have to do lots of string
> manipulations to figure out overlap of 2 references
In JSword a Verse is a single point in a document, a VerseRange is a
collection of adjacent Verses and a Passage is a collection of
VerseRanges that may or may not be adjacent. The degenerate case of
a VerseRange is a single Verse and the degenerate case of a Passage
is a single VerseRange (and thus possibly a single Verse).
JSword at this point has a notion of a Verse being a specialized Key
and there is no notion of a KeyRange or of a Passage being a
collection of Keys. At some point we need to unify this. Then
everything will be a Key.
Regarding Passage, there are several different types, each of which
has its own performance characteristics. Two to look at are
RocketPassage and PassageTally.
RocketPassage is essentially a bit map where each Verse is represented
by a bit. Obviously there needs to be a two way mapping between the
bits and the Verse reference. This is the most frequently used Passage
in JSword and is the best representation for canonically ordered verses.
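As a rough sketch of the bit map idea, and of how it answers "does
passage A overlap with passage B", here is a toy version built on
java.util.BitSet. This is not JSword's implementation; the class and
method names are made up for illustration, and it assumes every verse
has already been mapped to an integer ordinal.

```java
import java.util.BitSet;

/** Illustrative only: a passage held as a bit set of verse ordinals. */
final class BitmapPassageSketch {
    private final BitSet verses = new BitSet();

    /** Adds the inclusive range of verse ordinals [start, end]. */
    BitmapPassageSketch addRange(int start, int end) {
        verses.set(start, end + 1); // BitSet.set(from, to) excludes 'to'
        return this;
    }

    /** True if the two passages share at least one verse. */
    boolean overlaps(BitmapPassageSketch other) {
        return verses.intersects(other.verses);
    }

    /** Number of verses in the passage. */
    int size() {
        return verses.cardinality();
    }

    public static void main(String[] args) {
        BitmapPassageSketch a =
                new BitmapPassageSketch().addRange(30, 140).addRange(1500, 1512);
        BitmapPassageSketch b = new BitmapPassageSketch().addRange(140, 150);
        System.out.println("a overlaps b: " + a.overlaps(b));
    }
}
```

With roughly 30000 verses the bit set is under 4 KB per passage, and
the overlap test is a word-by-word AND.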
PassageTally is a list of weighted Verses. This is used for a best-
match, prioritized search result.
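To make the weighted list idea concrete, here is a minimal sketch. It
assumes the weight is simply a hit count per verse ordinal, which may
not be how PassageTally computes its weights; the names are
hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: verse ordinals ranked by how often they were hit. */
final class VerseTallySketch {
    private final Map<Integer, Integer> weights = new HashMap<>();

    /** Records one hit for the given verse ordinal. */
    void hit(int ordinal) {
        weights.merge(ordinal, 1, Integer::sum);
    }

    /** Returns ordinals with the heaviest first; ties fall back to canonical order. */
    List<Integer> ranked() {
        List<Integer> result = new ArrayList<>(weights.keySet());
        result.sort((a, b) -> {
            int byWeight = Integer.compare(weights.get(b), weights.get(a));
            return byWeight != 0 ? byWeight : Integer.compare(a, b);
        });
        return result;
    }
}
```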
Internally, SWORD has the notion of linking. This is not yet exposed
in Key or Verse, although it is present in JSword's drivers for
SWORD modules. We'll be exposing this before too long. JSword will
have routines in Key and Verse to determine whether a Verse is linked
to another Verse and whether a Verse is part of a set of linked
verses. In JSword lingo, a link is an Alias.
Currently, a SWORD commentary is laid out as a Bible, where an entry
may be a range of verses. JSword hasn't exposed the notion of
aliasing. You can see this in BibleDesktop by setting the Option/
Preference to show Commentaries as Bibles and pick a commentary in the
list of Bibles. You'll see that the content repeats. That is, if Gen
1:1-3 is shown and this is stored in the SWORD module as a linked set,
then it is shown 3 times, once for each of Gen 1:1, Gen 1:2 and Gen
1:3. It should not do this, but there are no methods in Key that will
help determine that Gen 1:1 is part of a link set that encompasses
verses 2 and 3. With such methods, in parallel view we would set the
colspan attribute to 3 for Gen 1:1 and when showing the verse number
we'd show 1-3. Then when processing 2 and 3 we'd skip them as they
link back to 1.
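The display logic described above could look something like the
following sketch. It assumes that each verse in a chapter can be asked
for the storage location of its text (a hypothetical long per verse),
and that adjacent verses with the same location form one linked set.
The colspan is then just the length of each group.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: collapse adjacent verses that share a storage location. */
final class LinkedVerseGrouper {
    /**
     * @param firstVerse number of the verse stored at locations[0]
     * @param locations  storage location of each consecutive verse
     * @return labels such as "1-3" for a linked set or "4" for a lone verse
     */
    static List<String> group(int firstVerse, long[] locations) {
        List<String> labels = new ArrayList<>();
        int i = 0;
        while (i < locations.length) {
            int j = i;
            // Extend the group while the next verse points at the same text.
            while (j + 1 < locations.length && locations[j + 1] == locations[i]) {
                j++;
            }
            int start = firstVerse + i;
            int end = firstVerse + j;
            labels.add(start == end ? String.valueOf(start) : start + "-" + end);
            i = j + 1; // skip the verses that link back to 'start'
        }
        return labels;
    }
}
```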
So to answer your bullet, JSword at a minimum needs the following
(the names and signatures can change):
/**
 * Determine whether this and another reference are aliases of each other.
 * That is, do they have the same getRawText()?
 * @param that the key to compare.
 * @return true if this is an alias of that
 */
abstract boolean isAliased(Key that);
/**
 * Get the Key representing a range of adjacent, aliased keys for a single Key.
 * This operation is potentially very expensive.
 * @return the aliased set.
 * @throws IllegalArgumentException if this Key has a cardinality > 1.
 */
abstract Key getAliases();
Then with these primitives, one can determine overlap.
>
> 2- Storing the beginning and end of each section say as a group of
> sub references { (book, start_chapter, start_verse, end_chapter,
> end_verse)+ }.
> In this case a reference would be many of the previous definition. I
> can see how we could work it out here, but i can see also having to
> do lots of index range scans on our database
Working backwards, there needs to be a table to hold referenced pieces
of text. Something like:
create table BiblicalText (
location NUMBER NOT NULL,
text CLOB NOT NULL,
CONSTRAINT BT_pk PRIMARY KEY (location)
);
Then there needs to be a way to get a BiblicalText given a Verse.
Something like:
create table VerseReference (
-- The canonical order of the verse in a Bible, where 1 is the first
-- verse, 2 is the next and so on.
ordinal NUMBER NOT NULL,
-- A normalized reference for the Verse, e.g. an osisID or a tuple(t,b,c,v).
reference STRING NOT NULL,
location NUMBER NOT NULL,
CONSTRAINT bt_fk FOREIGN KEY (location) REFERENCES BiblicalText (location)
);
While JSword has not exposed it yet, you might want to represent
testament, book and chapter introductions as specially named
references. SWORD uses the tuple approach with something like:
(Testament, Book, Chapter, Verse)
(0,x,x,x) -- OT Introduction
(1,0,x,x) -- NT Intro
(1,1,0,x) -- Book intro to Matt
(1,1,1,0) -- Chapter intro to Matt 1
Note: the x values should be 0, but are really don't-care values.
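One way to read such tuples, sketched below, is to look for the first
zero among book, chapter and verse. This assumes the don't-care values
are actually stored as 0; the enum and method names are hypothetical
and not part of SWORD or JSword.

```java
/** Illustrative only: classify a (testament, book, chapter, verse) tuple. */
final class IntroTupleSketch {
    enum Kind { TESTAMENT_INTRO, BOOK_INTRO, CHAPTER_INTRO, VERSE }

    static Kind kindOf(int testament, int book, int chapter, int verse) {
        if (testament < 0 || book < 0 || chapter < 0 || verse < 0) {
            throw new IllegalArgumentException("tuple fields must be non-negative");
        }
        if (book == 0) {
            return Kind.TESTAMENT_INTRO; // e.g. (1,0,0,0)
        }
        if (chapter == 0) {
            return Kind.BOOK_INTRO;      // e.g. (1,1,0,0)
        }
        if (verse == 0) {
            return Kind.CHAPTER_INTRO;   // e.g. (1,1,1,0)
        }
        return Kind.VERSE;
    }
}
```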
JSword will be following this notion, which will squeeze in more bits
into the bit maps.
I think versification needs to be held separately from storage and
essentially is an optimization of it. If you look at
o.c.j.book.versification.BookInfo you'll see the fundamental way
JSword maps a number to a verse reference.
This will be extended to other fixed canons. Essentially one needs to
know the order and names of books, the number of chapters per book and
the number of verses per chapter. Not as actuals but as potentials.
That is, a Book which is a translation of an old, damaged scroll might
not have all verses in a chapter or even all chapters. The map should
represent the undamaged scroll.
This argues for an ordered list of Books and the number of chapters in
each book.
create table BookNames (
bookNumber NUMBER NOT NULL,
bookName STRING NOT NULL,
chapterCount NUMBER NOT NULL,
CONSTRAINT bn_pk PRIMARY KEY (bookNumber)
);
And the number of verses in each chapter:
create table Chapter (
bookNumber NUMBER NOT NULL,
chapterNumber NUMBER NOT NULL,
chapterSize NUMBER NOT NULL,
CONSTRAINT ch_bn_fk FOREIGN KEY (bookNumber) REFERENCES BookNames (bookNumber)
);
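Given those two tables, mapping between a verse reference and its
ordinal is a matter of summing chapter sizes. The sketch below shows
the idea with a toy canon held in an array (the sizes passed in by the
caller are illustrative, and the class is not JSword's BookInfo). Once
references are ordinals, checking whether two verse ranges overlap is
a pair of comparisons.

```java
/** Illustrative only: ordinal mapping driven by per-chapter verse counts. */
final class ToyVersification {
    // chapterSizes[b][c] = number of verses in chapter c+1 of book b+1
    private final int[][] chapterSizes;

    ToyVersification(int[][] chapterSizes) {
        this.chapterSizes = chapterSizes;
    }

    /** Maps 1-based (book, chapter, verse) to a 1-based ordinal. */
    int toOrdinal(int book, int chapter, int verse) {
        int ordinal = 0;
        for (int b = 0; b < book - 1; b++) {
            for (int size : chapterSizes[b]) {
                ordinal += size; // all verses of earlier books
            }
        }
        for (int c = 0; c < chapter - 1; c++) {
            ordinal += chapterSizes[book - 1][c]; // earlier chapters of this book
        }
        return ordinal + verse;
    }

    /** Maps a 1-based ordinal back to {book, chapter, verse}. */
    int[] fromOrdinal(int ordinal) {
        if (ordinal < 1) {
            throw new IllegalArgumentException("ordinal out of range: " + ordinal);
        }
        int remaining = ordinal;
        for (int b = 0; b < chapterSizes.length; b++) {
            for (int c = 0; c < chapterSizes[b].length; c++) {
                if (remaining <= chapterSizes[b][c]) {
                    return new int[] {b + 1, c + 1, remaining};
                }
                remaining -= chapterSizes[b][c];
            }
        }
        throw new IllegalArgumentException("ordinal out of range: " + ordinal);
    }

    /** True if the inclusive ordinal ranges [startA, endA] and [startB, endB] intersect. */
    static boolean rangesOverlap(int startA, int endA, int startB, int endB) {
        return startA <= endB && startB <= endA;
    }
}
```

In a database the same overlap test becomes a WHERE clause over two
indexed columns holding the start and end ordinals of each range.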
At this time JSword is stuck on the KJV versification, so going beyond
that will require a re-implementation of BookInfo. This is planned, but
has not happened yet. Basically, we'll have a file representation that
JSword can read in to answer the questions. In essence, we'll use a
file as the database. The form probably will be a serialization of a
populated BookInfo object.
>
> 3- Number each verse of the bible from 1 to 30000 or something like
> that, and then workout and store each verse that is included in the
> reference in a table somewhere in the database
> The benefit here seems to be that we would get lots of unique index
> lookups, but maybe the number of lookups would actually be better
> off doing range scans... Also, we would have quite a bit more disk
> space overhead, if we're storing a row for each verse.
>
> 4- Number each verse of the bible from 1 to 30000 or something like
> that, but only store the ranges say verses 30-140 + verses 1500-1512
>
> 5- Numbering each verse, but keying the numbers by book, say Exodus
> verse 750, 751, 752, etc.
>
> 6- We have the benefit of working in Java with a Java database, and
> so could write a Java stored procedure to parse whatever solution
> comes out of here... It would have to be fast though, given the end
> product is the web, and the main activity will searching across
> scripture references...
I'm a bit confused here. Your discussion is that of a relational
database, but you refer to a Java database. By a Java database, do you
mean a relational DBMS that is addressed via JDBC or do you mean an
OODBMS, such as Objectivity or ObjectStore? These days, next to no one
uses an OODBMS.
>
> We'll be doing many searches on our data and updating it only very
> rarely. I'm a bit at a loss as to best way of doing this... How does
> JSword cope with this? or does it uniquely do scripture lookups and
> not scripture overlap? (ie. working out whether two portions of
> scripture overlap with each other).
Having said all that, I have another idea. It is not well thought out.
How about creating a JDBC driver for a SWORD module? Still use Lucene
to create a secondary index for searching. The trick would be to map
prepared statements to JSword calls.
SWORD is a static database of all kinds of books. It is highly
optimized for lookups. It consists of index and data files per
testament.
The index can be thought of as an array indexed by verse number (this
is the same as VerseReference.ordinal in the above table), holding a
pointer (offset and size) into the data file. If two entries in the
index point to the same location in the data file, then they are
aliases/links.
Since updates are infrequent, this works well. When a verse is
updated, it is appended to the data file and all the references to it
are updated to point to the new location. Over time, the data file
will grow and should be rebuilt from scratch when it gets "too big".
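The index-plus-data-file layout and the append-on-update behaviour can
be sketched in memory as follows. This is a simplification, not the
SWORD file format: a ByteArrayOutputStream stands in for the data
file, two int arrays stand in for the index, and all names are
hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative only: an index of (offset, size) entries over append-only data. */
final class AppendOnlyStoreSketch {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final int[] offsets; // one entry per verse ordinal, -1 if absent
    private final int[] sizes;

    AppendOnlyStoreSketch(int verseCount) {
        offsets = new int[verseCount + 1]; // slot 0 unused so ordinals start at 1
        sizes = new int[verseCount + 1];
        Arrays.fill(offsets, -1);
    }

    /** Appends text once and points every ordinal in [first, last] at it. */
    void put(int first, int last, String text) {
        int offset = data.size();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        data.write(bytes, 0, bytes.length);
        for (int o = first; o <= last; o++) {
            offsets[o] = offset;
            sizes[o] = bytes.length;
        }
    }

    /** Appends the new text and repoints every alias of the given ordinal. */
    void update(int ordinal, String text) {
        int oldOffset = offsets[ordinal];
        int newOffset = data.size();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        data.write(bytes, 0, bytes.length); // the old bytes stay behind as garbage
        for (int o = 1; o < offsets.length; o++) {
            if (o == ordinal || (oldOffset >= 0 && offsets[o] == oldOffset)) {
                offsets[o] = newOffset;
                sizes[o] = bytes.length;
            }
        }
    }

    String get(int ordinal) {
        if (offsets[ordinal] < 0) {
            return "";
        }
        return new String(data.toByteArray(), offsets[ordinal], sizes[ordinal],
                StandardCharsets.UTF_8);
    }

    /** Two entries are aliases when they point at the same bytes. */
    boolean isAliased(int a, int b) {
        return offsets[a] >= 0 && offsets[a] == offsets[b] && sizes[a] == sizes[b];
    }

    int dataSize() {
        return data.size();
    }
}
```

The data only ever grows here, which is why a periodic rebuild is
needed to reclaim the space held by superseded entries.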
>
> Any ideas anyone?
> Chris