[jsword-devel] Versification representations...
Chris Burrell
christopher at burrell.me.uk
Mon Nov 9 06:33:17 MST 2009
Wow! Thanks for all that. It definitely gives me areas to think about... By
Java database I meant an RDBMS written in Java (Java DB/Apache Derby),
configured in our case to run as part of the same JVM as our app. What I
meant earlier is that with Derby, you basically declare new procedures as
static java methods, and then when it comes to executing your procedure, it
uses reflection to invoke your static method.
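
As a minimal sketch of that mechanism (the class name, method name and SQL function name are all made up for illustration, and it uses a Derby function rather than a procedure), an overlap test on two inclusive ranges of verse numbers could look like this:

```java
// Hypothetical helper class, not part of JSword or Derby.
// Derby requires the class and method to be public and visible on the
// database classpath, so declare the class public in its own file there.
class PassageFunctions {

    /**
     * Returns 1 if the inclusive verse-number ranges [start1, end1] and
     * [start2, end2] share at least one verse, otherwise 0.
     * An int is returned so the result maps onto a plain SQL INTEGER.
     */
    public static int rangesOverlap(int start1, int end1, int start2, int end2) {
        return (start1 <= end2 && start2 <= end1) ? 1 : 0;
    }

    // Assumed registration statement (illustrative names):
    //   CREATE FUNCTION RANGES_OVERLAP(S1 INTEGER, E1 INTEGER, S2 INTEGER, E2 INTEGER)
    //     RETURNS INTEGER
    //     LANGUAGE JAVA PARAMETER STYLE JAVA NO SQL
    //     EXTERNAL NAME 'PassageFunctions.rangesOverlap'
}
```

Once registered, the function could be called from a WHERE clause, at which point Derby invokes the static method for each candidate row.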
And thanks for the code sample pointers in the limbo area! I'll chew through
those...
Chris
2009/11/9 DM Smith <dmsmith at crosswire.org>
>
> On Nov 8, 2009, at 8:35 AM, Chris Burrell wrote:
>
> Hi all
>>
>> I'm on a project using JSword and we're linking Scripture to various
>> reference materials. For various reasons, we will need to store references
>> to our material in the database and not in a Sword module. Does anyone have
>> any views as to the best way of storing the verse/passage reference?
>>
>
> JSword has the notion of BookDriver being the storage representation of a
> Book. If you haven't found it yet, look in the JSword "limbo project", which
> holds nascent and obsolete code (
> http://crosswire.org/svn/jsword/trunk/jsword-limbo), where you will see
> various starts of BookDrivers. For example, under src/main/java, the package
> o.c.j.book.jdbc has a nascent implementation of a JDBC driver.
>
> It should give you an idea of where to start.
>
>
>
>> I've come up with a few solutions, but not sure which would be the most
>> efficient, speed, space, etc.-wise... It will be used mainly to serve the
>> questions:
>> "Does the bible reference provided by the user match any of our
>> material?", or more simply, "Does passage A overlap with passage B"?
>>
>> Our material will be keyed by passage as opposed to per verse, ie. ranges
>> of verses...
>>
>> 1- Storing the reference as is in the database, say 1Kings.2:10-15;3:2-3
>> would mean we'd have to do lots of string manipulations to figure out
>> overlap of 2 references
>>
>
> In JSword a Verse is a single point in a document, a VerseRange is a
> collection of adjacent Verses and a Passage is a collection of VerseRanges
> that may or may not be adjacent. The degenerate case of a VerseRange is a
> single Verse and the degenerate case of a Passage is a VerseRange (thus a
> Verse).
>
> JSword at this point has a notion of a Verse being a specialized Key and
> there is no notion of a KeyRange or of a Passage being a collection of Keys.
> At some point we need to unify this. Then everything will be a Key.
>
> Regarding Passage, there are several different types, each of which has
> its own performance characteristics. Two to look at are BitwisePassage and
> PassageTally.
>
> RocketPassage is essentially a bit map where each Verse is represented by a
> bit. Obviously there needs to be a two way mapping between the bits and the
> Verse reference. This is the most frequently used Passage in JSword and is
> the best representation for canonically ordered verses.
>
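
To illustrate the bit-map idea only (this is not JSword's RocketPassage; the class and method names below are hypothetical, and verses are assumed to be already numbered in canonical order), a sketch using java.util.BitSet might be:

```java
import java.util.BitSet;

// Hypothetical illustration of a bit-map passage: one bit per verse ordinal.
class BitmapPassageSketch {
    private final BitSet verses = new BitSet();

    /** Marks every verse from startOrdinal to endOrdinal, both inclusive. */
    void addRange(int startOrdinal, int endOrdinal) {
        // BitSet.set(from, to) treats 'to' as exclusive, hence the + 1.
        verses.set(startOrdinal, endOrdinal + 1);
    }

    /** True if the two passages have at least one verse in common. */
    boolean overlaps(BitmapPassageSketch other) {
        return verses.intersects(other.verses);
    }

    /** Number of verses contained in this passage. */
    int verseCount() {
        return verses.cardinality();
    }
}
```

With this layout, "does passage A overlap with passage B" becomes a single intersects call, at the cost of needing the two-way mapping between ordinals and references.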
> PassageTally is a list of weighted Verses. This is used for a best-match,
> prioritized search result.
>
> Internally, SWORD has the notion of linking but this is not yet exposed in
> Key or Verse, but it is present in JSword's drivers for SWORD modules. We'll
> be exposing this before too long. JSword will have routines in Key and Verse
> to determine whether a Verse is linked to another Verse and whether a
> Verse is part of a set of linked verses. In JSword lingo, a link is an
> Alias.
>
> Currently, a SWORD commentary is laid out as a Bible, where an entry may be
> a range of verses. JSword hasn't exposed the notion of aliasing. You can see
> this in BibleDesktop by setting the Option/Preference to show Commentaries
> as Bibles and pick a commentary in the list of Bibles. You'll see that the
> content repeats. That is, if Gen 1:1-3 is shown and this is stored in the
> SWORD module as a linked set, then it is shown 3 times, once for each of Gen
> 1:1, Gen 1:2 and Gen 1:3. It should not do this, but there are no methods in
> Key that will help determine that Gen 1:1 is part of a link set that
> encompasses verses 2 and 3. Then in parallel view we would set the colspan
> attribute to 3 for Gen 1.1 and when showing the verse number we'd show 1-3.
> Then when processing 2 and 3 we'd skip them as they link back to 1.
>
> So to answer your bullet, JSword at a minimum needs (the names and
> signatures can change)
> /**
> * Determine whether this and another reference are aliases of each other.
> * That is, do they have the same getRawText()?
> * @param that the key to compare.
> * @return true if this is an alias of that
> */
> abstract boolean isAliased(Key that);
> /**
> * Get the Key representing a range of adjacent, aliased keys for a single Key.
> * This operation is potentially very expensive.
> * @return the aliased set.
> * @throws IllegalArgumentException if this Key has a cardinality > 1.
> */
> abstract Key getAliases();
>
> Then with these primitives, one can determine overlap.
>
>
>
>> 2- Storing the beginning and end of each section say as a group of sub
>> references { (book, start_chapter, start_verse, end_chapter, end_verse)+ }.
>> In this case a reference would be many of the previous definition. I can
>> see how we could work it out here, but I can see also having to do lots of
>> index range scans on our database
>>
>
> Working backwards, there needs to be a table to hold referenced pieces of
> text. Something like:
> create table BiblicalText (
> location NUMBER NOT NULL,
> text CLOB NOT NULL,
> CONSTRAINT BT_pk PRIMARY KEY (location)
> );
>
> Then there needs to be a way to get a BiblicalText given a Verse.
> Something like:
> create table VerseReference (
> -- Represents the canonical order of the verse in a Bible,
> -- where 1 is the first verse, 2 is the next and so on.
> ordinal NUMBER NOT NULL,
> -- A normalized reference for the Verse, e.g. an osisID or a tuple(t,b,c,v).
> reference STRING NOT NULL,
> location NUMBER NOT NULL,
> CONSTRAINT bt_fk FOREIGN KEY (location) REFERENCES BiblicalText(location)
> );
>
> While JSword has not exposed it yet, you might want to represent testament,
> book and chapter introductions as specially named references. SWORD uses the
> tuple approach with something like:
> (Testament, Book, Chapter, Verse)
> (0,x,x,x) -- OT Introduction
> (1,0,x,x) -- NT Intro
> (1,1,0,x) -- Book intro to Matt
> (1,1,1,0) -- Chapter intro to Matt 1
> Note: the x values should be 0, but are really don't-care values. JSword
> will be following this notion, which will squeeze in more bits into the bit
> maps.
>
> I think versification needs to be held separately from storage and
> essentially is an optimization of it. If you look at
> o.c.j.book.versification.BookInfo you'll see the fundamental way JSword maps
> a number to a verse reference.
> This will be extended to other fixed canons. Essentially one needs to know
> the order and names of books, the number of chapters per book and the number
> of verses per chapter. Not as actuals but as potentials. That is, a Book
> which is a translation of an old, damaged scroll might not have all verses
> in a chapter or even all chapters. The map should represent the undamaged
> scroll.
>
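
As a rough sketch of such a map (this is not JSword's BookInfo; the class name is hypothetical and the verse counts in the usage note are toy data for a two-book canon), a table of potential verse counts per chapter is enough to convert between (book, chapter, verse) and a canonical ordinal:

```java
// Hypothetical versification table: versesPerChapter[b][c] holds the
// potential number of verses in chapter c + 1 of book b + 1.
class VersificationSketch {
    private final int[][] versesPerChapter;

    VersificationSketch(int[][] versesPerChapter) {
        this.versesPerChapter = versesPerChapter;
    }

    /** Maps 1-based (book, chapter, verse) to a 1-based canonical ordinal. */
    int ordinal(int book, int chapter, int verse) {
        int count = 0;
        // All verses in the books before this one.
        for (int b = 0; b < book - 1; b++) {
            for (int size : versesPerChapter[b]) {
                count += size;
            }
        }
        // All verses in the earlier chapters of this book.
        for (int c = 0; c < chapter - 1; c++) {
            count += versesPerChapter[book - 1][c];
        }
        return count + verse;
    }

    /** Maps a 1-based ordinal back to {book, chapter, verse}, all 1-based. */
    int[] decode(int ordinal) {
        if (ordinal < 1) {
            throw new IllegalArgumentException("ordinal must be >= 1: " + ordinal);
        }
        int remaining = ordinal;
        for (int b = 0; b < versesPerChapter.length; b++) {
            for (int c = 0; c < versesPerChapter[b].length; c++) {
                if (remaining <= versesPerChapter[b][c]) {
                    return new int[] {b + 1, c + 1, remaining};
                }
                remaining -= versesPerChapter[b][c];
            }
        }
        throw new IllegalArgumentException("ordinal beyond end of canon: " + ordinal);
    }
}
```

For example, with new VersificationSketch(new int[][] {{31, 25, 24}, {22, 25}}), the first verse of the second book gets ordinal 31 + 25 + 24 + 1 = 81, and decode(81) returns {2, 1, 1}.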
> This argues for an ordered list of Books and the number of chapters in each
> book.
> create table BookNames (
> bookNumber NUMBER NOT NULL,
> bookName STRING NOT NULL,
> chapterCount NUMBER NOT NULL,
> CONSTRAINT bn_pk PRIMARY KEY (bookNumber)
> );
>
> And the number of verses in each chapter:
> create table Chapter (
> bookNumber NUMBER NOT NULL,
> chapterNumber NUMBER NOT NULL,
> chapterSize NUMBER NOT NULL,
> CONSTRAINT ch_bn_fk FOREIGN KEY (bookNumber) REFERENCES
> BookNames(bookNumber)
> );
>
> At this time JSword is stuck on the KJV versification, so going beyond that
> will be a re-implementation of BookInfo. This is planned, but has not
> happened yet. Basically, we'll have a file representation that JSword can read
> in to answer the questions. In essence, we'll use a file as the database.
> The form probably will be a serialization of a populated BookInfo object.
>
>
>
>> 3- Number each verse of the bible from 1 to 30000 or something like that,
>> and then work out and store each verse that is included in the reference in a
>> table somewhere in the database
>> The benefit here seems to be that we would get lots of unique index
>> lookups, but maybe the number of lookups would actually be better off doing
>> range scans... Also, we would have quite a bit more disk space overhead, if
>> we're storing a row for each verse.
>>
>> 4- Number each verse of the bible from 1 to 30000 or something like that,
>> but only store the ranges say verses 30-140 + verses 1500-1512
>>
>> 5- Numbering each verse, but keying the numbers by book, say Exodus verse
>> 750, 751, 752, etc.
>>
>> 6- We have the benefit of working in Java with a Java database, and so
>> could write a Java stored procedure to parse whatever solution comes out of
>> here... It would have to be fast though, given the end product is the web,
>> and the main activity will be searching across scripture references...
>>
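
For option 4, a minimal sketch of the overlap test on stored ranges follows. It assumes each reference is held as a list of inclusive (start, end) verse numbers; the class name and the column names in the comment are hypothetical:

```java
import java.util.List;

// Hypothetical value type: one inclusive range of canonical verse numbers.
class OrdinalRangeSketch {
    final int start;
    final int end;

    OrdinalRangeSketch(int start, int end) {
        if (start > end) {
            throw new IllegalArgumentException("start must not exceed end");
        }
        this.start = start;
        this.end = end;
    }

    /** Two inclusive ranges overlap when each starts no later than the other ends. */
    boolean overlaps(OrdinalRangeSketch other) {
        return start <= other.end && other.start <= end;
    }

    /** True if any range of passage a overlaps any range of passage b. */
    static boolean anyOverlap(List<OrdinalRangeSketch> a, List<OrdinalRangeSketch> b) {
        for (OrdinalRangeSketch x : a) {
            for (OrdinalRangeSketch y : b) {
                if (x.overlaps(y)) {
                    return true;
                }
            }
        }
        return false;
    }

    // The same predicate as a query against assumed columns:
    //   ... WHERE start_ordinal <= ? AND end_ordinal >= ?
    // binding the user's range end first and the user's range start second.
}
```

The nested loop is quadratic in the number of ranges, which should be acceptable for references that contain only a handful of ranges each.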
>
> I'm a bit confused here. Your discussion is that of a relational database,
> but you refer to a Java database. By a Java database, do you mean a
> relational DBMS that is addressed via JDBC or do you mean an OODBMS, such as
> Objectivity or ObjectStore? These days, next to no one uses an OODBMS.
>
>
>
>
>> We'll be doing many searches on our data and updating it only very rarely.
>> I'm a bit at a loss as to best way of doing this... How does JSword cope
>> with this? or does it uniquely do scripture lookups and not scripture
>> overlap? (ie. working out whether two portions of scripture overlap with
>> each other).
>>
>
> Having said all that, I have another idea. It is not well thought out. How
> about creating a JDBC driver for a SWORD module? Still use Lucene to create
> a secondary index for searching. The trick would be to map prepared
> statements to JSword calls.
>
> SWORD is a static database of all kinds of books. It is highly optimized
> for lookups. It consists of index and data files per testament.
>
> The index can be thought of as an array indexed by verse number (this is the
> same as VerseReference.ordinal in the above table), holding a pointer
> (offset and size) into the data file. If two entries in the index point to
> the same location in the data file, then they are aliases/links.
>
> Since updates are infrequent, this works well. When a verse is updated, it
> is appended to the data file and all the references to it are updated to
> point to the new location. Over time, the data file will grow and should be
> rebuilt from scratch when it gets "too big".
>
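
The index-plus-data-file layout described above can be modelled in memory roughly as follows. This is a simplified sketch, not the SWORD file format: the class and method names are hypothetical, a StringBuilder stands in for the data file, and sizes are counted in characters rather than bytes.

```java
// Hypothetical in-memory model of an index that points into an append-only data area.
class AppendOnlyStoreSketch {
    private final StringBuilder data = new StringBuilder(); // stands in for the data file
    private final int[] offset; // index: where each verse's text starts
    private final int[] size;   // index: length of that text; 0 means no entry

    AppendOnlyStoreSketch(int verseCount) {
        offset = new int[verseCount + 1]; // slot 0 unused, verses are 1-based
        size = new int[verseCount + 1];
    }

    /** Stores one text for the inclusive ordinal range, making those verses aliases. */
    void put(int firstOrdinal, int lastOrdinal, String text) {
        int at = data.length();
        data.append(text);
        for (int i = firstOrdinal; i <= lastOrdinal; i++) {
            offset[i] = at;
            size[i] = text.length();
        }
    }

    String get(int ordinal) {
        return data.substring(offset[ordinal], offset[ordinal] + size[ordinal]);
    }

    /** Two verses are aliases when their index entries point at the same data. */
    boolean isAliased(int a, int b) {
        return size[a] > 0 && offset[a] == offset[b] && size[a] == size[b];
    }

    /** Appends the new text and repoints every index entry that shared the old one. */
    void update(int ordinal, String text) {
        if (size[ordinal] == 0) { // nothing stored yet, so there are no aliases to move
            put(ordinal, ordinal, text);
            return;
        }
        int oldOffset = offset[ordinal];
        int oldSize = size[ordinal];
        int at = data.length();
        data.append(text);
        for (int i = 1; i < offset.length; i++) {
            if (size[i] > 0 && offset[i] == oldOffset && size[i] == oldSize) {
                offset[i] = at;
                size[i] = text.length();
            }
        }
    }

    /** Total characters held, including superseded text that was never reclaimed. */
    int dataLength() {
        return data.length();
    }
}
```

After an update the old text is still present in the data area, which is why the store grows over time and eventually needs rebuilding.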
>
>
>
>> Any ideas anyone?
>> Chris
>>
>> _______________________________________________
>> jsword-devel mailing list
>> jsword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/jsword-devel
>>
>
>