[jsword-devel] size of OSIS chunks

DM Smith dmsmith555 at yahoo.com
Fri Jun 17 13:51:55 MST 2005

On http://www.crosswire.org/sword/develop/swordmodule/ it says:
"In SWORD, for modules encoded with ThML and OSIS, each verse, 
dictionary entry, and book division needs to be well-formed XML or it 
will result in display problems in some frontends."

We take this statement as a requirement on all module makers. If a 
module is not marked up in this fashion, then it is in error.

<aside> Unfortunately, the KJV which is the most downloaded module has 
very bad XML for the New Testament. Even where it is well-formed, it is 
full of bad OSIS. This makes BibleDesktop/JSword look bad.</aside>

If a module does not meet this requirement, then JSword will have 
problems. The reason is simple: JSword is based on individual verses 
that are parsed from a string into DOM for ThML and OSIS. Each fragment 
is then added into the document that the user requested. We use XML 
parsers to do the parsing, and they are required to fail on input that 
is not well-formed. This means that each verse must be well-formed. When 
it is not, we start stripping stuff out of the input string until it 
parses. The result is very messy.
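
To make the failure mode concrete, here is a rough sketch of parsing a 
single verse with the JDK's built-in parser. It is an illustration only, 
not JSword's actual code: the class name VerseFragment and the synthetic 
<fragment> wrapper element are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of JSword.
class VerseFragment {
    /** Parses one verse's markup into DOM, or returns null if it is not well-formed. */
    static Document parse(String verseMarkup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler still throws on fatal errors but does not print them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            // A verse may hold several top-level nodes, so wrap it in a synthetic root
            // (an assumption of this sketch) to make it a single document.
            String wrapped = "<fragment>" + verseMarkup + "</fragment>";
            return builder.parse(new InputSource(new StringReader(wrapped)));
        } catch (SAXException e) {
            return null; // not well-formed: the parser is required to reject it
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A verse such as "In the <w>beginning</w>" parses, while one that carries 
only the tail of a paragraph, such as "text</p><p>more", comes back as 
null.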

I have posted questions on how verses should be marked up and how the 
context of a verse should be indicated, where the context of a verse is 
a well-formed container that contains the verse.

The verse-at-a-time characteristic is at the heart of how Search works 
in Bible Desktop. The user can either search for verses or request 
verses directly and see only those verses.

Because of this fundamental behavior, we have also used it to assemble 
whole ranges of verses (like chapters). That is, we don't get a chapter 
as a string and then convert it into DOM. We convert a verse at a time 
to DOM and append it to the DOM document that we are creating.
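
As a sketch of that assembly step, again using only the JDK's DOM API 
rather than JSword's own classes: each verse string is parsed separately 
and its root element is imported into the document being built. The 
class name ChapterAssembler, the <div> container and the <verse> 
elements in the usage note are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JSword.
class ChapterAssembler {
    /** Builds one document from verses that are each a well-formed, single-rooted string. */
    static Document assemble(List<String> verses) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element container = doc.createElement("div"); // assumed container element
            doc.appendChild(container);
            for (String verse : verses) {
                // Parse one verse at a time; a verse that is not well-formed throws here.
                Document fragment = builder.parse(new InputSource(new StringReader(verse)));
                // Nodes belong to the document that created them, so import before appending.
                container.appendChild(doc.importNode(fragment.getDocumentElement(), true));
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalArgumentException("a verse could not be parsed", e);
        }
    }
}
```

Passing two strings of the form <verse osisID="...">...</verse> yields a 
<div> with two <verse> children. A single verse that is not well-formed 
makes the whole call fail, which is exactly the problem described above.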

Chris has recommended for OSIS that chapters and verses be milestoned 
elements and that document structure (books, sections, paragraphs, line 
groups, ...) be the dominant structure. I think this makes sense, though 
it makes things harder for an XSLT writer.

When an element ends in a verse but does not begin in it, or starts in 
a verse but does not end in it, we need to know what is missing and 
where it can be found. Otherwise we have to try to find it ourselves. I 
outlined an algorithm in a previous note that figures out whether there 
is a missing begin tag, a missing end tag, or both, and then fetches 
adjacent verses as needed until either a well-formed chunk is found or 
some threshold is reached (that is, it becomes too painful to continue). 
When we reach the threshold, we need to fall back to stripping text.
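
The earlier note is not reproduced here, so the following is only a 
simplified sketch of the widening idea, not that algorithm. It assumes 
the verses are available as an ordered list of strings, and it judges 
balance by counting start and end tags rather than matching names, so a 
real implementation would still confirm the result with the parser. The 
names VerseWidener, imbalance and widen are made up for the example.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JSword.
class VerseWidener {
    // group 1 is "/" for an end tag, group 2 is "/" for an empty (milestone) tag.
    // Assumes attribute values contain no '<' or '>'.
    private static final Pattern TAG = Pattern.compile("<(/?)[A-Za-z][^<>]*?(/?)>");

    /** Returns {end tags with no start tag, start tags with no end tag}. */
    static int[] imbalance(String xml) {
        int orphanEnds = 0;
        int open = 0;
        Matcher m = TAG.matcher(xml);
        while (m.find()) {
            if (!m.group(2).isEmpty()) continue;   // empty element, nothing to balance
            if (m.group(1).isEmpty()) open++;      // start tag
            else if (open > 0) open--;             // end tag closing something in the chunk
            else orphanEnds++;                     // end tag whose start is in an earlier verse
        }
        return new int[] {orphanEnds, open};
    }

    /** Grows the chunk around verses[index]; returns null once maxExtra verses are not enough. */
    static String widen(List<String> verses, int index, int maxExtra) {
        int first = index;
        int last = index;
        StringBuilder chunk = new StringBuilder(verses.get(index));
        for (int extra = 0; ; extra++) {
            int[] gap = imbalance(chunk.toString());
            if (gap[0] == 0 && gap[1] == 0) return chunk.toString();
            if (extra == maxExtra) return null;    // threshold reached: caller falls back to stripping
            if (gap[0] > 0 && first > 0) {
                chunk.insert(0, verses.get(--first));   // missing begin tag: fetch the previous verse
            } else if (gap[1] > 0 && last < verses.size() - 1) {
                chunk.append(verses.get(++last));       // missing end tag: fetch the next verse
            } else {
                return null;                            // ran off the end of the module
            }
        }
    }
}
```

With the three verses "<p>In the beginning", "and the earth" and "and 
God said</p>", widening around the last one pulls in both earlier verses 
before the chunk balances. With a threshold of one extra verse it gives 
up and returns null instead.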

If we knew that a verse had an unmatched </div>, then we could slap a 
<div> on the front and be able to parse it. And if it had a <p> but no 
</p>, then we could slap a </p> on the end. While the result would not 
be optimal, it would be workable.

The other need for verses to be well-formed is in creating a search 
index. For JSword we want to index the verse text independently from 
other text that is in the verse (e.g. notes, titles, Strong's numbers 
[which are attributes], ...). We may also want to index these 
separately. If the verse is not well-formed, then we cannot know for 
certain what is verse text and what is not.
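
To illustrate why the parse matters for indexing, here is a sketch that 
pulls out only the verse text once a verse has been parsed. It assumes 
OSIS-style <note> and <title> elements hold the non-verse text and that 
Strong's numbers sit in attributes (so they never appear as text nodes). 
VerseTextExtractor and the <fragment> wrapper are names invented for the 
example, not JSword's indexing code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JSword.
class VerseTextExtractor {
    /** Returns the verse text with the content of note and title elements left out. */
    static String verseText(String verseMarkup) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(
                            "<fragment>" + verseMarkup + "</fragment>")))
                    .getDocumentElement();
            StringBuilder text = new StringBuilder();
            collect(root, text);
            return text.toString();
        } catch (Exception e) {
            // Without a successful parse we cannot tell verse text from other text.
            throw new IllegalArgumentException("verse is not well-formed", e);
        }
    }

    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        String name = node.getNodeName();
        if (name.equals("note") || name.equals("title")) {
            return; // non-verse text: skip here, index separately if wanted
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }
}
```

The same walk could collect the skipped notes and titles, or the 
attribute values, into their own buffers if we decide to index those 
separately.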

If you look at the Sword API for creating a Lucene index, you will find 
that it has the same need.

Daniel Glassey wrote:

>Just while we are thinking about storage - how do you handle the fact
>that what you get from the module isn't necessarily wellformed XML.
>How big a chunk do you take at a time from the modules and how do you
>handle it?
>If you've noticed my post to sword-devel about multiple views it's
>that that I'm thinking about here. I'm thinking that the core chunk
>that gets extracted should be the 'section' as I would expect (and
>hope) that that should be wellformed. Just as long as poetry and
>suchlike don't straddle sections, or at least that they close
>themselves first.
>jsword-devel mailing list
>jsword-devel at crosswire.org
