[jsword-devel] size of OSIS chunks
dmsmith555 at yahoo.com
Fri Jun 17 13:51:55 MST 2005
On http://www.crosswire.org/sword/develop/swordmodule/ it says:
"In SWORD, for modules encoded with ThML and OSIS, each verse,
dictionary entry, and book division needs to be well-formed XML or it
will result in display problems in some frontends."
We take this statement as a requirement on all module makers. If a
module is not marked up in this fashion, then it is in error.
<aside> Unfortunately, the KJV, which is the most downloaded module, has
very bad XML for the New Testament. Even where it is well-formed, it is
full of bad OSIS. This makes BibleDesktop/JSword look bad.</aside>
If a module does not meet this requirement, then JSword will have
problems. The reason is simple:
JSword is based on individual verses that are parsed from a string into
DOM for ThML and OSIS. This fragment is then added into the document
that the user requested. We use XML parsers to do the parsing. As such
they are required to fail on input that is not well-formed. This means
that each verse must be well-formed. When it is not well-formed, we
start stripping stuff out of the input string until it parses. The
result is very messy.
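As a rough illustration of that parse-then-strip behavior (a minimal sketch in Python rather than JSword's Java, and not JSword's actual code; the function name and the <verse> wrapper are made up for the example):

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape


def parse_verse(raw):
    """Parse one verse's markup; fall back to bare text if it is not well-formed."""
    try:
        # Wrap in a dummy root so that text plus sibling elements is legal XML.
        return ET.fromstring("<verse>" + raw + "</verse>")
    except ET.ParseError:
        # Crude fallback (an assumption, not JSword's rule): drop anything
        # that looks like a tag and keep only the text between the tags.
        text = re.sub(r"<[^>]*>", "", raw)
        verse = ET.Element("verse")
        verse.text = unescape(text)
        return verse
```

A well-formed verse keeps its markup; a verse such as `</p><p>And God said` loses all of it, which is why the result is messy.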
I have posted questions on how verses should be marked and how the
context of a verse should be indicated, where the context of a verse is
a well-formed container that contains the verse.
The verse-at-a-time characteristic is at the heart of how Search works
in Bible Desktop. The user can either search for verses or request
verses directly and see only those verses.
Because of this fundamental behavior, we have also used it to assemble
whole ranges of verses (like chapters). That is, we don't get a chapter
as a string and then convert it into DOM. We convert a verse at a time
to DOM and append it to the DOM document that we are creating.
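The verse-at-a-time assembly can be sketched roughly like this (Python for brevity, not JSword's Java; the `build_passage` name, the <verse> wrapper and the `type="passage"` container are assumptions for illustration only):

```python
import xml.etree.ElementTree as ET


def build_passage(verses):
    """Build one document from (osis_id, raw_markup) pairs, one verse at a time.

    Each raw_markup must already be well-formed; ET.fromstring raises
    ET.ParseError otherwise.
    """
    root = ET.Element("div", {"type": "passage"})
    for osis_id, raw in verses:
        # Each verse is parsed on its own, never as part of a chapter string.
        verse = ET.fromstring("<verse>" + raw + "</verse>")
        verse.set("osisID", osis_id)
        root.append(verse)
    return root
```

Because each verse is parsed in isolation, a single verse that is not well-formed breaks only that step, but it also cannot borrow a missing tag from its neighbors.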
Chris has recommended for OSIS that chapters and verses be milestoned
elements and that document structure (books, sections, paragraphs, line
groups, ...) be the dominant structure. I think this makes sense, though
it makes things harder for an XSLT writer.
When an element ends in a verse but does not begin in a verse, or when
an element starts in a verse but does not end in a verse then we need to
know what is missing and where it can be found. Or we try to find it
ourselves. I outlined an algorithm in a previous note that figures out
whether there is a missing begin or end tag (or both) and then fetches
adjacent verses as needed until either a well-formed chunk is found or
some threshold is reached (that is, it is too painful to continue). When
we reach the threshold, we need to fall back to stripping text.
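A simplified sketch of that idea follows. It is not the algorithm from the earlier note: instead of working out which tag is missing, it just tries ever larger windows of adjacent verses around the requested one. The names and the `max_extra` threshold are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """True if the markup parses once wrapped in a dummy root element."""
    try:
        ET.fromstring("<chunk>" + markup + "</chunk>")
        return True
    except ET.ParseError:
        return False


def widen_until_well_formed(verses, index, max_extra=5):
    """Return (start, end), inclusive, of the smallest well-formed window
    of adjacent verses containing verses[index], or None if the threshold
    of max_extra additional verses is reached first."""
    for extra in range(max_extra + 1):
        # Try every window of extra + 1 verses that still contains index.
        for start in range(index - extra, index + 1):
            end = start + extra
            if start < 0 or end >= len(verses):
                continue
            if is_well_formed("".join(verses[start:end + 1])):
                return start, end
    return None  # caller falls back to stripping markup
```

For example, with verses `["<p>a", "b", "c</p>", "<p>d</p>"]`, asking for index 0 returns the window (0, 2), and with `max_extra=1` it returns None.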
If we knew that a verse had an unmatched </div> then we could slap a
<div> on the front and be able to parse it well. And if it had a <p> but
not </p> then we could slap a </p> on the end. While the result would
not be optimal, it would be workable.
The other need for verses to be well-formed is in creating a search
index. For JSword we want to index the verse text independently from
other text that is in the verse (e.g. notes, titles, Strong's numbers
[which are attributes], ...). We may also want to index these separately.
If the verse is not well-formed then we cannot know for certain what is
verse text and what is not.
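To make the indexing need concrete, here is a sketch (Python, not JSword code) of splitting one well-formed verse into separately indexable fields. The element names `note` and `title`, and the assumption that Strong's numbers sit in a `lemma` attribute, are illustrative; the field names are invented.

```python
import xml.etree.ElementTree as ET

NON_VERSE = {"note", "title"}  # assumed element names kept out of the verse text


def split_for_index(raw):
    """Return verse text, notes, titles and Strong's numbers as separate fields."""
    root = ET.fromstring("<verse>" + raw + "</verse>")  # fails if not well-formed
    parts = {"text": [], "note": [], "title": []}
    strongs = []

    def walk(element, bucket):
        if element.tag in NON_VERSE:
            bucket = element.tag  # everything inside goes to this field
        lemma = element.get("lemma")
        if lemma:
            strongs.extend(lemma.split())
        if element.text:
            parts[bucket].append(element.text)
        for child in element:
            walk(child, bucket)
            if child.tail:
                # Text after a child belongs to the parent's field.
                parts[bucket].append(child.tail)

    walk(root, "text")
    fields = {name: " ".join("".join(chunks).split()) for name, chunks in parts.items()}
    fields["strongs"] = strongs
    return fields
```

The split depends entirely on knowing where each element begins and ends, which is exactly what a verse that is not well-formed cannot tell us.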
If you look at the Sword API for creating a Lucene index, you will find
that it has the same need.
Daniel Glassey wrote:
>Just while we are thinking about storage - how do you handle the fact
>that what you get from the module isn't necessarily wellformed XML.
>How big a chunk do you take at a time from the modules and how do you
>If you've noticed my post to sword-devel about multiple views it's
>that that I'm thinking about here. I'm thinking that the core chunk
>that gets extracted should be the 'section' as I would expect (and
>hope) that that should be wellformed. Just as long as poetry and
>suchlike don't straddle sections, or at least that they close