[osis-editors] paragraphs vs. verses

Fri Jun 2 11:05:57 MST 2006

Hello,

I would like to voice a major concern I have about OSIS.

I've been reading various code and comments about OSIS and other
formats for Scripture as electronic media.  One point of controversy
seems to involve what is perceived as a difficulty, namely, that in
Scripture, paragraphs and verses may overlap.

I have a Ph.D. in Computer Science, and have been around a lot longer
than has XML.  I worked as a graduate student on a project that used
SGML as a common format for document conversion, before even HTML
existed.  So though I'm new to OSIS, I'm not at all new to the general
subject, or to the problem under consideration here.

To get to the point, OSIS makes a major mistake in my view.  To describe
it, let me first suggest some different terminology.  OSIS talks about
"milestones" and "milestoneable" elements.  "Milestoneable" is not even
a real word, which should suggest that the problem has not been well
enough understood: there is more meaningful terminology for the idea.
XML itself talks about "empty" or "minimal" vs. "non-empty" elements.
I will instead use the terms "marker" and "container", since they are
closer to the OSIS issue.

Surely, there is a basic semantic question about whether paragraph
markup simply marks, or contains, paragraph text, but it turns out not
to be an important question: the simple fact is that they are
interchangeable semantic views for all practical purposes.  Why?
Because paragraphs are _anonymous_.  Paragraphs can be formed by an
author either because they express a coherent thought, or because the
thought or thoughts a paragraph expresses differ enough from what
comes before or follows to separate the two with a paragraph break.
But this distinction doesn't matter, and interpretations of this sort
are not forced by the author, but left to the reader, in any event.

I.e., in technical terms, paragraph markup can be either "marker" or
"container", without loss of semantic content.

Verses, on the other hand, really are not semantic at all, but they
are never anonymous - verse identification exists in the syntactic
realm specifically for the purpose of identifying text.  Because of
this, it follows (indirectly) that verse identification as markup is
always de facto "container" markup.

To the extent that there is overlap, then, paragraph markup should
only "mark" so that verse identification can always "contain".

OSIS takes an opposing view, i.e., that verse identification may
mark but paragraph markup can never mark, but only contain.  Among
other things, this complicates rational key-oriented storage of
documents (e.g., databases) based on OSIS, and thus promises significant
wasted effort to accommodate OSIS.

Paragraphs are never "identified;" i.e., they have no "identity."  They
thus cannot _ever_ be the basis for a key-oriented storage scheme,
because they neither capture nor convey key information.  But by
weakening verse identification in favor of asserting that paragraphs
"contain," the prospect of using verse identification as key information
is greatly complicated by OSIS.

It would suffice both semantically and syntactically for paragraphs to
be noted solely by _end markers_, especially for Scripture.  This is a
simple but fully general scheme which loses absolutely no information.
Semantically, paragraphs are contextual, and the beginning of a
paragraph is inferred from context, in the absence of preceding text
(as is the end of a paragraph, in the absence of succeeding text).
Using only end markers for paragraphs would thus not conflict at all
with verse identification and containing verse markup, which necessarily
includes start tags as markup.
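
A minimal sketch of this arrangement in Python follows.  The element
names are assumptions made for illustration, not the OSIS schema: a
hypothetical <verse id="..."> element contains its text, and an empty
<br/> marks the end of a paragraph.  The sample text is placeholder
text.  Each verse becomes a keyed record, and a paragraph end is kept
as an inline marker inside the record, even when it falls mid-verse.

```python
import xml.etree.ElementTree as ET

PARA_END = "\u00b6"  # in-text stand-in for a paragraph end marker

# Hypothetical markup for illustration only; this is not the OSIS schema.
SAMPLE = (
    "<chapter>"
    '<verse id="Gen.1.1">First sentence.</verse>'
    '<verse id="Gen.1.2">Second sentence.<br/>Third sentence.</verse>'
    '<verse id="Gen.1.3">Fourth sentence.<br/></verse>'
    "</chapter>"
)

def load_verses(xml_text):
    """Return a dict keyed by verse id; paragraph ends stay inline."""
    table = {}
    for verse in ET.fromstring(xml_text).iter("verse"):
        parts = [verse.text or ""]
        for child in verse:
            if child.tag == "br":
                parts.append(PARA_END)
            # Text following the marker belongs to the same verse.
            parts.append(child.tail or "")
        table[verse.get("id")] = "".join(parts)
    return table

verses = load_verses(SAMPLE)
```

Because the paragraph marker is empty, it never has to cross a verse
boundary, so every verse remains a well-formed container and a usable
key.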

As XML, this would suggest that OSIS should only use <p/> tags to end
paragraphs, or even simpler, don't use <p>, </p>, or even <p/> at all,
but use <br/> instead.  I actually came across this issue by observing
Sword's failure to convert a Scripture text that was marked by <br>
tags to OSIS, and I was astounded, frankly, by this failure.

Semantically, "p" and "br" are largely redundant, unless one attaches
respective container vs. marker significance to them.  But such a
distinction is arbitrary and cannot be enforced where both "p" and "br"
are allowed, and <p/> and <br/> can be used interchangeably where
"marker" semantics are applicable.

Semantically, moreover, any number of contiguous <br/> elements is
equivalent to a single <br/>, so there should be no problem, e.g., with
using <br/> on either side of a containing verse element.
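
The equivalence of repeated paragraph markers can be sketched as a
small normalization step.  This assumes the text has already been
split into a flat list of tokens, with the string "<br/>" standing in
for the marker; the function name and token form are hypothetical.

```python
def collapse_markers(tokens, marker="<br/>"):
    """Reduce each run of adjacent markers to a single marker."""
    out = []
    for tok in tokens:
        if tok == marker and out and out[-1] == marker:
            continue  # a repeated marker adds no information
        out.append(tok)
    return out

collapsed = collapse_markers(["a", "<br/>", "<br/>", "<br/>", "b", "<br/>"])
```

A marker written just before a verse ends and another written just
after the next verse starts therefore reduce to one paragraph break.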

If an identification scheme is ever applied to paragraphs (as might be
done arbitrarily, in fact), then this becomes a different scheme than
Scripture versification for identification, and the two schemes will
certainly conflict.  But such conflict cannot be resolved by treating
one or the other as if it is NOT a container scheme with identification.
A better approach would be, say, word-level identification as well,
i.e., identification at a finer level of "resolution" that can be used
by more than one coarser identification scheme.  But this is
problematic for Scripture because different
languages with different words are involved.  That being the case, it
would be so much simpler to take advantage of the opportunity to treat
paragraph markup as markers instead of containers, since no conflict
exists in this view because paragraphs in Scripture are entirely
anonymous, conveying no identifying information at all.  (Scriptural
annotation is a different subject, but doesn't conflict with verses
as containers and paragraphs as markers, generally, since in most
cases, marking semantics apply or decisions can be made arbitrarily
about where to place annotations.)

Finally, I would like to comment further on terminology.  I have seen
XML described as "hierarchical" in the discussion about OSIS and other
XML formats for Scripture.  XML is indeed hierarchical in a strict
sense, but more generally, it is _context-free_.  The class of
context-free languages and grammars includes _regular_ languages and
grammars, and regular languages and grammars are notably NOT
hierarchical, and
are generally simpler as well.  This discussion has treated XML as if
it is incapable of dealing with regular languages and grammars, which
of course is not at all the case, and the distinction I make between
marker and container elements is not applicable to regular languages
and grammars, which are distinguished from context-free exactly by
their absence of container elements and use solely of marker elements.
That is, marker semantics do not necessarily make a CF language or
grammar more complex (context-sensitive or otherwise harder); more
likely they make it regular and thus less complex (though possibly more
tedious to process, but not more difficult conceptually), if the
analysis has been appropriate.  In the OSIS case, it hasn't yet been
appropriate.
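
As a rough illustration of the "regular" case, a stream that uses only
markers can be scanned with one regular expression and no stack.  The
notation here is invented for the sketch and is not OSIS: "\v ID"
marks where a verse begins and "\p" marks where a paragraph ends.

```python
import re

# Hypothetical flat notation: "\v ID" starts a verse, "\p" ends a paragraph.
TOKEN = re.compile(r"\\v\s+(\S+)|\\p")

def scan(stream):
    """Turn a marker-only stream into a flat list of events."""
    events = []
    pos = 0
    for m in TOKEN.finditer(stream):
        text = stream[pos:m.start()].strip()
        if text:
            events.append(("text", text))
        if m.group(1):
            events.append(("verse", m.group(1)))
        else:
            events.append(("para_end", None))
        pos = m.end()
    tail = stream[pos:].strip()
    if tail:
        events.append(("text", tail))
    return events

events = scan(r"\v Gen.1.1 Alpha. \p \v Gen.1.2 Beta.")
```

The scanner never needs to remember which element is open, which is
what makes the processing tedious at worst, not conceptually harder.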

I would normally avoid such problems in the interest of avoiding
conflicts and controversies, but my interests are hopefully the same
as those of the OSIS community.  I would only note that if such a
small issue as this is a stumbling block, it bodes not well where more
relevant issues of interpretation and translation of the actual text
of the Scripture are concerned.  Prayerfully, the OSIS community will
move forward as the Holy Spirit provides opportunity to do so, and my
comments will do more good than harm in that regard.

-John