[osis-editors] paragraphs vs. verses
Patrick Durusau
patrick at durusau.net
Wed Jun 7 13:16:57 MST 2006
John,
Appreciate your email but note that major architectural changes are
unlikely for the current version of OSIS.
Realize that OSIS was *not* drafted entirely in accordance with my
views or those of anyone else who was part of the effort. Had I drafted it alone
any number of issues would have had different resolutions and it would
be *weaker* as a result.
Why? The criterion for success was not technical excellence, although we
tried to do the best we could, but to reflect a common understanding
between some fairly heavy hitter markup folks and the Bible translation
community.
Any community based standard is going to have compromises that may or
may not reflect the "best" theoretical choices that could have been made.
We are going to try to produce a minor release later this month that has
corrections to typos and such in the schema and user's manual. We hope to
manage one major release at the end of the year. If you have suggestions
for changes, please forward them to this list.
My personal goal is that, by the time we reach OSIS 4.0, all users
will have visual interfaces that "display" OSIS markup, or USFM (any
variety), as well as the finished product, as well as offering a WYSIWYG
editing interface.
For too long, we in the markup community have pushed knowledge of markup
when what users want is a finished product that reflects their choices.
In efforts such as Bible translation, I see no reason for the products
of such work to be delayed while translators learn one markup system or
another. Granted I want OSIS to be the underlying archival format
because it will preserve translations for reuse and repurposing, but
those are goals that should not impinge on getting translations out to
those waiting for them.
Hope you are having a great day!
Patrick
John Boyd wrote:
>Hello,
>
>I would like to voice a major concern I have about OSIS.
>
>I've been reading various code and comments about OSIS and other
>formats for Scripture as electronic media. One point of controversy
>seems to involve what is perceived as a difficulty, namely, that in
>Scripture, paragraphs and verses may overlap.
>
>I have a Ph.D. in Computer Science, and have been around a lot longer
>than has XML. I worked as a graduate student on a project that used
>SGML as a common format for document conversion, before even HTML
>existed. So though I'm new to OSIS, I'm not at all new to the general
>subject, or to the problem under consideration here.
>
>To get to the point, OSIS makes a major mistake in my view. To describe
>it, let me first suggest some different terminology. OSIS talks about
>"milestones" and "milestoneable" elements. "Milestoneable" is not even
>a real word, which should suggest that the problem has not been well
>enough understood: there is more meaningful terminology for the idea.
>XML itself talks about "empty" or "minimal" vs. "non-empty" elements.
>I will instead use the terms "marker" and "container", since they are
>closer to the OSIS issue.
>
>Surely, there is a basic semantic question about whether paragraph
>markup simply marks, or contains, paragraph text, but it turns out not
>to be an important question: the simple fact is that they are
>interchangeable semantic views for all practical purposes. Why?
>Because paragraphs are _anonymous_. Paragraphs can be formed by an
>author either because they express a coherent thought, or because the
>thought or thoughts a paragraph expresses differ enough from what
>comes before or follows to separate the two with a paragraph break.
>But this distinction doesn't matter, and interpretations of this sort
>are not forced by the author, but left to the reader, in any event.
>
>I.e., in technical terms, paragraph markup can be either "marker" or
>"container", without loss of semantic content.
>
>Verses, on the other hand, really are not semantic at all, but they
>are never anonymous - verse identification exists in the syntactic
>realm specifically for the purpose of identifying text. Because of
>this, it follows (indirectly) that verse identification as markup is
>always de facto "container" markup.
>
>To the extent that there is overlap, then, paragraph markup should
>only "mark" so that verse identification can always "contain".
>
>OSIS takes an opposing view, i.e., that verse identification may
>mark but paragraph markup can never mark, but only contain. Among
>other things, this complicates rational key-oriented storage of
>documents (e.g., databases) based on OSIS, and thus promises significant
>wasted effort to accommodate OSIS.
>
>Paragraphs are never "identified;" i.e., they have no "identity." They
>thus cannot _ever_ be the basis for a key-oriented storage scheme,
>because they neither capture nor convey key information. But by
>weakening verse identification in favor of asserting that paragraphs
>"contain," the prospect of using verse identification as key information
>is greatly complicated by OSIS.
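
A minimal sketch of the key-oriented storage argued for above, under stated assumptions: the fragment is a simplified, namespace-free stand-in rather than schema-valid OSIS, verses are containers carrying an osisID-style key, and <p/> is an empty paragraph marker. The sample text, PARA_MARK, verse_text and index_by_verse are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: verses contain, paragraphs only mark.
SAMPLE = (
    '<chapter osisID="Book.1">'
    '<verse osisID="Book.1.1">First verse.</verse>'
    '<verse osisID="Book.1.2">A paragraph ends here.<p/>A new one starts mid-verse.</verse>'
    '<p/>'
    '<verse osisID="Book.1.3">Third verse.</verse>'
    '</chapter>'
)

PARA_MARK = "\u00b6"  # assumed in-text stand-in for a <p/> marker


def verse_text(verse):
    """Return a verse's text, keeping any <p/> inside it as PARA_MARK."""
    parts = [verse.text or ""]
    for child in verse:
        if child.tag == "p":
            parts.append(PARA_MARK)
        parts.append(child.tail or "")
    return "".join(parts)


def index_by_verse(xml_text):
    """Build a key -> text table using the verse identifier as the key."""
    root = ET.fromstring(xml_text)
    return {v.get("osisID"): verse_text(v) for v in root.iter("verse")}


table = index_by_verse(SAMPLE)
```

Because every verse is a well-formed container, each key maps to exactly one element. The <p/> that sits between verses belongs to the chapter, so this sketch does not store it under any verse key; a real store would need some convention for it.
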
>
>It would suffice both semantically and syntactically for paragraphs to
>be noted solely by _end markers_, especially for Scripture. This is a
>simple but fully general scheme which loses absolutely no information.
>Semantically, paragraphs are contextual, and the beginning of a
>paragraph is inferred from context, in the absence of preceding text
>(as is the end of a paragraph, in the absence of succeeding text).
>Using only end markers for paragraphs would thus not conflict at all
>with verse identification and containing verse markup, which necessarily
>includes start tags as markup.
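
The claim that end markers alone lose no information can be checked with a small round trip. This is only a sketch: it assumes paragraphs are non-empty strings that never contain the marker text itself, and the names END, to_end_markers and from_end_markers are invented here.

```python
END = "<p/>"  # paragraph end marker, as proposed in the message


def to_end_markers(paragraphs):
    """Serialize a list of paragraphs as text with end markers only."""
    return "".join(p + END for p in paragraphs)


def from_end_markers(stream):
    """Recover paragraphs; each start is inferred from the previous end."""
    parts = stream.split(END)
    # A trailing marker leaves an empty tail, which is not a paragraph.
    if parts[-1] == "":
        parts.pop()
    return parts


original = ["First paragraph.", "Second paragraph."]
recovered = from_end_markers(to_end_markers(original))
```

A final paragraph with no end marker is still recovered, which matches the remark that the end of a paragraph can be inferred from the absence of succeeding text.
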
>
>As XML, this would suggest that OSIS should only use <p/> tags to end
>paragraphs, or, even simpler, should not use <p>, </p>, or even <p/> at
>all, but use <br/> instead. I actually came across this issue by observing
>Sword's failure to convert a Scripture text that was marked by <br/>
>tags to OSIS, and I was astounded, frankly, by this failure.
>
>Semantically, "p" and "br" are largely redundant, unless one attaches
>respective container vs. marker significance to them. But a distinction
>is both arbitrary and cannot be enforced where both "p" and "br" are
>allowed, and <p/> and <br/> can be used interchangeably where "marker"
>semantics are applicable.
>
>Semantically, moreover, any number of contiguous <p/> elements is
>equivalent to a single <p/>, so there should be no problem, e.g., with
>using <p/> on either side of a containing verse element.
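
One way to read the equivalence of contiguous <p/> elements, sketched under assumptions: the markup is flattened to a stream of text runs and paragraph markers, whitespace-only runs are ignored, and repeated markers are collapsed. The fragment is a simplified, namespace-free stand-in, and PARA, flatten and collapse are invented names.

```python
import xml.etree.ElementTree as ET

PARA = "\u00b6"  # assumed token standing for one <p/> marker

# Hypothetical sample: a <p/> inside the verse and two more after it.
SAMPLE = (
    '<chapter>'
    '<verse osisID="Book.1.1">One.<p/></verse><p/><p/>'
    '<verse osisID="Book.1.2">Two.</verse>'
    '</chapter>'
)


def flatten(elem):
    """Yield text runs and PARA tokens in document order."""
    if elem.tag == "p":
        yield PARA
        return
    if elem.text:
        yield elem.text
    for child in elem:
        yield from flatten(child)
        if child.tail:
            yield child.tail


def collapse(tokens):
    """Drop whitespace-only runs and merge adjacent PARA tokens."""
    out = []
    for tok in tokens:
        if tok != PARA and not tok.strip():
            continue
        if tok == PARA and out and out[-1] == PARA:
            continue
        out.append(tok)
    return out


result = collapse(flatten(ET.fromstring(SAMPLE)))
```

Here three markers, one inside the verse container and two outside it, reduce to a single paragraph break between the two verses.
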
>
>If an identification scheme is ever applied to paragraphs (as might be
>done arbitrarily, in fact), then this becomes a different scheme than
>Scripture versification for identification, and the two schemes will
>certainly conflict. But such conflict cannot be resolved by treating
>one or the other as if it is NOT a container scheme with identification.
>A better approach would be, say, word-level identification as well,
>i.e., identification at a finer level of "resolution" that can be
>used by more than one coarser identification
>scheme. But this is problematic for Scripture because different
>languages with different words are involved. That being the case, it
>would be so much simpler to take advantage of the opportunity to treat
>paragraph markup as markers instead of containers, since no conflict
>exists in this view because paragraphs in Scripture are entirely
>anonymous, conveying no identifying information at all. (Scriptural
>annotation is a different subject, but doesn't conflict with verses
>as containers and paragraphs as markers, generally, since in most
>cases, marking semantics apply or decisions can be made arbitrarily
>about where to place annotations.)
>
>Finally, I would like to comment further on terminology. I have seen
>XML described as "hierarchical" in the discussion about OSIS and other
>XML formats for scripture. XML is indeed hierarchical in a strict
>sense, but more generally, it is _context-free_. The class of
>context-free languages and grammars includes _regular_ languages and grammars,
>and regular languages and grammars are notably NOT hierarchical, and
>are generally simpler as well. This discussion has treated XML as if
>it is incapable of dealing with regular languages and grammars, which
>of course is not at all the case, and the distinction I make between
>marker and container elements is not applicable to regular languages
>and grammars, which are distinguished from context-free exactly by
>their absence of container elements and use solely of marker elements.
>I.e., marker semantics do not necessarily make a CF language or grammar
>more complex, i.e., context-sensitive or harder, but more likely make
>it regular and thus less complex (though possibly more tedious to
>process, but not more difficult conceptually), if the analysis has been
>appropriate. In the OSIS case, it hasn't yet been appropriate.
>
>I would normally avoid such problems in the interest of avoiding
>conflicts and controversies, but my interests are hopefully the same
>as those of the OSIS community. I would only note that if such a
>small issue as this is a stumbling block, it does not bode well where more
>relevant issues of interpretation and translation of the actual text
>of the Scripture are concerned. Prayerfully, the OSIS community will
>move forward as the Holy Spirit provides opportunity to do so, and my
>comments will do more good than harm in that regard.
>
>-John
--
Patrick Durusau
Patrick at Durusau.net
Chair, V1 - Text Processing: Office and Publishing Systems Interface
Co-Editor, ISO 13250, Topic Maps -- Reference Model
Member, Text Encoding Initiative Board of Directors, 2003-2005
Topic Maps: Human, not artificial, intelligence at work!