[osis-users] Unambiguous and Consistent OSIS for Interchange: Stand-off Markup

Weston Ruter westonruter at gmail.com
Sun Jan 24 19:16:42 MST 2010

FYI, here's something TEI says about Stand-off Markup:

> It has been noted that stand-off markup has several advantages over embedded
> annotations. In particular, it is possible to produce annotations of a text
> even when the source document is read-only. Furthermore, annotation files
> can be distributed without distributing the source text. Further advantages
> mentioned in the literature are that discontinuous segments of text can be
> combined in a single annotation, that independent parallel coders can
> produce independent annotations, and that different annotation files can
> contain different layers of information. Lastly, it has also been noted that
> this approach is elegant.
> But there are also several drawbacks. First, new stand-off annotated layers
> require a separate interpretation, and the layers — although separate —
> depend on each other. Moreover, although all of the information of the
> multiple hierarchies is included, the information may be difficult to access
> using generic methods.
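
To make two of those points concrete (a read-only source, and one annotation
joining discontinuous segments), here is a minimal Python sketch. It is not
TEI or OSIS tooling: the names SOURCE, ANNOTATIONS, and resolve are made up
for illustration, the tokens are borrowed from the Example.1.1 sample quoted
further down, and punctuation tokens are left out for brevity.

```python
# Read-only source text: (id, token) pairs that annotators never modify.
SOURCE = (
    ("w1", "Then"), ("w2", "he"), ("w3", "said"),
    ("w4", "Let"), ("w5", "us"), ("w6", "go"),
    ("w7", "but"), ("w8", "don't"), ("w9", "forget"),
    ("w10", "your"), ("w11", "backpack"),
)

# Stand-off layer: kept apart from the source and referring to it only by id.
# Each annotation is a list of (start_id, end_id) segments, so a single
# annotation can cover discontinuous stretches of text.
ANNOTATIONS = {
    "imperatives": [("w4", "w6"), ("w8", "w11")],
}


def resolve(segments, source=SOURCE):
    """Return the tokens covered by a list of inclusive id ranges."""
    ids = [token_id for token_id, _ in source]
    words = []
    for start_id, end_id in segments:
        start, end = ids.index(start_id), ids.index(end_id)
        if start > end:
            raise ValueError(f"range {start_id}..{end_id} runs backwards")
        words.extend(text for _, text in source[start:end + 1])
    return words


print(resolve(ANNOTATIONS["imperatives"]))
```

The annotation layer could be shipped on its own, since it carries ids and no
text. That is also the drawback TEI mentions: nothing is readable until the
layer is resolved against the source.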

On Sun, Jan 24, 2010 at 1:53 PM, Weston Ruter <westonruter at gmail.com> wrote:

> To follow up again, here is the Open Siddur project's writeup on the XML
> schema they came up with (JLPTEI) and why they didn't go with OSIS. The
> problem of concurrent hierarchies was a major concern:
>> The primary question then becomes: which structure should be encoded?
>> Prose can be divided into paragraphs and sentences, poetic text can be
>> divided into line groups and verse lines, lists into items and lists, etc.
>> Many parts of the *siddur* have more than one structure on the same text!
>> XML assumes that a document has a pure hierarchical tree structure. This
>> suggests that XML is not an appropriate encoding technology for the *
>> siddur*. At the same time, XML encoding is nearly universally standard
>> and more software tools support XML-based formats than other encoding
>> formats. One of the primary innovations of JLPTEI is its particular encoding
>> of concurrent structural hierarchies. While the idea is not novel, the
>> implementation is. The potential for the existence of concurrent structure
>> is a guiding force in JLPTEI design.
>> The disadvantage of JLPTEI's encoding solutions is that the archival form
>> of the text is not immediately consumable by humans. We are forced to rely
>> extensively on processing software to make the format editable and
>> displayable. The disadvantage, however, is balanced by the encoding format's
>> extensibility and conservation of human labor.
>> The Open Siddur intends to work within open standards whenever possible.
>> In choosing a basis for our encoding, we searched for available encoding
>> standards that would suit our purposes. We seriously considered using Open
>> Scripture Information Standard <http://bibletechnologies.net/> (OSIS), an
>> XML format used for encoding bibles. It was quickly discovered that
>> representations of some of the more advanced features required to encode the
>> liturgy (such as those discussed above) would have to be "hacked" on top of
>> the standard. The Text Encoding Initiative <http://www.tei-c.org/> (TEI)
>> XML format is a de-facto standard within the digital humanities community.
>> It is also specified in well-documented texts, is actively supported by
>> tools, and has a large community built around its use and development.
>> Further, the standard is deliberately extensible using a relatively simple
>> mechanism. The TEI was therefore a natural choice as a basis for our
>> encoding.
> From <http://wiki.jewishliturgy.org/JLPTEI>
> On Sun, Jan 24, 2010 at 12:37 AM, Weston Ruter <westonruter at gmail.com> wrote:
>> Attached is an example of what the ESV could look like as the result of a
>> web service API response for 1 John 5:7-8, including virtual elements and
>> stand-off markup. The relevant fragment:
>> <concurrent>
>>     <!--
>>     @virtual can be "start", "end", "both", or "none" (default)
>>     target attribute used by Open Siddur; Efraim Feinstein notes range()
>>     is a TEI-defined XPointer scheme:
>>     http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATS
>>     Alternative would be to use @sID and @eID
>>     -->
>>     <p virtual="both" target="#range(w6200500701, w6200500812)"
>> /><!--sID="w6200500701" eID="w6200500706b"-->
>>     <verse osisID="1John.5.7" target="#range(h6200500601, p6200500706)"
>> /><!--sID="w6200500701" eID="p6200500706"-->
>>     <verse osisID="1John.5.8" target="#range(w6200500801, p6200500812)"
>> /><!--sID="w6200500801" eID="p6200500812"-->
>> </concurrent>
>> <content><!-- isn't @scope="1John.5.7-1John.5.8" redundant here? -->
>>     <title ID="h6200500601" canonical="false" virtual="true">Testimony
>> Concerning the Son of God</title>
>>     <w ID="w6200500701">For</w>
>>     <w ID="w6200500702">there</w>
>>     <w ID="w6200500703">are</w>
>>     <w ID="w6200500704">three</w>
>>     <w ID="w6200500705">that</w>
>>     <w ID="w6200500706">testify</w><w ID="p6200500706">:</w>
>>     <w ID="w6200500801">the</w>
>>     <w ID="w6200500802">Spirit</w>
>>     <w ID="w6200500803">and</w>
>>     <w ID="w6200500804">the</w>
>>     <w ID="w6200500805">water</w>
>>     <w ID="w6200500806">and</w>
>>     <w ID="w6200500807">the</w>
>>     <w ID="w6200500808">blood</w><w ID="p6200500808">;</w>
>>     <w ID="w6200500809">and</w>
>>     <w ID="w6200500810">these</w>
>>     <w ID="w6200500811">three</w>
>>     <w ID="w6200500812">agree</w><w ID="p6200500812">.</w>
>> </content>
>> On Thu, Jan 21, 2010 at 9:40 AM, Weston Ruter <westonruter at gmail.com> wrote:
>>> Troy:
>>> I did say that since OSIS allows different ways to mark the same
>>>> structure, we have an importer which attempts to accept any valid OSIS doc
>>>> and _normalizes_ that doc into a form of OSIS we find easiest for our engine
>>>> to process.  It is still OSIS, just a form of OSIS with all structures
>>>> represented in a single way.
>>> Thank you for clarifying this, and also for sharing some of this history
>>> behind the development of OSIS.
>>> [We chose to] augment the specification with a 'best practices' doc which
>>>> recommends a single specific method for encoding OSIS.
>>> I don't think I have seen this best practices doc. Is this something you
>>> use internally at CrossWire as part of your importer script? Could you
>>> direct me to it? I like the approach you took, allowing varying OSIS
>>> encodings but recommending only one of them. This is similar to the
>>> development of XHTML 1.0 dialects, where you are allowed to use the
>>> Transitional doctype, but the Strict doctype is recommended. Doing this for
>>> OSIS could answer the need for an unambiguous single markup language. The
>>> best practices document would need to contain the practices that are
>>> endorsed by at least the majority of players; the others could abstain and
>>> still use their preferred (deprecated) dialect of OSIS. Along with this best
>>> practices doc, an official normalizer script that translates OSIS into the
>>> recommended encoding would be great.
>>> I look forward to your thoughts about stand-off markup encoding of OSIS,
>>> especially how well it might serve as the new recommended way to
>>> unambiguously encode OSIS.
>>> Thanks!
>>> Weston
>>> 2010/1/19 Troy A. Griffitts <scribe at crosswire.org>
>>> Weston Ruter wrote:
>>>>> ... Troy, as you've said before, you can't actually use OSIS as your
>>>>> raw data format at CrossWire because an OSIS document can be authored in
>>>>> many different ways and so there is much more programming logic that is
>>>>> needed to handle all of the possible OSIS styles.
>>>> Hey Weston,
>>>> Hope to have time for a thoughtful response to more of your suggestions,
>>>> but just wanted to clear a couple things up first:
>>>> I hope I never implied that we can't/don't use OSIS internally as our
>>>> primary markup standard.
>>>> I did say that since OSIS allows different ways to mark the same
>>>> structure, we have an importer which attempts to accept any valid OSIS doc
>>>> and _normalizes_ that doc into a form of OSIS we find easiest for our engine
>>>> to process.  It is still OSIS, just a form of OSIS with all structures
>>>> represented in a single way.
>>>> Even so, we still don't use any plain text format as our "raw data
>>>> format".  We typically compress and index documents when they are imported
>>>> into our engine.  You can ask our engine for OSIS, HTML, RTF, GBF, ThML, or
>>>> plaintext and it will do its best to give you the data in the requested
>>>> format.
>>>> None of this is to argue against your point: OSIS has multiple ways to
>>>> encode a single structure in a document.
>>>> The real answer to this is not technical.  I too am frustrated with
>>>> this.  But many people working at many organizations were consulted when
>>>> developing the OSIS specification.  They gave great insights to how they
>>>> work.  Sometimes they even made demands with an ultimatum that they would
>>>> absolutely not use the specification if a certain feature was not added to
>>>> the spec.
>>>> OSIS could have been technically finished in less than a year.  It took
>>>> us 3 years to get buy-in from all the participating organizations.
>>>> In the end, the purpose of OSIS was to build collaboration between
>>>> organizations.  We could have developed a much easier to use technical
>>>> specification which no one would have used, or conceded to demands to gain
>>>> buy-in and augmented the specification with a 'best practices' doc which
>>>> recommends a single specific method for encoding OSIS.  We chose the latter.
>>>> Implementing code against the spec now, it makes our importer a pain in
>>>> the butt to write, but in the end, we get what we want: a single OSIS style
>>>> that our engine knows how to work with, and multiple supporting
>>>> organizations producing OSIS documents.
>>>> Troy.
>>>> If we could define a single document structure, however, one
>>>>> that is a subset of the freedom that OSIS provides (perhaps taking cues
>>>>> from OXES), we could then have an XML format for scripture that would be
>>>>> suited for efficient interchange and application traversal.
>>>>> Currently we have the problem of two overlapping hierarchies: BSP and
>>>>> BCV. However, there could be potentially multiple versification systems, so
>>>>> there could be even more than two overlapping hierarchies, which is probably why the
>>>>> <p> element isn't currently milestonable. To get around the problem of
>>>>> overlapping hierarchies, what if we introduced stand-off markup into the
>>>>> equation? The words of scripture themselves could all be located in a flat
>>>>> structure as siblings; then in the header there could be multiple CONCUR
>>>>> sections (views) that list out the elements which belong to the various
>>>>> parts of the hierarchies.
>>>>> For example, the current approach:
>>>>> <p>
>>>>>    <verse osisID="Example.1.1" sID="Example.1.1" />
>>>>>    <w id="w1">Then</w>
>>>>>    <w id="w2">he</w>
>>>>>    <w id="w3">said</w><w id="p1">,</w>
>>>>>    <q marker="“" sID="Example.1.1.q1" />
>>>>>        <w id="w4">Let</w>
>>>>>        <w id="w5">us</w>
>>>>>        <w id="w6">go</w><w id="p2">...</w>
>>>>> </p>
>>>>> <p>
>>>>>    <w id="w7">but</w>
>>>>>    <verse eID="Example.1.1" />
>>>>>    <verse osisID="Example.1.2" sID="Example.1.2"/>
>>>>>    <w id="w8">don't</w>
>>>>>    <w id="w9">forget</w>
>>>>>    <w id="w10">your</w>
>>>>>    <w id="w11">backpack</w><w id="p3">.</w>
>>>>>    <q marker="”" eID="Example.1.1.q1" />
>>>>>    <verse eID="Example.1.2" />
>>>>> </p>
>>>>> Could instead appear as (I'm making up these element names):
>>>>> <concur>
>>>>>    <view type="verse" osisID="Example.1.1" xpointer="range(#w1, #w7)"
>>>>> />
>>>>>    <view type="verse" osisID="Example.1.2" xpointer="range(#w8, #q2)"
>>>>> />
>>>>>    <view type="quote" xpointer="range(#q1, #q2)" />
>>>>>    <view type="para"  xpointer="range(#w1, #p2)" />
>>>>>    <view type="para"  xpointer="range(#w7, #q2)" />
>>>>> </concur>
>>>>> <content>
>>>>>    <w id="w1">Then</w>
>>>>>    <w id="w2">he</w>
>>>>>    <w id="w3">said</w><w id="p1">,</w>
>>>>>    <w id="q1">“</w><w id="w4">Let</w>
>>>>>    <w id="w5">us</w>
>>>>>    <w id="w6">go</w><w id="p2">...</w>
>>>>>    <w id="w7">but</w>
>>>>>    <w id="w8">don't</w>
>>>>>    <w id="w9">forget</w>
>>>>>    <w id="w10">your</w>
>>>>>    <w id="w11">backpack</w><w id="p3">.</w><w id="q2">”</w>
>>>>> </content>
>>>>> By structuring a document like this, multiple overlapping hierarchies
>>>>> can be cleanly defined, although they are separated from the underlying
>>>>> content; this, however, provides the benefit of clearing up the confusion as
>>>>> to where the <verse>, <p>, and <q> elements should be placed: in the concur
>>>>> section, they each can share references to the same content elements and so
>>>>> their boundaries are specified at the exact same location. This means that
>>>>> XML processors would be able to consistently handle each of the hierarchies
>>>>> as they interweave throughout the content data.
>>>>> Efraim Feinstein and James Tauber introduced me to this approach to
>>>>> structuring markup. See also:
>>>>> http://www.tei-c.org/Guidelines/P4/html/NH.html#NHCO
>>>>> Weston
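
The concur/content proposal in the oldest message of this thread can be tried
out with a few lines of Python. The sketch below is only an illustration under
stated assumptions: the <doc> wrapper element, the RANGE regular expression,
and the resolve_views function are invented here, and the "range(#a, #b)"
string is parsed as plain text rather than through a real XPointer processor.
The element and attribute names are the made-up ones from that message.

```python
import re
import xml.etree.ElementTree as ET

# The example from the thread, wrapped in a hypothetical <doc> root so that
# it parses as one well-formed document.
DOC = """<doc>
<concur>
   <view type="verse" osisID="Example.1.1" xpointer="range(#w1, #w7)" />
   <view type="verse" osisID="Example.1.2" xpointer="range(#w8, #q2)" />
   <view type="quote" xpointer="range(#q1, #q2)" />
   <view type="para"  xpointer="range(#w1, #p2)" />
   <view type="para"  xpointer="range(#w7, #q2)" />
</concur>
<content>
   <w id="w1">Then</w>
   <w id="w2">he</w>
   <w id="w3">said</w><w id="p1">,</w>
   <w id="q1">“</w><w id="w4">Let</w>
   <w id="w5">us</w>
   <w id="w6">go</w><w id="p2">...</w>
   <w id="w7">but</w>
   <w id="w8">don't</w>
   <w id="w9">forget</w>
   <w id="w10">your</w>
   <w id="w11">backpack</w><w id="p3">.</w><w id="q2">”</w>
</content>
</doc>"""

# Matches the ad hoc "range(#start, #end)" syntax used in the example.
RANGE = re.compile(r"range\(\s*#([\w.-]+)\s*,\s*#([\w.-]+)\s*\)")


def resolve_views(xml_text):
    """Return (type, osisID, tokens) for every <view> in the document."""
    root = ET.fromstring(xml_text)
    tokens = root.findall("./content/w")

    # Map each id to its position in the flat token list; a duplicated id
    # would make range targets ambiguous, so reject it.
    position = {}
    for index, w in enumerate(tokens):
        if w.get("id") in position:
            raise ValueError(f"duplicate id: {w.get('id')}")
        position[w.get("id")] = index

    resolved = []
    for view in root.findall("./concur/view"):
        match = RANGE.fullmatch(view.get("xpointer"))
        if match is None:
            raise ValueError(f"unrecognised pointer: {view.get('xpointer')}")
        # An id that is not in <content> raises KeyError here.
        start, end = (position[token_id] for token_id in match.groups())
        words = [w.text for w in tokens[start:end + 1]]
        resolved.append((view.get("type"), view.get("osisID"), words))
    return resolved


for view_type, osis_id, words in resolve_views(DOC):
    print(view_type, osis_id, " ".join(words))
```

With this data the Example.1.1 verse view and the second para view both
contain w7 ("but"), which is the overlap that forces milestones in the
embedded form. Here each hierarchy is read independently from the same flat
token list.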