[osis-users] Unambiguous and Consistent OSIS for Interchange: Stand-off Markup
Weston Ruter
westonruter at gmail.com
Sun Jan 24 19:16:42 MST 2010
FYI, here's something TEI says about Stand-off Markup:
> It has been noted that stand-off markup has several advantages over embedded
> annotations. In particular, it is possible to produce annotations of a text
> even when the source document is read-only. Furthermore, annotation files
> can be distributed without distributing the source text. Further advantages
> mentioned in the literature are that discontinuous segments of text can be
> combined in a single annotation, that independent parallel coders can
> produce independent annotations, and that different annotation files can
> contain different layers of information. Lastly, it has also been noted that
> this approach is elegant.
>
> But there are also several drawbacks. First, new stand-off annotated layers
> require a separate interpretation, and the layers — although separate —
> depend on each other. Moreover, although all of the information of the
> multiple hierarchies is included, the information may be difficult to access
> using generic methods.
>
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHSO
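
To make a couple of those points concrete, here is a minimal Python sketch. It is not taken from TEI or from any OSIS tool. It assumes a read-only sequence of word tokens with IDs in the w1/p1 style used later in this thread, and targets written in the "#range(start, end)" form from the ESV fragment quoted below. The names TOKENS, resolve, SYNTAX_LAYER, LEXICAL_LAYER and annotation_text are made up for illustration.

```python
import re

# The base text: an immutable, ordered sequence of (id, token) pairs.
# Annotations never modify it, so it could just as well be read-only.
TOKENS = (
    ("w1", "Then"), ("w2", "he"), ("w3", "said"), ("p1", ","),
    ("w4", "Let"), ("w5", "us"), ("w6", "go"), ("p2", "..."),
)
POSITION = {token_id: i for i, (token_id, _) in enumerate(TOKENS)}

# Assumed target syntax, following the fragment in this thread.
RANGE_RE = re.compile(r"^#range\(\s*(\w+)\s*,\s*(\w+)\s*\)$")


def resolve(target):
    """Return the tokens covered by a '#range(start, end)' target, inclusive."""
    match = RANGE_RE.match(target)
    if match is None:
        raise ValueError(f"unsupported target: {target!r}")
    start, end = (POSITION[token_id] for token_id in match.groups())
    if start > end:
        raise ValueError(f"range runs backwards: {target!r}")
    return [text for _, text in TOKENS[start:end + 1]]


# Two independent layers; each could live in its own file and be
# produced by a different annotator without touching TOKENS.
SYNTAX_LAYER = {
    "speech-intro": ["#range(w1, p1)"],
    "speech": ["#range(w4, p2)"],
}
# A single annotation that combines discontinuous segments.
LEXICAL_LAYER = {
    "subject-and-motion-verb": ["#range(w2, w2)", "#range(w6, w6)"],
}


def annotation_text(layer, name):
    """Resolve every segment of one named annotation in a layer."""
    return [resolve(target) for target in layer[name]]


if __name__ == "__main__":
    print(annotation_text(SYNTAX_LAYER, "speech"))
    print(annotation_text(LEXICAL_LAYER, "subject-and-motion-verb"))
```

The drawback TEI mentions shows up here too: nothing in the token list tells a generic XML tool which words belong to "speech", so a small resolver like resolve() has to sit between the layers and the text.
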
On Sun, Jan 24, 2010 at 1:53 PM, Weston Ruter <westonruter at gmail.com> wrote:
> To follow up again, here is the Open Siddur project's writeup on the XML
> schema they came up with (JLPTEI) and why they didn't go with OSIS. The
> problem of concurrent hierarchies was a major concern:
>
>> The primary question then becomes: which structure should be encoded?
>> Prose can be divided into paragraphs and sentences, poetic text can be
>> divided into line groups and verse lines, lists into items and lists, etc.
>> Many parts of the *siddur* have more than one structure on the same text!
>> XML assumes that a document has a pure hierarchical tree structure. This
>> suggests that XML is not an appropriate encoding technology for the *
>> siddur*. At the same time, XML encoding is nearly universally standard
>> and more software tools support XML-based formats than other encoding
>> formats. One of the primary innovations of JLPTEI is its particular encoding
>> of concurrent structural hierarchies. While the idea is not novel, the
>> implementation is. The potential for the existence of concurrent structure
>> is a guiding force in JLPTEI design.
>>
>> The disadvantage of JLPTEI's encoding solutions is that the archival form
>> of the text is not immediately consumable by humans. We are forced to rely
>> extensively on processing software to make the format editable and
>> displayable. The disadvantage, however, is balanced by the encoding format's
>> extensibility and conservation of human labor.
>>
>> The Open Siddur intends to work within open standards whenever possible.
>> In choosing a basis for our encoding, we searched for available encoding
>> standards that would suit our purposes. We seriously considered using Open
>> Scripture Information Standard <http://bibletechnologies.net/> (OSIS), an
>> XML format used for encoding bibles. It was quickly discovered that
>> representations of some of the more advanced features required to encode the
>> liturgy (such as those discussed above) would have to be "hacked" on top of
>> the standard. The Text Encoding Initiative <http://www.tei-c.org/> (TEI)
>> XML format is a de-facto standard within the digital humanities community.
>> It is also specified in well-documented texts, is actively supported by
>> tools, and has a large community built around its use and development.
>> Further, the standard is deliberately extensible using a relatively simple
>> mechanism. The TEI was therefore a natural choice as a basis for our
>> encoding.
>>
> From <http://wiki.jewishliturgy.org/JLPTEI>
>
>
> On Sun, Jan 24, 2010 at 12:37 AM, Weston Ruter <westonruter at gmail.com> wrote:
>
>> Attached is an example of what the ESV could look like as the result of a
>> web service API response for 1 John 5:7-8, including virtual elements and
>> stand-off markup. The relevant fragment:
>>
>> <concurrent>
>> <!--
>> @virtual can be "start", "end", "both", or "none" (default)
>> target attribute used by Open Siddur; Efraim Feinstein notes range()
>> is a TEI-defined XPointer scheme:
>> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATS
>> Alternative would be to use @sID and @eID
>> -->
>> <p virtual="both" target="#range(w6200500701, p6200500812)"
>> /><!--sID="w6200500701" eID="w6200500706b"-->
>> <verse osisID="1John.5.7" target="#range(h6200500601, p6200500706)"
>> /><!--sID="w6200500701" eID="p6200500706"-->
>> <verse osisID="1John.5.8" target="#range(w6200500801, p6200500812)"
>> /><!--sID="w6200500801" eID="p6200500812"-->
>> </concurrent>
>> <content><!-- isn't @scope="1John.5.7-1John.5.8" redundant here? -->
>> <title ID="h6200500601" canonical="false" virtual="true">Testimony
>> Concerning the Son of God</title>
>> <w ID="w6200500701">For</w>
>> <w ID="w6200500702">there</w>
>> <w ID="w6200500703">are</w>
>> <w ID="w6200500704">three</w>
>> <w ID="w6200500705">that</w>
>> <w ID="w6200500706">testify</w><w ID="p6200500706">:</w>
>> <w ID="w6200500801">the</w>
>> <w ID="w6200500802">Spirit</w>
>> <w ID="w6200500803">and</w>
>> <w ID="w6200500804">the</w>
>> <w ID="w6200500805">water</w>
>> <w ID="w6200500806">and</w>
>> <w ID="w6200500807">the</w>
>> <w ID="w6200500808">blood</w><w ID="p6200500808">;</w>
>> <w ID="w6200500809">and</w>
>> <w ID="w6200500810">these</w>
>> <w ID="w6200500811">three</w>
>> <w ID="w6200500812">agree</w><w ID="p6200500812">.</w>
>> </content>
>>
>>
>>
>>
>> On Thu, Jan 21, 2010 at 9:40 AM, Weston Ruter <westonruter at gmail.com> wrote:
>>
>>> Troy:
>>>
>>>> I did say that since OSIS allows different ways to mark the same
>>>> structure, we have an importer which attempts to accept any valid OSIS doc
>>>> and _normalizes_ that doc into a form of OSIS we find easiest for our engine
>>>> to process. It is still OSIS, just a form of OSIS with all structures
>>>> represented in a single way.
>>>>
>>>
>>> Thank you for clarifying this, and also for sharing some of this history
>>> behind the development of OSIS.
>>>
>>>> [We chose to] augment the specification with a 'best practices' doc which
>>>> recommends a single specific method for encoding OSIS.
>>>>
>>>
>>> I don't think I have seen this best practices doc. Is this something you
>>> use internally at CrossWire as part of your importer script? Could you
>>> direct me to it? I like the approach you took, allowing varying OSIS
>>> encodings but recommending only one of them. This is similar to the
>>> development of XHTML 1.0 dialects, where you are allowed to use the
>>> Transitional doctype, but the Strict doctype is recommended. Doing this for
>>> OSIS could answer the need for an unambiguous single markup language. The
>>> best practices document would need to contain the practices that are
>>> endorsed by at least the majority of players; the others could abstain and
>>> still use their preferred (deprecated) dialect of OSIS. Along with this best
>>> practices doc, an official normalizer script that translates OSIS into the
>>> recommended encoding would be great.
>>>
>>> I look forward to your thoughts about stand-off markup encoding of OSIS,
>>> especially how well it might serve as the new recommended way to
>>> unambiguously encode OSIS.
>>>
>>> Thanks!
>>> Weston
>>>
>>>
>>> 2010/1/19 Troy A. Griffitts <scribe at crosswire.org>
>>>
>>>> Weston Ruter wrote:
>>>>
>>>>> ... Troy, as you've said before, you can't actually use OSIS as your
>>>>> raw data format at CrossWire because an OSIS document can be authored in
>>>>> many different ways and so there is much more programming logic that is
>>>>> needed to handle all of the possible OSIS styles.
>>>>>
>>>>
>>>> Hey Weston,
>>>>
>>>> Hope to have time for a thoughtful response to more of your suggestions,
>>>> but just wanted to clear a couple things up first:
>>>>
>>>> I hope I never implied that we can't/don't use OSIS internally as our
>>>> primary markup standard.
>>>>
>>>> I did say that since OSIS allows different ways to mark the same
>>>> structure, we have an importer which attempts to accept any valid OSIS doc
>>>> and _normalizes_ that doc into a form of OSIS we find easiest for our engine
>>>> to process. It is still OSIS, just a form of OSIS with all structures
>>>> represented in a single way.
>>>>
>>>> Even so, we still don't use any plain text format as our "raw data
>>>> format". We typically compress and index documents when they are imported
>>>> into our engine. You can ask our engine for OSIS, HTML, RTF, GBF, ThML, or
>>>> plaintext and it will do its best to give you the data in the requested
>>>> format.
>>>>
>>>> None of this is to argue against your point: OSIS has multiple ways to
>>>> encode a single structure in a document.
>>>>
>>>> The real answer to this is not technical. I too am frustrated with
>>>> this. But many people working at many organizations were consulted when
>>>> developing the OSIS specification. They gave great insights into how they
>>>> work. Sometimes they even made demands with an ultimatum that they would
>>>> absolutely not use the specification if a certain feature was not added to
>>>> the spec.
>>>>
>>>> OSIS could have been technically finished in less than a year. It took
>>>> us 3 years to get buy-in from all the participating organizations.
>>>>
>>>> In the end, the purpose of OSIS was to build collaboration between
>>>> organizations. We could have developed a much easier to use technical
>>>> specification which no one would have used, or conceded to demands to gain
>>>> buy-in and augmented the specification with a 'best practices' doc which
>>>> recommends a single specific method for encoding OSIS. We chose the latter.
>>>>
>>>> Implementing code against the spec now makes our importer a pain in
>>>> the butt to write, but in the end, we get what we want: a single OSIS style
>>>> that our engine knows how to work with, and multiple supporting
>>>> organizations producing OSIS documents.
>>>>
>>>>
>>>> Troy.
>>>>
>>>>
>>>>
>>>>
>>>>> If we could define a single document structure, however, one
>>>>> that is a subset of the freedom that OSIS provides (perhaps taking cues
>>>>> from OXES), we could then have an XML format for scripture that would be
>>>>> suited for efficient interchange and application traversal.
>>>>>
>>>>> Currently we have the problem of two overlapping hierarchies: BSP
>>>>> (book-section-paragraph) and BCV (book-chapter-verse). However, there
>>>>> could potentially be multiple versification systems, so there could be
>>>>> even more than two overlapping hierarchies, which is probably why the
>>>>> <p> element isn't currently milestonable. To get around the problem of
>>>>> overlapping hierarchies, what if we introduced stand-off markup into the
>>>>> equation? The words of scripture themselves could all be located in a flat
>>>>> structure as siblings; then in the header there could be multiple CONCUR
>>>>> sections (views) that list out the elements which belong to the various
>>>>> parts of the hierarchies.
>>>>>
>>>>> For example, the current approach:
>>>>>
>>>>> <p>
>>>>> <verse osisID="Example.1.1" sID="Example.1.1" />
>>>>> <w id="w1">Then</w>
>>>>> <w id="w2">he</w>
>>>>> <w id="w3">said</w><w id="p1">,</w>
>>>>> <q marker="“" sID="Example.1.1.q1" />
>>>>> <w id="w4">Let</w>
>>>>> <w id="w5">us</w>
>>>>> <w id="w6">go</w><w id="p2">...</w>
>>>>> </p>
>>>>> <p>
>>>>> <w id="w7">but</w>
>>>>> <verse eID="Example.1.1" />
>>>>> <verse osisID="Example.1.2" sID="Example.1.2"/>
>>>>> <w id="w8">don't</w>
>>>>> <w id="w9">forget</w>
>>>>> <w id="w10">your</w>
>>>>> <w id="w11">backpack</w><w id="p3">.</w>
>>>>> <q marker="”" eID="Example.1.1.q1" />
>>>>> <verse eID="Example.1.2" />
>>>>> </p>
>>>>>
>>>>>
>>>>>
>>>>> Could instead appear as (I'm making up these element names):
>>>>>
>>>>> <concur>
>>>>> <view type="verse" osisID="Example.1.1" xpointer="range(#w1, #w7)"
>>>>> />
>>>>> <view type="verse" osisID="Example.1.2" xpointer="range(#w8, #q2)"
>>>>> />
>>>>> <view type="quote" xpointer="range(#q1, #q2)" />
>>>>> <view type="para" xpointer="range(#w1, #p2)" />
>>>>> <view type="para" xpointer="range(#w7, #q2)" />
>>>>> </concur>
>>>>> <content>
>>>>> <w id="w1">Then</w>
>>>>> <w id="w2">he</w>
>>>>> <w id="w3">said</w><w id="p1">,</w>
>>>>> <w id="q1">“</w><w id="w4">Let</w>
>>>>> <w id="w5">us</w>
>>>>> <w id="w6">go</w><w id="p2">...</w>
>>>>> <w id="w7">but</w>
>>>>> <w id="w8">don't</w>
>>>>> <w id="w9">forget</w>
>>>>> <w id="w10">your</w>
>>>>> <w id="w11">backpack</w><w id="p3">.</w><w id="q2">”</w>
>>>>> </content>
>>>>>
>>>>> By structuring a document like this, multiple overlapping hierarchies
>>>>> can be cleanly defined, although they are separated from the underlying
>>>>> content. This, however, clears up the confusion as to where the <verse>,
>>>>> <p>, and <q> elements should be placed: in the concur section, they can
>>>>> each share references to the same content elements, so their boundaries
>>>>> are specified at exactly the same location. This means that XML
>>>>> processors would be able to handle each of the hierarchies consistently
>>>>> as they interweave throughout the content data.
>>>>>
>>>>> Efraim Feinstein and James Tauber introduced me to this approach to
>>>>> structuring markup. See also:
>>>>> http://www.tei-c.org/Guidelines/P4/html/NH.html#NHCO
>>>>>
>>>>> Weston
>>>>>
>>>>>
>>>>
>>>
>>
>
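
For reference, the <concur>/<content> example quoted above can be checked mechanically. What follows is a rough Python sketch under a few assumptions. The two sections are wrapped in a hypothetical <doc> root element (the original fragment has no root), the xpointer attribute always has the form "range(#first, #last)", and the function names load_views, crosses and crossing_pairs are invented here. It resolves each view to positions in the flat word list and reports which views cross, meaning they overlap without one containing the other. That is the case that cannot be written as nested XML elements.

```python
import itertools
import xml.etree.ElementTree as ET

DOC = """<doc>
<concur>
  <view type="verse" osisID="Example.1.1" xpointer="range(#w1, #w7)" />
  <view type="verse" osisID="Example.1.2" xpointer="range(#w8, #q2)" />
  <view type="quote" xpointer="range(#q1, #q2)" />
  <view type="para" xpointer="range(#w1, #p2)" />
  <view type="para" xpointer="range(#w7, #q2)" />
</concur>
<content>
  <w id="w1">Then</w>
  <w id="w2">he</w>
  <w id="w3">said</w><w id="p1">,</w>
  <w id="q1">“</w><w id="w4">Let</w>
  <w id="w5">us</w>
  <w id="w6">go</w><w id="p2">...</w>
  <w id="w7">but</w>
  <w id="w8">don't</w>
  <w id="w9">forget</w>
  <w id="w10">your</w>
  <w id="w11">backpack</w><w id="p3">.</w><w id="q2">”</w>
</content>
</doc>"""


def load_views(xml_text):
    """Return the <w> elements and a list of (label, first_index, last_index)."""
    root = ET.fromstring(xml_text)
    words = root.findall("./content/w")
    position = {w.get("id"): i for i, w in enumerate(words)}
    spans = []
    for view in root.findall("./concur/view"):
        # Strip the assumed "range(" prefix and ")" suffix, then split the IDs.
        inner = view.get("xpointer")[len("range("):-1]
        first, last = (part.strip().lstrip("#") for part in inner.split(","))
        # Views without an osisID get a label built from type and start ID.
        label = view.get("osisID") or f"{view.get('type')}@{first}"
        spans.append((label, position[first], position[last]))
    return words, spans


def crosses(a, b):
    """True when two inclusive spans overlap but neither contains the other."""
    (_, a0, a1), (_, b0, b1) = a, b
    return a0 < b0 <= a1 < b1 or b0 < a0 <= b1 < a1


def crossing_pairs(spans):
    """List the label pairs of all views that cross each other."""
    return [(a[0], b[0])
            for a, b in itertools.combinations(spans, 2)
            if crosses(a, b)]


if __name__ == "__main__":
    words, spans = load_views(DOC)
    for first, second in crossing_pairs(spans):
        print(first, "crosses", second)
```

On this sample, verse Example.1.1 crosses both the quote and the second paragraph, and the quote crosses the first paragraph. None of those pairs could be nested containers in one tree, but as stand-off ranges they coexist without milestones.
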