[osis-users] Unambiguous and Consistent OSIS for Interchange: Stand-off Markup
Efraim Feinstein
efraim.feinstein at gmail.com
Sun Jan 24 15:21:23 MST 2010
Hi,
While I'm reasonably convinced JLPTEI's approach is good for an archival
format, I'm not convinced it's a good interchange format. Interchange
of the type that APIs do takes place on a much more ad-hoc basis, and
the amount of processing that's needed to get a stand-off solution
working may not be worth it. One thing to consider is that a lot of the
API consumers are going to be JavaScript apps and their XML handling
facilities are not particularly good.
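As a rough illustration of the processing a consumer would need, here is a minimal Python sketch (standard library only) that resolves range pointers back into token runs. It assumes the made-up <concur>/<view>/<content> vocabulary from Weston's example quoted below; the <doc> wrapper, the shortened sample, and the function name resolve_views are hypothetical, and the range() parsing is a simplification rather than a real XPointer implementation.

```python
import re
import xml.etree.ElementTree as ET

# Shortened sample in the made-up vocabulary from the quoted example.
# The <doc> wrapper is an assumption so the fragment parses on its own.
SAMPLE = """
<doc>
  <concur>
    <view type="verse" osisID="Example.1.1" xpointer="range(#w1, #w7)" />
    <view type="verse" osisID="Example.1.2" xpointer="range(#w8, #p3)" />
    <view type="para" xpointer="range(#w1, #p2)" />
    <view type="para" xpointer="range(#w7, #p3)" />
  </concur>
  <content>
    <w id="w1">Then</w><w id="w2">he</w><w id="w3">said</w><w id="p1">,</w>
    <w id="w4">Let</w><w id="w5">us</w><w id="w6">go</w><w id="p2">...</w>
    <w id="w7">but</w><w id="w8">don't</w><w id="w9">forget</w>
    <w id="w10">your</w><w id="w11">backpack</w><w id="p3">.</w>
  </content>
</doc>
"""

# Simplified pattern for pointers of the form range(#start, #end).
RANGE = re.compile(r"range\(\s*#([^,\s]+)\s*,\s*#([^)\s]+)\s*\)")


def resolve_views(xml_text):
    """Return (type, osisID, [token texts]) for every view in the document."""
    root = ET.fromstring(xml_text)
    tokens = list(root.find("content"))
    # Map each token id to its position in the flat content stream.
    position = {tok.get("id"): i for i, tok in enumerate(tokens)}
    resolved = []
    for view in root.find("concur"):
        match = RANGE.fullmatch(view.get("xpointer"))
        if match is None:
            raise ValueError("unsupported pointer: %r" % view.get("xpointer"))
        start, end = (position[name] for name in match.groups())
        # Ranges are inclusive at both ends.
        texts = [tok.text for tok in tokens[start:end + 1]]
        resolved.append((view.get("type"), view.get("osisID"), texts))
    return resolved


if __name__ == "__main__":
    for kind, osis_id, texts in resolve_views(SAMPLE):
        print(kind, osis_id, texts)
```

Even this small case needs an id index and a pointer parser before any text can be displayed, which is the overhead I have in mind for lightweight API consumers.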
From this discussion, it seems that the interchange format with overlap
doesn't have to be valid OSIS. Why not use a technique like
fragmentation/reconstitution
<http://www.tei-c.org/release/doc/tei-p5-doc/en/html/NH.html#NHVE>?
It's what the Open Siddur XSLT transforms output for producing
displayable text.
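A minimal sketch of the idea in Python, assuming a flat token list with paragraphs as the primary hierarchy and verses as the secondary one. The data, the function names, and the use of a part attribute with I/M/F (initial/medial/final) values are illustrative choices modelled on the TEI chapter linked above; this is not the actual output of the Open Siddur transforms.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat input: tokens plus inclusive index ranges per hierarchy.
TOKENS = ["Then", "he", "said", ",", "Let", "us", "go", "...",
          "but", "don't", "forget", "your", "backpack", "."]
PARAS = [(0, 7), (8, 13)]
VERSES = [("Example.1.1", 0, 8), ("Example.1.2", 9, 13)]


def fragment(tokens, paras, verses):
    """Nest verses inside paragraphs, splitting a verse at paragraph breaks."""
    root = ET.Element("div")
    for p_start, p_end in paras:
        para = ET.SubElement(root, "p")
        for osis_id, v_start, v_end in verses:
            lo, hi = max(p_start, v_start), min(p_end, v_end)
            if lo > hi:
                continue  # this verse does not touch this paragraph
            frag = ET.SubElement(para, "verse", osisID=osis_id)
            cut_before, cut_after = lo > v_start, hi < v_end
            if cut_before and cut_after:
                frag.set("part", "M")  # medial fragment
            elif cut_after:
                frag.set("part", "I")  # initial fragment
            elif cut_before:
                frag.set("part", "F")  # final fragment
            frag.text = " ".join(tokens[lo:hi + 1])
    return root


def reconstitute(root):
    """Join the fragments of each verse back together, in document order."""
    pieces = {}
    for frag in root.iter("verse"):
        pieces.setdefault(frag.get("osisID"), []).append(frag.text)
    return {osis_id: " ".join(texts) for osis_id, texts in pieces.items()}


if __name__ == "__main__":
    tree = fragment(TOKENS, PARAS, VERSES)
    print(ET.tostring(tree, encoding="unicode"))
    print(reconstitute(tree))
```

The result is an ordinary well-formed tree that a consumer can display as is; only a client that cares about whole verses has to run the reconstitution step.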
Incidentally, the lack of support for concurrent hierarchies is not the
only reason we didn't use OSIS. The OSIS schema doesn't have the tags
to represent some of the structures we need to represent (which is not
surprising -- it was made for bibles, not liturgy) and it has no defined
extension mechanism.
-Efraim
Weston Ruter wrote:
> To follow up again, here is the Open Siddur project's writeup on the
> XML schema they came up with (JLPTEI) and why they didn't go with
> OSIS. The problem of concurrent hierarchies was a major concern:
>
> The primary question then becomes: which structure should be
> encoded? Prose can be divided into paragraphs and sentences,
> poetic text can be divided into line groups and verse lines, lists
> into items and lists, etc. Many parts of the /siddur/ have more
> than one structure on the same text! XML assumes that a document
> has a pure hierarchical tree structure. This suggests that XML is
> not an appropriate encoding technology for the /siddur/. At the
> same time, XML encoding is nearly universally standard and more
> software tools support XML-based formats than other encoding
> formats. One of the primary innovations of JLPTEI is its
> particular encoding of concurrent structural hierarchies. While
> the idea is not novel, the implementation is. The potential for
> the existence of concurrent structure is a guiding force in JLPTEI
> design.
>
> The disadvantage of JLPTEI's encoding solutions is that the
> archival form of the text is not immediately consumable by humans.
> We are forced to rely extensively on processing software to make
> the format editable and displayable. The disadvantage, however, is
> balanced by the encoding format's extensibility and conservation
> of human labor.
>
> The Open Siddur intends to work within open standards whenever
> possible. In choosing a basis for our encoding, we searched for
> available encoding standards that would suit our purposes. We
> seriously considered using Open Scripture Information Standard
> <http://bibletechnologies.net/> (OSIS), an XML format used for
> encoding bibles. It was quickly discovered that representations of
> some of the more advanced features required to encode the liturgy
> (such as those discussed above) would have to be "hacked" on top
> of the standard. The Text Encoding Initiative
> <http://www.tei-c.org/> (TEI) XML format is a de-facto standard
> within the digital humanities community. It is also specified
> in well-documented texts, is actively supported by tools, and has
> a large community built around its use and development. Further,
> the standard is deliberately extensible using a relatively simple
> mechanism. The TEI was therefore a natural choice as a basis for
> our encoding.
>
> From <http://wiki.jewishliturgy.org/JLPTEI>
>
> On Sun, Jan 24, 2010 at 12:37 AM, Weston Ruter <westonruter at gmail.com> wrote:
>
> Attached is an example of what the ESV could look like as the
> result of a web service API response for 1 John 5:7-8, including
> virtual elements and stand-off markup. The relevant fragment:
>
> <concurrent>
> <!--
> @virtual can be "start", "end", "both", or "none" (default)
> target attribute used by Open Siddur; Efraim Feinstein notes
> range()
> is a TEI-defined XPointer scheme:
> http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATS
> Alternative would be to use @sID and @eID
> -->
> <p virtual="both" target="#range(w6200500701, w6200500812)"
> /><!--sID="w6200500701" eID="w6200500706b"-->
> <verse osisID="1John.5.7" target="#range(h6200500601,
> p6200500706)" /><!--sID="w6200500701" eID="p6200500706"-->
> <verse osisID="1John.5.8" target="#range(w6200500801,
> p6200500812)" /><!--sID="w6200500801" eID="p6200500812"-->
> </concurrent>
> <content><!-- isn't @scope="1John.5.7-1John.5.8" redundant here? -->
> <title ID="h6200500601" canonical="false"
> virtual="true">Testimony Concerning the Son of God</title>
> <w ID="w6200500701">For</w>
> <w ID="w6200500702">there</w>
> <w ID="w6200500703">are</w>
> <w ID="w6200500704">three</w>
> <w ID="w6200500705">that</w>
> <w ID="w6200500706">testify</w><w ID="p6200500706">:</w>
> <w ID="w6200500801">the</w>
> <w ID="w6200500802">Spirit</w>
> <w ID="w6200500803">and</w>
> <w ID="w6200500804">the</w>
> <w ID="w6200500805">water</w>
> <w ID="w6200500806">and</w>
> <w ID="w6200500807">the</w>
> <w ID="w6200500808">blood</w><w ID="p6200500808">;</w>
> <w ID="w6200500809">and</w>
> <w ID="w6200500810">these</w>
> <w ID="w6200500811">three</w>
> <w ID="w6200500812">agree</w><w ID="p6200500812">.</w>
> </content>
>
>
>
>
> On Thu, Jan 21, 2010 at 9:40 AM, Weston Ruter <westonruter at gmail.com> wrote:
>
> Troy:
>
> I did say that since OSIS allows different ways to mark
> the same structure, we have an importer which attempts to
> accept any valid OSIS doc and _normalizes_ that doc into a
> form of OSIS we find easiest for our engine to process.
> It is still OSIS, just a form of OSIS with all structures
> represented in a single way.
>
>
> Thank you for clarifying this, and also for sharing some of
> this history behind the development of OSIS.
>
> [We chose to] augment the specification with a 'best
> practices' doc which recommends a single specific method
> for encoding OSIS.
>
>
> I don't think I have seen this best practices doc. Is this
> something you use internally at CrossWire as part of your
> importer script? Could you direct me to it? I like the
> approach you took, allowing varying OSIS encodings but
> recommending only one of them. This is similar to the
> development of XHTML 1.0 dialects, where you are allowed to
> use the Transitional doctype, but the Strict doctype is
> recommended. Doing this for OSIS could answer the need for an
> unambiguous single markup language. The best practices
> document would need to contain the practices that are endorsed
> by at least the majority of players; the others could abstain
> and still use their preferred (deprecated) dialect of OSIS.
> Along with this best practices doc, an official normalizer
> script that translates OSIS into the recommended encoding
> would be great.
>
> I look forward to your thoughts about stand-off markup
> encoding of OSIS, especially how well it might serve as the
> new recommended way to unambiguously encode OSIS.
>
> Thanks!
> Weston
>
>
> 2010/1/19 Troy A. Griffitts <scribe at crosswire.org>
>
> Weston Ruter wrote:
>
> ... Troy, as you've said before, you can't actually
> use OSIS as your raw data format at CrossWire because
> an OSIS document can be authored in many different
> ways and so there is much more programming logic that
> is needed to handle all of the possible OSIS styles.
>
>
> Hey Weston,
>
> Hope to have time for a thoughtful response to more of
> your suggestions, but just wanted to clear a couple things
> up first:
>
> I hope I never implied that we can't/don't use OSIS
> internally as our primary markup standard.
>
> I did say that since OSIS allows different ways to mark
> the same structure, we have an importer which attempts to
> accept any valid OSIS doc and _normalizes_ that doc into a
> form of OSIS we find easiest for our engine to process.
> It is still OSIS, just a form of OSIS with all structures
> represented in a single way.
>
> Even so, we still don't use any plain text format as our
> "raw data format". We typically compress and index
> documents when they are imported into our engine. You can
> ask our engine for OSIS, HTML, RTF, GBF, ThML, or
> plaintext and it will do its best to give you the data in
> the requested format.
>
> None of this is to argue against your point: OSIS has
> multiple ways to encode a single structure in a document.
>
> The real answer to this is not technical. I too am
> frustrated with this. But many people working at many
> organizations were consulted when developing the OSIS
> specification. They gave great insights into how they work.
> Sometimes they even made demands with an ultimatum that
> they would absolutely not use the specification if a
> certain feature was not added to the spec.
>
> OSIS could have been technically finished in less than a
> year. It took us 3 years to get buy-in from all the
> participating organizations.
>
> In the end, the purpose of OSIS was to build collaboration
> between organizations. We could have developed a much
> easier to use technical specification which no one would
> have used, or conceded to demands to gain buy-in, and
> augment the specification with a 'best practices' doc
> which recommends a single specific method for encoding
> OSIS. We chose the latter.
>
> Implementing code against the spec now, it makes our
> importer a pain in the butt to write, but in the end, we
> get what we want: a single OSIS style that our engine
> knows how to work with, and multiple supporting
> organizations producing OSIS documents.
>
>
> Troy.
>
>
>
>
> If we could define a single document structure, however, one
> that is a subset of the freedom that OSIS provides
> (perhaps taking cues from OXES), we could then have an
> XML format for scripture that would be suited for
> efficient interchange and application traversal.
>
> Currently we have the problem of two overlapping
> hierarchies: BSP and BCV. However, there could be
> potentially multiple versification systems, so there
> could be even more than two overlapping hierarchies,
> which is probably why the <p> element isn't currently
> milestonable. To get around the problem of overlapping
> hierarchies, what if we introduced stand-off markup
> into the equation? The words of scripture themselves
> could all be located in a flat structure as siblings;
> then in the header there could be multiple CONCUR
> sections (views) that list out the elements which
> belong to the various parts of the hierarchies
>
> For example, the current approach:
>
> <p>
> <verse osisID="Example.1.1" sID="Example.1.1" />
> <w id="w1">Then</w>
> <w id="w2">he</w>
> <w id="w3">said</w><w id="p1">,</w>
> <q marker="“" sID="Example.1.1.q1" />
> <w id="w4">Let</w>
> <w id="w5">us</w>
> <w id="w6">go</w><w id="p2">...</w>
> </p>
> <p>
> <w id="w7">but</w>
> <verse eID="Example.1.1" />
> <verse osisID="Example.1.2" sID="Example.1.2"/>
> <w id="w8">don't</w>
> <w id="w9">forget</w>
> <w id="w10">your</w>
> <w id="w11">backpack</w><w id="p3">.</w>
> <q marker="”" eID="Example.1.1.q1" />
> <verse eID="Example.1.2" />
> </p>
>
>
>
> Could instead appear as (I'm making up these element
> names):
>
> <concur>
> <view type="verse" osisID="Example.1.1"
> xpointer="range(#w1, #w7)" />
> <view type="verse" osisID="Example.1.2"
> xpointer="range(#w8, #q2)" />
> <view type="quote" xpointer="range(#q1, #q2)" />
> <view type="para" xpointer="range(#w1, #p2)" />
> <view type="para" xpointer="range(#w7, #q2)" />
> </concur>
> <content>
> <w id="w1">Then</w>
> <w id="w2">he</w>
> <w id="w3">said</w><w id="p1">,</w>
> <w id="q1">“</w><w id="w4">Let</w>
> <w id="w5">us</w>
> <w id="w6">go</w><w id="p2">...</w>
> <w id="w7">but</w>
> <w id="w8">don't</w>
> <w id="w9">forget</w>
> <w id="w10">your</w>
> <w id="w11">backpack</w><w id="p3">.</w><w id="q2">”</w>
> </content>
> By structuring a document like this, multiple
> overlapping hierarchies can be cleanly defined,
> although they are separated from the underlying
> content. This, however, provides the benefit of
> clearing up the confusion as to where the <verse>,
> <p>, and <q> elements should be placed: in the concur
> section, they each can share references to the same
> content elements and so their boundaries are specified
> at the exact same location. This means that XML
> processors would be able to consistently handle each
> of the hierarchies as they interweave throughout the
> content data.
>
> Efraim Feinstein and James Tauber introduced me to
> this approach to structuring markup. See also:
> http://www.tei-c.org/Guidelines/P4/html/NH.html#NHCO
>
> Weston
>
>
>
>
>
--
---
Efraim Feinstein
Lead Developer
Open Siddur Project
http://opensiddur.net
http://wiki.jewishliturgy.org