[sword-devel] Food for thought regarding OSIS and some of its alternatives...

Tue Feb 7 01:44:41 MST 2006

MPJ,
	Hello my friend.  It's good to hear from you.  It seems like two-thirds of 
your issues with OSIS have to do with <q>.  May I suggest patience 
until you can review what came out of the last OSIS meeting back in 
December?  We had a good hard look at practical uses of <q>, and believe 
me, your concerns have been heard.

	I agree OSIS allows too many legal ways to mark up the same text.

	I disagree that OSIS HAS TO BE too complex for people to use, or cannot 
fully capture any other Bible markup.  Worst case, you always have <seg> 
and x- attribute values.

	I disagree that OSIS has slowed down development here.  It COULD HAVE 
slowed down development here if we tried to actually work on our 
osis2mod converter to handle a broader range of legal OSIS markup, but 
up to now, we pretty much encode our OSIS texts the way we want and that 
pretty much is defined by what our OSIS importer expects.

	I think you hit the nail on the head when you talked about XML catering 
to a tree view, which is NOT how written documents like Bibles are 
structured.  This has posed the largest problem, in my opinion, for 
OSIS as an XML schema.

	Basically, to sum up and offer a challenge: how, in legal XML, do you 
mark up multiple overlapping hierarchies like:

paragraph markers
verse markers
chapter markers
quote markers (nested)
poetry lines
linguistic annotation
critical source apparatus

We struggled with this and came up with what you suggest in your paper: 
milestones.  And I think we've tried our best to make the milestoning 
syntax straightforward:

<verse osisID="Jn.1.1">
In the beginning...
</verse>

or:

<verse sID="uniqueID" osisID="Jn.1.1" />
In the beginning...
<verse eID="uniqueID" />

I agree that you lose much of the advantage of XML processing tools that 
depend on a DOM tree hierarchy.
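
For anyone following along, here is a minimal Python sketch of that 
trade-off, using only the standard library's xml.etree.ElementTree.  The 
sample fragments, the <div> wrapper, and the helper names is_well_formed 
and verse_texts are made up for illustration; this is not how osis2mod or 
any of our importers work.  It only shows that a quote crossing a verse 
boundary is not well-formed with container verses, parses fine with 
milestones, and that a flat walk can still recover the verse text.

```python
import xml.etree.ElementTree as ET

# Container verses: a quote that starts in one verse and ends in the next
# cannot be nested properly, so this is not well-formed XML.
OVERLAPPING = (
    '<div>'
    '<verse osisID="Jn.1.1">In the beginning <q>quoted</verse>'
    '<verse osisID="Jn.1.2"> words</q> continue.</verse>'
    '</div>'
)

# Milestone verses: empty elements mark where each verse starts and ends,
# leaving the <q> element free to span the verse boundary.
MILESTONED = (
    '<div>'
    '<verse sID="v1" osisID="Jn.1.1"/>In the beginning <q>quoted'
    '<verse eID="v1"/>'
    '<verse sID="v2" osisID="Jn.1.2"/> words</q> continue.'
    '<verse eID="v2"/>'
    '</div>'
)


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def verse_texts(xml_text):
    """Collect the text between verse milestones, in document order."""
    root = ET.fromstring(xml_text)
    verses = {}
    current = None  # osisID of the verse that is currently open

    def add(text):
        if current is not None and text:
            verses[current] = verses.get(current, "") + text

    def walk(elem):
        nonlocal current
        if elem.tag == "verse" and "sID" in elem.attrib:
            current = elem.attrib["osisID"]
            verses.setdefault(current, "")
        elif elem.tag == "verse" and "eID" in elem.attrib:
            current = None
        else:
            add(elem.text)
        for child in elem:
            walk(child)
            add(child.tail)  # text that follows the child belongs here too

    walk(root)
    return verses


print(is_well_formed(OVERLAPPING))  # False
print(verse_texts(MILESTONED))
```

The price is visible in verse_texts: the verse is no longer a subtree you 
can ask for, so the code has to track which milestone is open while it 
walks the document.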

I don't use XSLT, so I don't really care :)

I do, however, use Java and C++ to parse OSIS just fine-- at least the 
VALID OSIS that we use here.  And I'm quite happy with it.  In fact 
we've built some pretty amazing tools to publish these feature-rich 
documents online, e.g. http://crosswire.org/study/

We STILL support GBF and ThML, but I'm still on the OSIS bandwagon 
because I believe collaboration toward a common markup is invaluable. 
I'd rather support 1 complex markup than 3 different markups that all 
handle the easy stuff 3 different ways, and punt on the harder issues.

I appreciate and share your practical spirit.

	-Troy.

Kahunapule Michael Johnson wrote:
> 
>   Why Use OSIS When USFM and USFX Work Better?
> 
> /By (Kahunapule) Michael Johnson,/ http://kahunapule.org/
> 
> /6 February 2006/
> 
> 
>     Conclusion
> 
> OSIS is a poor choice for a standard Scripture archiving, authoring, and 
> interchange format for members of the Forum of Bible Translation 
> Agencies. Its inadequacies can be patched, but it probably can't be made 
> truly good without violating backward compatibility constraints. Using 
> OSIS is likely to make software development efforts more costly and 
> slower than necessary. OSIS is not better than USFM, overall. I present 
> a viable XML alternative, below. It is likely that other options that 
> are better than OSIS exist or could be created. In the meantime, OSIS 
> should be considered experimental, and not used for production uses like 
> drafting, checking, publishing, or archiving of Scripture unless USFM 
> equivalents are kept up-to-date and stored alongside the OSIS texts.
> 
> 
>     My Biases
> 
> (If you don't know what the Open Scriptural Information Standard is, you 
> can stop reading, now, and ignore both that proposed standard and this 
> document.)
> 
> I have been asked to write about Scripture file formats and the 
> suitability of OSIS for use in the SIL PNG Branch and in EBT. Before I 
> begin, let me explain a little bit about my qualifications to comment on 
> OSIS, what I have done with OSIS, and why I have much more than a 
> passing interest in OSIS.
> 
> I have been interested in electronic publication and distribution of the 
> Holy Bible in various translations since long before I started working 
> on such things full time. While working a day job as a senior software 
> engineer, I would work on weekends and evenings on Bible translation and 
> electronic distribution. Part of the fruit of that is a Public Domain 
> modern English translation of the Holy Bible that is distributed at 
> http://WorldEnglishBible.org/, among 
> other places. (I still do volunteer work on that project from time to 
> time.) Before I knew “Standard Format (SF)” existed and before either 
> XML or the World Wide Web were widely known, I thought through the need 
> for such a format for my own work, and generated a Bible file format 
> that is similar to SF, but differs in some details. I still use that old 
> format (GBF) to generate HTML, PDF, RTF, and other formats. I learned 
> about SF in volunteering for WA UK in keyboarding Scriptures. Later, I 
> joined EBT and (via secondment to the PNG Branch) SIL. I have worked on 
> Bible translation-related software development, mostly, but I also 
> manage the department of the SIL PNG Branch responsible for Scripture 
> typesetting and Scripture archiving.
> 
> My interest and experience related to Scripture file formats is more 
> practical and experiential than theoretical, although I certainly do 
> apply information theory and best practices of software design in my 
> work. (My Master's Thesis was related to information theory, 
> specifically data compression and encryption.) I have written software 
> that reads and writes SF (and especially the UBS preferred dialect of 
> SF, USFM or Unified Standard Format Markup). I have also studied and 
> written software to handle XSEM and OSIS Scripture files. In that 
> process, I have gained some insights.
> 
> I monitor the progress of several open source Scripture-related 
> projects, as well as some of the SIL projects, although I concentrate 
> mostly on producing the tools I'm working on personally. Currently, my 
> main project is the Onyx Scripture Typesetting project. The idea of that 
> project, as well as its actual use, is simple, even though the 
> implementation was not. I provide a program that reads Unicode USFM 
> Scripture files and produces a Microsoft Word 2003 XML (WordML) document 
> that is essentially all typeset except for the front and back matter, 
> pictures, captions, and maps. Those can be inserted manually using 
> Microsoft Word's normal editing facilities. As an added bonus, XML tags 
> can be embedded in the WordML to allow a reverse transformation back to 
> USFM in some cases.
> 
> I started out being in favor of OSIS, and even tried to promote it. I 
> have since repented of that viewpoint, as it probably does more harm 
> than good by discouraging the development and use of much better 
> Scripture XML file schemas.
> 
> 
>     XML Myths Debunked
> 
> Myth 1: Anything in XML is inherently better for archiving and 
> processing than non-XML formats. *False.* XML is just a set of rules 
> defining how text files can define data, with tags, attributes, and 
> contents being easily separated and parsed. One disadvantage of XML is 
> that it forces strict nesting of elements, making it an awkward basis 
> for Biblical texts. (This shortcoming is easy to overcome using 
> milestones, which are empty elements that mark the beginning or end of 
> something. Unfortunately, some ways of doing that, like the one OSIS 
> uses, are error-prone and inelegant.) Just because something is 
> in XML doesn't mean any software that can read XML can make sense out of 
> it. It all depends on the schema used. XML can, in fact, be made 
> arbitrarily obscure and arcane, intentionally or otherwise. It can be 
> encrypted, scrambled, and made to conform to illogical structures. XML 
> data can be arranged in useful or non-useful ways, or any combination 
> thereof.
> 
> Myth 2: XML is better than SF for processing because of the software 
> tools available for processing XML. *False.* There are some good tools 
> available for processing XML and transforming it to other formats, but 
> there are also pretty simple SF parsers available, too. Implementing the 
> latter is actually simpler than the former.
> 
> Myth 3: If data is expressed in XML, it can easily be transformed to 
> other formats. *False.* The data can only be transformed to other 
> formats if all required information for the target format is present in 
> the source format, and segregated with the same granularity. 
> Furthermore, the programming skills necessary to perform these 
> transformations are specialized knowledge that it is not reasonable to 
> expect the average computer user to be fluent in. (The “average” 
> computer user is probably challenged to understand a tree-structured 
> directory file system, let alone XSLT.)
> 
> That said, I like XML, and I like to use some of the software tools 
> available to read, parse, transform, and write XML. However, the 
> suitability of XML to a given task depends strongly on the schema and 
> application.
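
To make Myth 2's point about simple SF parsers concrete, here is a 
minimal sketch in Python.  The sample text and the function names tokenize 
and verses are hypothetical, and the code assumes plain backslash markers 
such as \id, \c, \p and \v; it makes no attempt to resolve where inline 
character styles end, and it is not taken from Paratext, Onyx, or any 
other existing tool.

```python
import re

# Hypothetical sample; \id, \c, \p and \v are ordinary USFM markers.
SAMPLE = r"""\id JHN
\c 1
\p
\v 1 In the beginning was the Word.
\v 2 The same was in the beginning with God.
"""

# A marker is a backslash, letters, optional digits, and an optional "*"
# (the "*" form closes an inline character style, e.g. \nd ... \nd*).
MARKER = re.compile(r"\\([a-z]+[0-9]*\*?)")


def tokenize(sf_text):
    """Split standard-format text into (marker, content) pairs."""
    pieces = MARKER.split(sf_text)
    # pieces = [text before first marker, marker, content, marker, content, ...]
    tokens = []
    for marker, content in zip(pieces[1::2], pieces[2::2]):
        tokens.append((marker, " ".join(content.split())))
    return tokens


def verses(sf_text):
    """Map 'chapter:verse' to verse text, ignoring all other markers."""
    chapter = None
    result = {}
    for marker, content in tokenize(sf_text):
        if marker == "c":
            chapter = content
        elif marker == "v":
            number, _, text = content.partition(" ")
            result[f"{chapter}:{number}"] = text
    return result


print(tokenize(SAMPLE)[:3])  # [('id', 'JHN'), ('c', '1'), ('p', '')]
print(verses(SAMPLE))
```

Because the format is a flat stream of markers, verse, chapter and 
paragraph boundaries never have to nest inside each other, which is why 
the overlap problem discussed for XML does not come up here.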
> 
> 
>     Why I Like USFM
> 
>    1.
> 
>       It is simple to understand, use, and program for. It is simple
>       enough to expect at least 50% of ordinary working linguists (OWLs)
>       to be able to understand and edit even in a plain text editor, at
>       least with the commonly-used features of it.
> 
>    2.
> 
>       It is well documented, and the documentation is maintained and
>       published in accessible formats (HTML and PDF in a way that is
>       easy to mirror on a notebook computer taken to a remote village).
>       The latest documentation is easy to find and clearly labeled with
>       its revision date.
> 
>    3.
> 
>       The maintainers of USFM are responsive to comments and mindful of
>       backward compatibility issues when they make changes.
> 
>    4.
> 
>       USFM is close enough to the (deprecated-but-still-used) PNG SFM
>       that updating to USFM is reasonably painless. (In most cases, just
>       a few global search-and-replace operations do it.)
> 
>    5.
> 
>       USFM is well-enough defined that it makes programming tools to
>       read and write USFM easier to create and maintain than doing the
>       same for generic SFM.
> 
>    6.
> 
>       USFM provides a real and practical measure of cross-entity
>       portability for Scripture texts, opening up more options for
>       typesetting, checking, and software tool creation and use.
> 
>    7.
> 
>       USFM takes full advantage of the time-tested practical aspects of
>       SFM in the experience of Bible translators from multiple
>       organizations, making incremental improvements where appropriate.
> 
>    8.
> 
>       USFM is a simple text-based, easy-to-parse format that is robust,
>       can be read by many software tools, and will not go obsolete due
>       to the obsolescence of any one software tool or company. It is
>       trustworthy for archiving purposes.
> 
>    9.
> 
>       USFM allows the unambiguous encoding of all essential elements of
>       Scripture texts that I'm interested in encoding, including every
>       PNG language, and for that matter, the Scriptures and essential
>       peripherals (footnotes, section titles, etc.) for any language I
>       anticipate encountering.
> 
>   10.
> 
>       In the unlikely event that USFM would be inadequate for a
>       particular language or translation, it would not be difficult to
>       extend it for whatever unusual circumstances might come up.
> 
>   11.
> 
>       USFM has good software support with Paratext, various Microsoft
>       Word macros, Adapt It, Onyx, and various other programs. Future
>       support is being developed in the JAARS Translation Editor.
> 
>   12.
> 
>       USFM is simple enough to program for that it can be used with low
>       power computing devices.
> 
>   13.
> 
>       USFM does nothing to force manual processing of legacy data to
>       make it conform to current standards. Automated conversions from
>       other dialects of SF are possible.
> 
>   14.
> 
>       By using USFM instead of PNG SFM, we can take advantage of new
>       releases of Paratext style sheets, etc., without having to
>       customize them for our own dialect of SF (like we used to do with
>       PNG SFM).
> 
>   15.
> 
>       There is no problem encoding any of the common variants in
>       versification.
> 
>   16.
> 
>       USFM is mostly backward compatible with prior SF dialects,
>       separating data with the same granularity. In most cases, updating
>       to USFM is a simple matter of a few consistent changes of markers.
>       In the unlikely event you would want to do the reverse
>       transformation, that is easy, too.
> 
> 
>     What I Don't Like About USFM
> 
>    1.
> 
>       The current version of USFM as I read it and as implemented in
>       Paratext is ambiguous with respect to the end point of character
>       styles in some cases. I have given the nitty-gritty details to the
>       interested parties, and expect a wise resolution. In the meantime,
>       I have a work-around involving USFX for those cases where I
>       need it.
> 
>    2.
> 
>       USFM is not XML, so it can't be used where XML is required, such
>       as direct embedding in WordML or as input for an XSL
>       Transformation. My work-around for this is a direct conversion of
>       USFM to XML using the USFX schema, which has a very simple mapping
>       of XML elements to backslash codes in USFM, and which can
>       represent all the same data with no loss of information in the
>       conversion from USFM to USFX. USFX is documented at
>       http://ebt.cx/usfx/.
> 
>    3.
> 
>       USFM does not support footnote range start tags for easy hyperlink
>       generation, but most SIL members would never miss this function.
> 
> 
>     What I Like About OSIS
> 
>    1.
> 
>       It is XML.
> 
>    2.
> 
>       It seems to have at least theoretical support by a wide
>       representation of interested parties, and seems to have some
>       capable salesmen working to establish it as a standard.
> 
>    3.
> 
>       USFM data can be converted to OSIS automatically if you accept
>       some modifications to the OSIS documented standard, and if you
>       don't mind adding some metadata from other sources. It is a little
>       awkward, and may involve loss of some metadata, but it is possible.
> 
>    4.
> 
>       OSIS documents can be converted to USFM if you can accept some
>       potential loss of data, in the cases where either the quotation
>       punctuation rules are simple or where the generator of that text
>       modified OSIS to make lossless conversion possible.
> 
>    5.
> 
>       It allows drafting of Bible texts marking only the beginning and
>       end of quotations, without having to manually adjust punctuation
>       for nesting level and open quote reminders at stanza and paragraph
>       breaks when appropriate for a particular language and style;
>       promising that some process will later supply the actual punctuation.
> 
> 
>     What I Don't Like About OSIS
> 
>    1.
> 
>       The quotation and speech markup is incomplete with respect to
>       multiple languages and styles, making it impossible to be sure
>       that OSIS readers would generate and display the correct quotation
>       punctuation for a given translation without extra external
>       information. OSIS does not define or provide a way of providing
>       that extra information, nor is it obvious how that information
>       should be supplied. Therefore, OSIS files are not self-contained
>       with respect to all important Scripture meaning-based data like
>       USFM is.
> 
>    2.
> 
>       The latest documentation I read on OSIS indicated that it was
>       improper to put quotation punctuation directly in the text,
>       instead requiring it to be converted to markup-- a process that is
>       difficult, if not impossible to do automatically, especially
>       without detailed language-specific information.
> 
>    3.
> 
>       OSIS Scripture files are not self-contained with respect to all of
>       the meaning-based markup of the text, unlike USFM, except in some
>       simple cases.
> 
>    4.
> 
>       USFM and legacy SF texts cannot be fully automatically converted
>       to fully conformant OSIS with respect to quotation handling
>       without some serious manual intervention or language-specific
>       programming.
> 
>    5.
> 
>       OSIS has no mechanism for encoding “red letter” editions of Bibles
>       other than <q> tags, and those could be interpreted by OSIS
>       readers to mean that punctuation should be inserted, even if the
>       target language and style forbids such insertion.
> 
>    6.
> 
>       OSIS takes the control of quotation punctuation out of the hands
>       of the translators and gives it to the programmers who write the
>       programs that interpret the OSIS.
> 
>    7.
> 
>       OSIS does not support footnote range start tags for easy hyperlink
>       generation.
> 
>    8.
> 
>       Handling of minor variations in versification is awkward in OSIS.
>       Older attempts at documenting OSIS made a stab at handling this,
>       but currently published documentation doesn't even address this issue.
> 
>    9.
> 
>       OSIS parsing is unnecessarily complex mostly due to the fact that
>       it does not handle the overlapping of book/chapter/verse,
>       quotations, and book/section/paragraph or stanza/verse/line
>       hierarchies of Scripture texts well. It really has multiple ways
>       of handling these, and OSIS readers have to deal with all of them,
>       adding unnecessary complexity.
> 
>   10.
> 
>       Start/end tag matching identifiers are used where they really
>       wouldn't be required, and add unnecessary complexity to OSIS
>       generation. This isn't a big deal for program-generated OSIS, but
>       it is probably enough all by itself to push the complexity past
>       what most OWLs can handle error-free for manual OSIS generation
>       with a text editor.
> 
>   11.
> 
>       There is a fair amount of ambiguity in the OSIS standard, leading
>       to doubts about reliable compatibility between different software
>       products using OSIS to interchange data.
> 
>   12.
> 
>       The current OSIS standard is not easy to find on the OSIS web
>       site, and the documentation that is there is downlevel.
> 
>   13.
> 
>       OSIS has inadequate software support for drafting, checking, and
>       publishing Scriptures.
> 
>   14.
> 
>       I have yet to see reliable converters between OSIS and USFM. (I
>       have written an OSIS writer myself, but it was impossible to
>       complete without “cheating” on the OSIS standard a little, making
>       modifications that the OSIS committee seems to be unwilling to make.)
> 
>   15.
> 
>       The unnecessary complexity of OSIS means that software written to
>       read and write it will be more expensive, take longer to write, and
>       probably contain more bugs than software written to a simpler
>       standard, even though a simpler standard could do anything OSIS
>       could do.
> 
>   16.
> 
>       The OSIS schema I used to program to when testing its suitability
>       could not handle simple things like supplied text (KJV italics)
>       within a Psalm title.
> 
>   17.
> 
>       OSIS is too complex to embed in WordML along with working typeset
>       text.
> 
>   18.
> 
>       OSIS could be made usable with some minor modifications, but there
>       is no indication that those modifications would ever be made.
> 
>   19.
> 
>       OSIS could never be made simple enough to be elegant and to save
>       on software development costs without sacrificing backward
>       compatibility. To really fix it, it would be better to replace it
>       and provide a conversion tool for legacy text. This, in turn,
>       raises doubts about OSIS' suitability as an archival format.
> 
>   20.
> 
>       In an environment where there has been a large perceived need for
>       an XML Scripture file interchange standard, OSIS has been around
>       for a very long time (in Internet years) without producing a
>       significant following among software developers or Bible
>       translators. There are a couple of notable exceptions (like The
>       Sword Project), but even then, I think OSIS significantly slowed
>       development on that project.
> 
>   21.
> 
>       The mere thought that OSIS would be useful to us in the field with
>       the current set of support tools is laughable due to the overly
>       complex nature of that schema. OSIS is too complex for competent
>       programmers to fully grasp, let alone my typesetting staff.
>       Defining a “best practices” subset of OSIS is not sufficient to
>       fix this problem.
> 
>   22.
> 
>       I find some of the tools provided so far for OSIS editing to be
>       intimidating from a security and usability standpoint. For
>       example, I'm not willing to even test the OSIS editor Word 2003
>       plugin on a production machine because of the way it uses macros.
> 
>   23.
> 
>       Given all of the above, I consider OSIS to be dangerous, in that
>       it is consuming resources better applied elsewhere and
>       discouraging people from looking at alternatives.
> 
> 
>     What I Like About USFX
> 
>    1.
> 
>       All of the good things I said about USFM apply, because it is
>       essentially USFM converted straight to XML, and because XML is
>       also a simple text format.
> 
>    2.
> 
>       USFX is XML, so software tools like XSLT and XML parsing library
>       functions work with it.
> 
>    3.
> 
>       USFX is simple enough to embed in WordML.
> 
>    4.
> 
>       USFX extends USFM to allow generation of quotation punctuation
>       from markup, but does so in a way that keeps control of that
>       punctuation in the hands of the translators, not programmers who
>       don't know the language. It also provides a mechanism missing in
>       OSIS to allow Scripture file parsers that don't know the rules for
>       generating punctuation for a particular translation to just leave
>       in place what has already been generated. USFX readers with such
>       knowledge also know what generated punctuation to replace, place,
>       or remove when reprocessing an edited input file. (OSIS has no way
>       to do that, at least not without nonstandard extensions.)
> 
>    5.
> 
>       USFX extends USFM to get rid of the character style end ambiguity
>       of USFM.
> 
>    6.
> 
>       USFX could readily be extended to include elements of OSIS that
>       are not in USFM, like full Dublin Core metadata, if anyone cared
>       to make that an option. Alternatively, you could make a document
>       with mixed schemas, and just use DC + USFX.
> 
>    7.
> 
>       USFX is easy to convert to or from USFM with no loss of
>       information. Converters exist that work on Windows XP, Mac OS X,
>       and Linux. Even though USFX has virtually no following, now, USFM
>       is the most conservative, safe format to use for Scripture
>       processing, interchange, and archiving. Therefore, USFX inherits
>       USFM's ease of conversion from legacy SF texts.
> 
> 
>     What I Don't Like About USFX
> 
>    1.
> 
>       I invented it, first as an internal XML schema to be used within
>       the Onyx project, so people might think I am just rejecting OSIS
>       for NIH reasons or rabble rousing. (Actually, I tried to use OSIS
>       directly first, but that turned out to be a time sink for many
>       reasons, some of which are listed above.)
> 
>    2.
> 
>       USFX hasn't been subjected to much use, shake-out, and comment,
>       yet, so it still is probably shaky as an archiving format.
>       Therefore, if USFX is used, it should also be converted to USFM,
>       and the USFM copy stored alongside it for archiving.
> 
> 
>     Other Bible XML Schemas
> 
> Some other options exist. Some, I have looked at. Some, I have not. How 
> sure do you want to be about the set of railroad tracks you lift your 
> locomotive onto?
> 
> 
>     What I Recommend
> 
>    1.
> 
>       Don't legislate OSIS as THE XML Scripture standard to use within
>       SIL or any of the Forum of Bible Translation Agencies. Please. It
>       isn't like we could actually use it, right now, anyway, because of
>       its flaws and lack of adequate software tool support.
> 
>    2.
> 
>       Look for better alternatives in XML Scripture encoding schemas.
>       Consider USFX, or better yet, improve on USFX or replace it with
>       something better. Don't waste more time on OSIS, except to study
>       what good ideas from it might be transferred to a better schema.
> 
>    3.
> 
>       Do not abandon USFM as a Scripture drafting, processing,
>       typesetting, and archiving standard until you have something
>       better, and then only if it is easy to fully automatically convert
>       from USFM to that new standard.
> 
>    4.
> 
>       Convert any Scripture texts that have been produced or archived in
>       OSIS back to something better, like USFM.
> 
> 
>     Further Reading
> 
> http://ebt.cx/usfx/Bible-encoding.htm
> 
> http://ebt.cx/usfx/
> 
> 