[sword-devel] [osis-editors] Re: The death of OSIS?

Steven J. DeRose sderose at acm.org
Thu Aug 12 09:45:02 MST 2004

At 22:09 +1000 2004-08-11, Kahunapule Michael P. Johnson wrote:
>The problem I have with OSIS (at least the version of documentation 
>that I have) is that it does not encode enough information to 
>reliably reconstitute quotation mark punctuation for the range of 
>languages and Bible translations that I work with. It doesn't even 
>cover English properly. The reason is that you state in the 
>documentation that quotations should be marked with <q 
>who="Nameofspeaker" sID="someuniquething">....<q who="Nameofspeaker" 
>eID="someuniquething"> and NOT with the quotation marks. This is OK 
>for SOME situations; to wit: standard English texts using the same 
>quotation punctuation rules as the NIV, and Bible texts in languages 
>that happen to use the same characters and rules for quotation 
>marks. This is NOT OK for other situations; to wit: English texts 
>using different quotation mark styles (like the NASB) or no 
>quotation marks at all (like the KJV). It occurs to me that by just 
>ignoring <q> and <speech> altogether, I could put in the normal 
>quotation punctuation for the given language as Unicode characters 
>in the right places and be happy-- except for two things.

It may well be that we all made mistakes in the design of quotation 
handling in OSIS, but I assure you we considered a much wider range 
of cases than the English NIV or English. Some of us are of US 
origin, but even so I don't think we have any monolinguals among us.

There is a real tradeoff here -- are quotation marks conventional 
ways of marking a discourse phenomenon (let's call it "quotation" to 
keep things simple), or are they part of "the text"? That is not as 
straightforward as you seem to be suggesting. There were no 
quotation marks in the original texts of the Bible, so all the 
quotation marks are products of someone's interpretation.

Nevertheless, we all agree that OSIS markup has to provide enough 
information to get the formatted result that one wants.

Actually, let me clarify that a little: widow and orphan management 
is an important part of high-quality formatting: certainly part of 
"the formatted result that one wants." But surely it shouldn't be 
part of what OSIS encodes. This may seem obvious or trivial, but I 
have heard people criticize OSIS for just this: they look at a 
printed Bible someone produced from OSIS source using some formatting 
tool that doesn't do widowing well, and say "OSIS can't produce a 
good Bible" -- we must always keep in mind that there are at least 
two separate parts involved here: the markup and the engine that 
processes it.

>One is that I want to encode some (but not all) of the Bible texts 
>for "red letter" editions. Actually, I don't really mean to specify 
>that the words of Jesus have to be in red. I just want to mark the 
>direct quotes of Jesus in a way that makes it easy for those who 
>wish to present the Bible text to display the direct quotes of Jesus 
>in red (or some other distinctive way) if they want to. I don't even 
>care if people display Jesus' direct quotes in red or not, but I do 
>care that if they do, the markers are in the right places so that 
>the correct words are marked. I can use <q who="Jesus" 
>sID="book.chapter.verse.0">...<q who="Jesus" 
>eID="book.chapter.verse.0"> for that, but then if I do that for the 
>KJV, will the application reading the OSIS file add quotation marks? 
>If I use OSIS for a language that uses different quotation marks, 
>what will happen? What about open quote reminders at new paragraphs 
>and stanzas? Will they be inserted when they aren't supposed to be?

This is the key point, isn't it? "will the application reading the 
OSIS file add quotation marks?" is not a question that can be 
answered. Which application? Reasonable software for formatting XML 
should do what your style sheets say it should do. Perhaps not all 
software is reasonable, but even most CSS implementations give you 
that much control.

Clearly the KJV and the NIV have different styles for quotations. The 
style sheets you would use to generate printed versions of them 
therefore would differ. They might be completely separate, or just 
differ in a few things, or a very clever stylesheet might even check 
what version it's formatting (by looking at the header) and do the 
appropriate thing for any version it knows about, and a default thing 

By not enshrining punctuation in the text itself, we leave a wider range 
of options available to the translators, publishers, and other 
concerned parties. For example, if I were printing an NIV in France 
for some reason, I might want to use the French chevron-like 
quotation marks (sorry, I forget the name for them just now). No 
problem: tweak the stylesheet. You don't even have to touch 
the text itself -- thus the risk of accidentally messing it up is 
reduced. This is especially important for minority languages, where 
the typesetter probably doesn't know the language, and so cannot 
easily detect if they messed things up.
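
As a rough sketch of that idea (in Python; the event format, the style 
names, and the QUOTE_STYLES table are all made up for illustration and 
are not part of OSIS), the same marked-up text can be rendered with 
different quotation conventions just by choosing a different style entry:

```python
# Hypothetical style table: (open, close) pairs, outermost level first.
QUOTE_STYLES = {
    "en-US": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "fr": [("\u00ab\u00a0", "\u00a0\u00bb"), ("\u201c", "\u201d")],
    "none": [("", "")],  # an edition that prints no quotation marks
}


def render(events, style):
    """Render a list of events into a string.

    Each event is ("text", str), ("q_start",) or ("q_end",).
    The text itself never contains quotation marks; they are
    generated from the chosen style, alternating by nesting depth.
    """
    levels = QUOTE_STYLES[style]
    depth = 0
    out = []
    for event in events:
        kind = event[0]
        if kind == "text":
            out.append(event[1])
        elif kind == "q_start":
            out.append(levels[depth % len(levels)][0])
            depth += 1
        elif kind == "q_end":
            depth -= 1
            out.append(levels[depth % len(levels)][1])
    return "".join(out)


EVENTS = [("text", "He said, "), ("q_start",), ("text", "Follow me."), ("q_end",)]

for name in ("en-US", "fr", "none"):
    print(name, "->", render(EVENTS, name))
```

The marked text (EVENTS) is never edited; only the style entry changes.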

Also, these source files will be processed by many things other than 
formatters. Consider blind users with voice-generation interfaces: 
they won't get quotation marks at all -- but if the system knows 
there is a quote starting, it should be able to signal that to them. 
One system might just say "quote" in whatever the user's language is; 
a better system might generate voice inflections or suprasegmentals 
of some sort to communicate the same thing. Second, consider a search 
engine: it shouldn't have to search for a different pattern of 
specific characters to locate quotes in every language it encounters 
(especially when some patterns are ambiguous).
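
To illustrate the search case, here is a minimal sketch in Python using 
the standard xml.etree.ElementTree module. It assumes a simplified, 
namespace-free fragment with <q> milestones carrying who, sID, and eID 
attributes as discussed in this thread; the sample text, the helper 
names, and the absence of a namespace are assumptions, not a real OSIS 
document.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample; a real OSIS file would differ.
SAMPLE = (
    "<div>"
    '<p>Jesus said, <q who="Jesus" sID="q1"/>Follow me, </p>'
    '<p>and I will make you fishers of men.<q who="Jesus" eID="q1"/>'
    " They left their nets.</p>"
    "</div>"
)


def walk(elem):
    """Yield ("start", element) and ("text", str) in document order."""
    yield ("start", elem)
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:
            yield ("text", child.tail)


def quotes_by(root, speaker):
    """Collect the text between matching sID/eID milestones for a speaker."""
    open_ids = []  # sIDs of this speaker's quotes that are still open
    current = []   # text collected for the quote in progress
    found = []
    for kind, value in walk(root):
        if kind == "text":
            if open_ids:
                current.append(value)
        elif value.tag == "q" and value.get("who") == speaker:
            if value.get("sID"):
                open_ids.append(value.get("sID"))
            elif value.get("eID") in open_ids:
                open_ids.remove(value.get("eID"))
                if not open_ids:
                    found.append("".join(current))
                    current = []
    return found


print(quotes_by(ET.fromstring(SAMPLE), "Jesus"))
```

Nothing here depends on which quotation characters a given language 
uses, and the quote is found even though it crosses a paragraph boundary.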

So, it seems to me we definitely need to have markup in there for 
quotes -- the question then is whether OSIS quote markup provides 
sufficient information to drive a formatter, and if not, what to do 
about it.

>The other problem with controlling quotation punctuation with OSIS 
>and always using markup (i. e. q or speech elements) is that there 
>are not just start and end locations. There are also open quote 
>reminder locations. This gets confusing. Can I specify that a 
>quotation starts at a given location with one character, continues 
>at a paragraph boundary with a different character, then ends with 
>still another character? Would it be OK to use a duplicated sID in a 
>q milestone element to indicate that this is a part of the same 
>quotation, but more punctuation is needed here?

Absolutely agreed. We discussed this at length (Patrick, can we add a 
section with some examples for this in the doc, if we haven't yet?). 
Typically, the placement of quotation reminders is determined by some 
fairly simple rule, that may differ by language, writing system, 
culture, and genre (and probably other factors too). Your example of 
a paragraph boundary is a very common case. In such a case, the 
stylesheet rule for paragraph simply checks whether a quotation is 
open, and if so, issues the appropriate punctuation.
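
A minimal sketch of such a paragraph rule, written in Python rather than 
in any particular stylesheet language. The event format and the 
render_paragraphs name are invented for illustration, and only a single 
nesting level is handled:

```python
def render_paragraphs(paragraphs, open_mark="\u201c", close_mark="\u201d",
                      reminder="\u201c"):
    """Render paragraphs, adding a reminder mark when a quote is still open.

    paragraphs is a list of paragraphs; each paragraph is a list of
    events: ("text", str), ("q_start",) or ("q_end",).
    """
    depth = 0  # number of quotations currently open
    rendered = []
    for events in paragraphs:
        out = []
        # The paragraph rule: if a quotation is still open when a new
        # paragraph starts, issue the reminder punctuation first.
        if depth > 0:
            out.append(reminder)
        for event in events:
            kind = event[0]
            if kind == "text":
                out.append(event[1])
            elif kind == "q_start":
                out.append(open_mark)
                depth += 1
            elif kind == "q_end":
                depth -= 1
                out.append(close_mark)
        rendered.append("".join(out))
    return rendered


PARAS = [
    [("text", "He said, "), ("q_start",), ("text", "First point.")],
    [("text", "Second point."), ("q_end",)],
]

for line in render_paragraphs(PARAS):
    print(line)
```

Passing reminder="" gives the convention with no reminder marks, again 
without touching the marked text.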

This is a valuable approach, because there might well be two 
different groups that share a translation, but live in different 
areas and have become accustomed to different quotation style rules. 
For example, consider a language group from a war-torn country, many of 
whose members have emigrated and ended up in different countries. If you put the 
literal quote characters in the text for one group, you have to go 
and fix it all manually for the other group. If instead you mark the 
quotes via markup and have a stylesheet generate the correct 
characters for display, then you just change that stylesheet, getting 
a uniform change with much less effort.

Does any of us know of a situation where the placement of "reminder" 
punctuation is discretionary? That is, where we have to record it 
because there is no rule, or a rule so complex, that the marks cannot 
reasonably be generated by a stylesheet? (I'm not including making a 
facsimile edition of a copy text including errors).

In my opinion (and that of my OSIS validation code), it would be 
incorrect to use a duplicate sID for this case as the OSIS schema 
stands right now. It could be that there is need to explicitly mark 
paragraph boundaries inside quotes, rather than letting the style 
sheet do the right thing. If you believe so, can you explain it to me 
in more detail? I'm not quite understanding your point here, and I 
very much want to.

*If* there turns out to be such need, then I see a few simple solutions:

a) Allow additional milestones with the same sID (or possibly eID, 
but I like your sID notion better)

b) Create a new empty element for the purpose, say <q-continued> or similar

c) Reserve a 'type' attribute value somewhere to distinguish this case.

If there really is need, you can simulate solution b or c right now 
in OSIS by using a regular milestone and assigning it a special type 
for this purpose. People (namely, the people writing stylesheets for 
you or doing typesetting) might complain unless you could show why it 
is in fact needed -- but if it really is, then it is.
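
For concreteness, a sketch of how a formatter might honor such a typed 
milestone, in Python with the standard xml.etree.ElementTree module. The 
type value "x-quote-reminder" is purely hypothetical (nothing in this 
thread defines it), and the fragment is simplified and namespace-free:

```python
import xml.etree.ElementTree as ET

REMINDER_TYPE = "x-quote-reminder"  # hypothetical type value, not defined by OSIS

# A paragraph that continues a quotation opened in an earlier paragraph.
SAMPLE = (
    '<p><milestone type="x-quote-reminder"/>'
    'and the rest of the speech.<q eID="q1"/></p>'
)


def render_p(p, marks=("\u201c", "\u201d"), reminder="\u201c"):
    """Render one paragraph, emitting punctuation for q and reminder milestones."""
    out = [p.text or ""]
    for child in p:
        if child.tag == "q" and child.get("sID"):
            out.append(marks[0])
        elif child.tag == "q" and child.get("eID"):
            out.append(marks[1])
        elif child.tag == "milestone" and child.get("type") == REMINDER_TYPE:
            # Explicitly recorded reminder location.
            out.append(reminder)
        out.append(child.tail or "")
    return "".join(out)


print(render_p(ET.fromstring(SAMPLE)))
```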

>In short, I consider the placement of quotation punctuation and the 
>selection of characters to be used for quotation punctuation to be a 
>part of the Bible translation text itself, and if any encoding, like 
>OSIS, cannot guarantee that these characters are maintained in their 
>original locations, then that encoding is defective.

Wow. That's interesting. Let me see if I understand it right: So if I 
published an NIV in France (or better, a Francophone country with an 
English-speaking minority population that wants the NIV), and if I 
used chevrons for quotation marks, you would say it's a different 
*translation*, not just a different printing or edition or layout? I 
must admit I have a hard time accepting that.

As for guaranteeing, no encoding can guarantee the result of applying 
software to it. For all the encoding knows, the formatter you're 
using simply throws out all punctuation marks, or even all the text. 
It seems to me that that doesn't make all encodings defective. There 
must be some more limited claim you're trying to get at here, but I 
don't see clearly what it is. Help, please?

It seems to me that the *fact* of something being a quotation is 
clearly part of the translation text, but that the punctuation marks 
(or whatever) used to communicate that are part of the formatting, 
just like the choice of font. I still consider them very important, 
just as I consider the font choice important (printing a Bible in 
Comic Sans, or in 5 pt type, would probably be a very bad thing to 
do); but to me it wouldn't be changing "the text".

Can you explain this further for me if it's central to your point? 
But it seems to me this is not central -- you just want the quotes 
right, right? And that doesn't require anywhere near so strong a 

>Do you see the problem?

I don't think so. Please explain further.

>Now, let me suggest at least two possible solutions that are easy to 
>incorporate into the OSIS standard. First, let me explicitly state 
>what I'm trying to accomplish:
>1. Preserve the current OPTION in OSIS to generate quotation 
>punctuation with markup.
>2. Preserve the OPTION in OSIS to mark quotations by speaker for 
>specialized searches or, in the case of Jesus' direct quotes, to 
>color or present them in some different way.
>3. Add the OPTION to control quotation punctuation precisely for 
>languages and styles that differ from the "usual" in the type and 
>placement locations of quotation punctuation.
>Suggested solution number 1 (recommended):
>Document that any <q> or <speech> element marked with an attribute 
>of n=" " (a blank space) should not be taken as an instruction to 
>insert any quotation mark. Rather, in this case, it should be 
>assumed that the correct punctuation is already in the text as a 
>Unicode character (just like other kinds of punctuation). <q> or 
><speech> elements not so marked would be taken as an instruction to 
>insert quotation punctuation in the manner that the NIV English 
>Bible does, including open quote reminders, and alternating double 
>and single typographic quotes for nested quotes.

I rather like the idea I perceive here -- some signal that the 
punctuation is already in the text. The stylesheet could use this in 
a nicely general way. I don't think it belongs on the 'n' attribute, 
but that's a minor detail.
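
A sketch of how a formatter could use such a signal, in Python with the 
standard xml.etree.ElementTree module. The signal is modeled here as a 
hypothetical type="x-marks-in-text" value on the <q> milestones; that 
name and its placement are assumptions for illustration only:

```python
import xml.etree.ElementTree as ET

LITERAL_FLAG = "x-marks-in-text"  # hypothetical signal, not defined by OSIS

GENERATED = ET.fromstring('<p>He said, <q sID="a"/>Go.<q eID="a"/></p>')
LITERAL = ET.fromstring(
    '<p>He said, <q sID="a" type="x-marks-in-text"/>\u00abGo.\u00bb'
    '<q eID="a" type="x-marks-in-text"/></p>'
)


def render_p(p, marks=("\u201c", "\u201d")):
    """Generate quotation marks unless the q milestone says they are in the text."""
    out = [p.text or ""]
    for child in p:
        if child.tag == "q" and child.get("type") != LITERAL_FLAG:
            if child.get("sID"):
                out.append(marks[0])
            elif child.get("eID"):
                out.append(marks[1])
        # Flagged milestones add nothing; the literal characters are kept as-is.
        out.append(child.tail or "")
    return "".join(out)


print(render_p(GENERATED))
print(render_p(LITERAL))
```

Either way the quotation stays marked, so searches and non-print 
processing can still find it.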

Is there a case, though, where a stylesheet couldn't be reasonably 
expected to generate all the right quotation marks? If a language 
required a different quotation mark depending on the voicing of the 
following consonant, or (worse) the gender of the next noun, that 
would be beyond typical stylesheet mechanisms to do. I don't know of 
any languages where punctuation choice depends on linguistic 
phenomena that aren't already represented by other markup or layout 
(like paragraph breaks). If there are, then we have a clear problem 
to deal with. But given the historical development of writing 
systems, that seems to me really unlikely. Anybody know an exception?

>Suggested solution number 2:
>If for some bizarre reason you are opposed to letting quotation 
>punctuation exist as a normal Unicode character in the text, you 
>could (1) allow the exact character to be used to be specified with 
>its hexadecimal code position in the n attribute of the p or speech 
>element, and (2) define two other elements to specify if open quote 
>reminders are appropriate at new paragraphs and stanzas, and (3) 
>specify what the open quote reminder character should be.

Parts 2 and 3 of this would go in a stylesheet, not in the text; you 
can do that now. If the character(s) were to go in an attribute, they 
could just go there -- no need to code in hex. But I don't think 
there's anything preventing such characters in the text in OSIS now 
-- so long as you do still mark the quotes (which is surely necessary 
for most non-printing processing). I'd have to read the fine details 
of the wording to be certain.

>Suggested solution number 3:
>Make something up-- anything that solves the problem above, and ask 
>me if I think it would work or not.

See above.

>By the way, I would be happy to help you proofread and review the 
>next release of OSIS documentation and schema.

Many thanks! Feedback from people who have actual concrete issues to 
deal with is *very* valuable.

>>Hope you are having a great day!
>I am. It is about my bed time, now...


Steve DeRose -- http://www.derose.net
Chair, Bible Technologies Group -- http://www.bibletechnologies.net
Email: sderose at acm.org  or  steve at derose.net
