[sword-devel] Re: What is markup?

Sat, 20 Mar 2004 16:18:29 +1000

Here is the answer. Putting the quote mark in the text is a 
reversible, lossless encoding for all languages and styles. Putting 
the quote mark in an attribute of a <q ...> element is NOT lossless 
encoding for all languages and styles.

Lossless encoding is like this. I lend you US$100.00 and you write 
down a note to yourself to repay me US$100.00 on payday and you don't 
lose the note. On payday, you read the note, and you hand me back 
exactly US$100.00.

Lossy encoding is like this. I lend you US$100.00 and you write down a 
note to yourself to repay me $100.00 on payday. On payday, you give me 
back HK$100.00. Since one Hong Kong dollar is worth US$0.1283, you 
shortchanged me US$87.17.

Losslessly encoding and decoding
"The 'hasty' brown <<fox>> jumped over the 'soporific' dog's 
backside."
yields
"The 'hasty' brown <<fox>> jumped over the 'soporific' dog's 
backside."

Lossy encoding and decoding
"The 'hasty' brown <<fox>> jumped over the 'soporific' dog's 
backside."
might yield
"The hasty brown fox jumped over the soporific dogs backside."
or it might yield
"The <<hasty>> brown 'fox' jumped over the <<soporific>> dog>s 
backside."
or it might yield
"The 'hasty' brown fox jumped over the 'soporific' dog's backside."
depending on the parameters used in a particular instance. It might 
accidentally give me back the same string I started with.
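
The difference can be sketched in a few lines of Python. This is only an illustration: the function names and the placeholder character are made up for this example, and the "lossy" step stands in for any process that forgets which glyph was in the text and lets a rendering parameter decide later.

```python
import re

ORIGINAL = "The 'hasty' brown <<fox>> jumped over the 'soporific' dog's backside."

# Hypothetical placeholder (a private-use character) marking "some quote mark was here".
PLACEHOLDER = "\ue000"

def encode_lossless(text: str) -> str:
    # The punctuation stays in the text, so encoding is the identity.
    return text

def decode_lossless(encoded: str) -> str:
    return encoded

def encode_lossy(text: str) -> str:
    # Every quote-like mark in this sample is replaced by the same
    # placeholder, so the original glyph is no longer recorded.
    return re.sub(r"<<|>>|'", PLACEHOLDER, text)

def decode_lossy(encoded: str, mark: str) -> str:
    # The decoder has to be told (or guess) which glyph to put back.
    return encoded.replace(PLACEHOLDER, mark)

assert decode_lossless(encode_lossless(ORIGINAL)) == ORIGINAL

print(decode_lossy(encode_lossy(ORIGINAL), ""))
# The hasty brown fox jumped over the soporific dogs backside.
print(decode_lossy(encode_lossy(ORIGINAL), "'"))
# The 'hasty' brown 'fox' jumped over the 'soporific' dog's backside.
```

Neither lossy result equals the original string, and which wrong answer comes back depends entirely on the parameter handed to the decoder.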

Consider, please, the following situation. I desire to encode many 
different Bible translations, in many different languages. Among these 
are languages which use different rules and different characters for 
punctuation marks. Some of them use opening and closing quotation 
marks, and some don't. Some use different punctuation than you use in 
English. Some change the way things are punctuated inside of the 
quotation, and some don't. Some require "reminder" marks of various 
sorts at differing places within the quotation. Some have different 
ways to indicate quotations which carry subtle meanings themselves. I 
want you to guarantee that I can losslessly encode and decode each and 
every one of these translations with "standard" processes, with the 
punctuation always put in the right place in the rendered text. I want 
you to do this without requiring me to supply any additional 
information that is not in the OSIS document.

If I never use <q ...> or <speech ...>, and always put the punctuation 
in place with glyphs representing the correct punctuation in the text 
exactly the same way that I would put any other punctuation or 
alphabetic characters in the text, then I can be assured of that 
working. I can be assured, that is, unless some unthoughtful person 
alters my text by trying to follow your bad recommendation to replace 
all quotation punctuation with <q ...> or <speech ...> elements in 
cases where the punctuation conventions differ from the English of the 
NIV.

The <q ...> and <speech ...> elements are never required to correctly 
render the text, if all of the punctuation, including quotation marks, 
is included in the text in the same manner.

Let me see if I can correctly explain to you why you don't want to do 
a proper lossless encoding of quotation punctuation, and then I will 
propose a solution for both points of view.

First of all, if all texts used the same quotation punctuation rules as 
the NIV, that punctuation could (with a few possible exceptions) be 
automatically and accurately generated from <q ...> and <speech ...> 
elements without n attributes. Therefore, you are probably thinking that doing 
such generation is really effectively lossless most of the time (by 
luck and not by design). Most of your "customers" would probably think 
so, too. After all, it is only languages you don't speak or read that 
need different rules, and a few "odd" English translations such as the 
NASB, ASV, KJV, etc., that don't fit your mold. You summarily shrug 
those off by saying that the publisher and renderer must somehow 
specify these "exceptions" with some kind of rendering style 
information. Therefore, this doesn't seem like a big deal to you. (For 
me, it is a huge issue that will cause me to accept or reject OSIS 
altogether, depending on your response, but I can understand that you 
might not think it is important at all.)

Marking quotations with XML markup to indicate when we are in a 
quotation and who is being quoted allows the process reading the OSIS 
text to "know" when something is a quote or not, and possibly who is 
being quoted. This information can be used as part of the criteria of 
a search, or maybe to influence rendering (e.g., for a red letter 
edition).
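
As a rough sketch of that kind of use, the Python below collects the text lying between a pair of <q sID=.../> and <q eID=.../> milestones for one speaker. The function name, the sample fragment, and the speaker "Alice" are invented for illustration; the sketch assumes a flat fragment with no XML namespace and no nested elements, which a real OSIS document would not guarantee.

```python
import xml.etree.ElementTree as ET

def spoken_by(fragment: str, speaker: str) -> str:
    """Return the text between <q sID=.../> and <q eID=.../> milestones
    whose who attribute equals speaker (flat fragments only)."""
    root = ET.fromstring(fragment)
    open_ids = set()  # sIDs of this speaker's quotes that are still open
    pieces = []
    for child in root:
        if child.tag == "q":
            if child.get("sID") is not None and child.get("who") == speaker:
                open_ids.add(child.get("sID"))
            elif child.get("eID") is not None:
                open_ids.discard(child.get("eID"))
        # Text following a milestone is stored in its tail.
        if open_ids:
            pieces.append(child.tail or "")
    return "".join(pieces)

# Made-up sample; the quote marks are ordinary characters in the text.
SAMPLE = ('<p>Then Alice said, <q sID="q1" who="Alice"/>"text text"'
          '<q eID="q1"/> and left.</p>')

print(spoken_by(SAMPLE, "Alice"))  # "text text"
```

The search needs only the milestones and the who attribute. It never has to touch, generate, or interpret the punctuation that is already in the text.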

Allowing the markup used to indicate quotations to also generate 
punctuation according to NIV rules also makes life easier for people 
working in translations that actually use those rules, because they 
don't have to remember to put in the open quote reminders at the 
beginnings of paragraphs.
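
For readers unfamiliar with the reminder convention, here is a minimal Python sketch of what such a generator would do for a multi-paragraph quotation under the usual English rule: an opening mark at the start of every paragraph, and a closing mark only at the very end. The function name and default marks are assumptions made for this example.

```python
def render_quote(paragraphs, open_mark="\u201c", close_mark="\u201d"):
    """Render a multi-paragraph quotation: repeat the opening mark at the
    start of every paragraph, close only after the last one."""
    if not paragraphs:
        return []
    rendered = [open_mark + p for p in paragraphs]
    rendered[-1] += close_mark
    return rendered

for line in render_quote(["First paragraph.", "Second paragraph."]):
    print(line)
# “First paragraph.
# “Second paragraph.”
```

This rule is exactly the part that cannot be assumed for other languages or for translations that punctuate quotations differently.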

I acknowledge those advantages, and do not wish to deprive you of 
them. However, I insist that you not neglect my favorite "odd" cases. 
If you allow the XML markup to specify opening or closing of 
quotations in such a way that the creator of the OSIS document can 
specify that quotation punctuation be generated or not from the 
markup, then you could still enhance properly punctuated text for 
enhanced search capabilities and rendering red letter editions without 
messing up the punctuation, even if the punctuation used is not NIV 
English standard. In fact, the same rules would work on NIV English 
standard text, too. Take your pick. Either works, with no loss of 
capabilities either way. Only when you deviate from NIV English rules 
do the advantages of the total separation of punctuation generation 
from quotation markup become clear. You actually proposed an 
acceptable solution (using n=""), but you keep trying to tell me that 
it is bad to do for reasons that are not convincing or even logical.
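
To make the proposal concrete, here is a hedged Python sketch of a renderer that follows it. The behavior shown is an assumption based on the n="" idea discussed in this thread, not a statement of what the OSIS schema or any existing tool does: when n is absent the renderer generates its own default marks, and when n is present (even empty) it inserts exactly that string and nothing else. The default marks, the sample fragments, and the lack of namespace handling are all simplifications for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical marks a renderer might generate when n is absent.
DEFAULT_OPEN, DEFAULT_CLOSE = "\u201c", "\u201d"

def render(verse_xml: str) -> str:
    verse = ET.fromstring(verse_xml)
    out = [verse.text or ""]
    for child in verse:
        if child.tag == "q":
            n = child.get("n")  # None if absent, "" if written as n=""
            if n is not None:
                out.append(n)  # use exactly what the encoder supplied
            elif child.get("sID") is not None:
                out.append(DEFAULT_OPEN)
            elif child.get("eID") is not None:
                out.append(DEFAULT_CLOSE)
        out.append(child.tail or "")
    return "".join(out)

# Punctuation left to the renderer.
generated = ('<verse>He said, <q sID="uniqueID"/>text text'
             '<q eID="uniqueID"/></verse>')
# Punctuation kept in the text (here U+00AB and U+00BB), generation suppressed.
in_text = ('<verse>He said, <q n="" sID="uniqueID"/>\u00abtext text\u00bb'
           '<q n="" eID="uniqueID"/></verse>')

print(render(generated))  # He said, “text text”
print(render(in_text))    # He said, «text text»
```

Both documents carry the same quotation markup for searching or red letter rendering. Only the second one guarantees that the marks the translator chose are the marks the reader sees.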

>If I were to encode a Bible at the character level as follows:
><verse osisID="Gen.1.1"><c value='I'/><c value='n'/><space/><c
>value='t'/><c value='h'/><c value='e'/>...</verse>
>
>vs
>
><verse osisID="Gen.1.1">In the...</verse>
>
>Are the characters "In the" any more or less a part of the encoding
>either way?

No, but the first encoding is exceedingly inefficient and ugly. It 
reminds me of HTML email messages generated by spammers. Nevertheless, 
such ugly and inefficient encoding can be lossless.
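
A quick Python sketch shows the round trip. The <c value=.../> and <space/> elements are taken from the hypothetical example quoted above, not from OSIS, and the function names are made up here.

```python
import xml.etree.ElementTree as ET

def encode_chars(text: str) -> str:
    # One empty element per character, as in the quoted example.
    verse = ET.Element("verse")
    for ch in text:
        if ch == " ":
            ET.SubElement(verse, "space")
        else:
            ET.SubElement(verse, "c", value=ch)
    return ET.tostring(verse, encoding="unicode")

def decode_chars(xml_text: str) -> str:
    verse = ET.fromstring(xml_text)
    return "".join(" " if el.tag == "space" else el.get("value") for el in verse)

sample = "The 'hasty' brown <<fox>>"
assert decode_chars(encode_chars(sample)) == sample
print(len(sample), len(encode_chars(sample)))  # the encoded form is far longer
```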

>By using XML you MUST use entities for some characters (<, >, /, ...).
>These are not plain text but rather a place holder for those characters.

Fine. Those encodings are lossless. They are not a problem.
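
For example, using the escape and unescape helpers from Python's standard xml.sax.saxutils module (the sample string is just an illustration):

```python
from xml.sax.saxutils import escape, unescape

text = "The 'hasty' brown <<fox>> jumped & ran."
encoded = escape(text)  # replaces &, < and > with entity references
print(encoded)
# The 'hasty' brown &lt;&lt;fox&gt;&gt; jumped &amp; ran.

# Decoding gives back exactly the characters that went in.
assert unescape(encoded) == text
```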

>Most encoders are satisfied to logically represent the start and end
>quote marks with the <q> element it self and let the rendering process
>choose the glyph to be rendered.

I am not "most encoders," but I am content to let them do what they 
want. Let them trust the rendering process to insert the correct 
punctuation if they are using NIV English rules.

> The point you bring is that there are
>cases where this is not sufficient, because not all the information the
>translator intended can be represented with this more simplistic model.

Correct.

>What I suggested with the use of the "n" attribute was that rather than
>simply encoding a <q> element that records the start and end of a quote
>(and having that character to render be up to the rendering process), we
>could also allow the option for the encoder to specify that a specific
>character should be used rather than leaving it up to the rendering
>process.

That is a small step in the long journey, but a step in the right 
direction. You still haven't dealt with open quote reminders within a 
quote. To do that unambiguously, you would have to insert additional 
markup at the points of insertion, and then you would be back to 
something that looks kind of like your lame example of encoding one 
letter per XML element.

>The thing that is troubling with
><q n="" sID="uniqueID"/>"text text"<q n="" eID="uniqueID"/>
>is that you have said that there is a quote that
>has no punctuation to delimit and that within that quote there is a
>character ["] that is simply a character and DOES NOT carry the meaning
>that a quote is starting or ending but rather that there is a word
>["text] at the first of the quote and another word [text"] at the end of
>the quote.

I interpret the same example slightly differently than you, and I see 
no contradiction or troubling features to it at all. The <q n="" 
sID="uniqueID"/> tells the reading process that this is the beginning 
of a quotation and that it is not permitted to insert any punctuation 
because of this opening of a quote-- not here, and not at the 
beginning of any paragraph within the quote. The opening and closing 
quotation marks surrounding "text text" are not for the computer's 
benefit. They are for the benefit of the people reading the text in 
their own language. The correct punctuation may well not be the double 
quotes used, or the typographic versions of the same, but may be 
Unicode U+00AB and U+00BB or some other marks, but they are there for 
the benefit of the human reader, not the computer. If the computer 
process thinks that the punctuation is part of a word next to it, that 
doesn't matter if all it needs to know about words is that it must not 
break words in two at line boundaries. Of course, a more intelligent 
reading process could recognize that it is punctuation from a Unicode 
database, and separate it from the word, so that it can accurately 
compile a concordance of the words in the current Bible translation. 
This is not just something that has to do with quotation marks, but 
all punctuation. Of course, the <q n="" eID="uniqueID"/> tells the 
reading process that the quote has ended, and that it must not put in any 
punctuation in honor of this event. In this case, the whole point of 
the <q ...> markup is to "tell" the computer, not the people, where a 
quote is. In this particular case, the <q ...> markup is probably 
pointless unless you add who="whoever" parameters and use this feature 
for searching the Scriptures by speaker, or in the case of Jesus' 
direct quotes, rendering a "red letter" edition.
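
The "more intelligent reading process" mentioned above can be sketched with Python's standard unicodedata module, which exposes the general category of each character from the Unicode database. Categories beginning with "P" are punctuation (U+00AB is Pi, U+00BB is Pf, and the ASCII double quote is Po). The function names and the whitespace-based tokenizing are assumptions for this example. Real concordance building would need language-specific word breaking.

```python
import unicodedata
from collections import Counter

def strip_punctuation(token: str) -> str:
    """Remove leading and trailing punctuation, as classified by the
    Unicode general category, leaving word-internal marks alone."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]

def concordance(text: str) -> Counter:
    words = (strip_punctuation(t).lower() for t in text.split())
    return Counter(w for w in words if w)

print(concordance("\u00abtext text\u00bb"))  # Counter({'text': 2})
print(concordance('"text text"'))            # Counter({'text': 2})
```

The same routine handles commas, periods, and any other punctuation in the text, so nothing here is special to quotation marks, and nothing requires the marks to be moved into markup.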

In short, if you fully support lossless encoding of less common cases 
(many of which are in my bookshelf and in archives in SFM files where 
I work) without denigrating or deprecating that solution in any way, 
then I will be able to continue to support and use OSIS. If not, I 
have alternatives that I will use instead.

Did I mention that I want lossless encoding for any Bible translation 
for any living, written language on earth? I do. I will not accept 
anything less. Not one jot or tittle may go missing or be inserted 
where it does not belong.

Your friend and adversary in this iron-sharpening contest,

Michael
former OSIS supporter