[sword-devel] Re: What is markup?
Michael Paul Johnson
sword-devel@crosswire.org
Sat, 20 Mar 2004 16:18:29 +1000
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
At 02:25 20-03-04, Todd Tillinghast wrote:
>Michael,
>
>I am trying to understand why you think by putting quote marks "in
>the text" rather than in an attribute makes the quote mark any more
>or less a part of the "Bible text".
Here is the answer. Putting the quote mark in the text is a
reversible, lossless encoding for all languages and styles. Putting
the quote mark in an attribute of a <q ...> element is NOT lossless
encoding for all languages and styles.
Lossless encoding is like this. I lend you US$100.00 and you write
down a note to yourself to repay me US$100.00 on payday and you don't
lose the note. On payday, you read the note, and you hand me back
exactly US$100.00.
Lossy encoding is like this. I lend you US$100.00 and you write down a
note to yourself to repay me $100.00 on payday. On payday, you give me
back HK$100.00. Since one Hong Kong dollar is worth US$0.1283, you
shortchanged me US$87.17.
Losslessly encoding and decoding
"The 'hasty' brown <<fox>> jumped over the 'soporific' dog's
backside."
yields
"The 'hasty' brown <<fox>> jumped over the 'soporific' dog's
backside."
Lossy encoding and decoding
"The 'hasty' brown <<fox>> jumped over the 'soporific' dog's
backside."
might yield
"The hasty brown fox jumped over the soporific dogs backside."
or it might yield
"The <<hasty>> brown 'fox' jumped over the <<soporific>> dog>s
backside."
or it might yield
"The 'hasty' brown fox jumped over the 'soporific' dog's backside."
depending on the parameters used in a particular instance. It might
accidentally give me back the same string I started with.
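A minimal sketch of the distinction, in Python. The function names and the quote-stripping rule are illustrative assumptions, not part of OSIS or any SWORD tool. The lossless path leaves every glyph in the text. The lossy path drops the quotation punctuation at encode time, so no decoder can know which marks to restore.

```python
import re

SAMPLE = ("\"The 'hasty' brown <<fox>> jumped over the 'soporific' dog's "
          "backside.\"")

# Hypothetical rule: treat ", ', << and >> as "quotation punctuation".
QUOTE_MARKS = re.compile("<<|>>|[\"']")


def encode_lossless(text):
    # Punctuation is stored as ordinary characters of the text.
    return text


def decode_lossless(encoded):
    # Nothing has to be regenerated, so nothing can be regenerated wrongly.
    return encoded


def encode_lossy(text):
    # Punctuation is removed, on the assumption that a renderer will
    # put "the right" marks back later.
    return QUOTE_MARKS.sub("", text)


def decode_lossy(encoded):
    # Without extra information the decoder cannot tell which marks
    # were removed (note that the apostrophe in "dog's" went too).
    return encoded


print(decode_lossless(encode_lossless(SAMPLE)) == SAMPLE)  # True
print(decode_lossy(encode_lossy(SAMPLE)))
```

A decoder that guesses a style may sometimes reproduce the original, but only by accident, as described above.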
Consider, please, the following situation. I desire to encode many
different Bible translations, in many different languages. Among these
are languages which use different rules and different characters for
punctuation marks. Some of them use opening and closing quotation
marks, and some don't. Some use different punctuation than you use in
English. Some change the way things are punctuated inside of the
quotation, and some don't. Some require "reminder" marks of various
sorts at differing places within the quotation. Some have different
ways to indicate quotations which carry subtle meanings themselves. I
want you to guarantee that I can losslessly encode and decode each and
every one of these translations with "standard" processes, with the
punctuation always put in the right place in the rendered text. I want
you to do this without requiring me to supply any additional
information that is not in the OSIS document.
If I never use <q ...> or <speech ...>, and always put the punctuation
in place with glyphs representing the correct punctuation in the text
exactly the same way that I would put any other punctuation or
alphabetic characters in the text, then I can be assured of that
working. I can be assured, that is, unless some unthoughtful person
alters my text by trying to follow your bad recommendation to replace
all quotation punctuation with <q ...> or <speech ...> elements in
cases where the punctuation conventions differ from the English of the
NIV.
The <q ...> and <speech ...> elements are never required to correctly
render the text, if all of the punctuation, including quotation marks,
is included in the text in the same manner.
Let me see if I can correctly explain to you why you don't want to do
a proper lossless encoding of quotation punctuation, and then I will
propose a solution for both points of view.
First of all, for texts that use the same quotation punctuation rules
as the NIV, the punctuation can (with a few possible exceptions) be
automatically and accurately generated from <q ...> and <speech ...>
elements without n attributes. Therefore, you are probably thinking that doing
such generation is really effectively lossless most of the time (by
luck and not by design). Most of your "customers" would probably think
so, too. After all, it is only languages you don't speak or read that
need different rules, and a few "odd" English translations such as the
NASB, ASV, KJV, etc., that don't fit your mold. You summarily shrug
those off by saying that the publisher and renderer must somehow
specify these "exceptions" with some kind of rendering style
information. Therefore, this doesn't seem like a big deal to you. (For
me, it is a huge issue that will cause me to accept or reject OSIS
altogether, depending on your response, but I can understand that you
might not think it is important at all.)
Marking quotations with XML markup to indicate when we are in a
quotation and who is being quoted allows the process reading the OSIS
text to "know" when something is a quote or not, and possibly who is
being quoted. This information can be used as part of the criteria of
a search, or maybe to influence rendering (e.g., for a red letter
edition).
Allowing the markup used to indicate quotations to also generate
punctuation according to NIV rules also makes life easier for people
working in translations that actually use those rules, because they
don't have to remember to put in the open quote reminders at the
beginnings of paragraphs.
I acknowledge those advantages, and do not wish to deprive you of
them. However, I insist that you not neglect my favorite "odd" cases.
If you allow the XML markup to specify opening or closing of
quotations in such a way that the creator of the OSIS document can
specify that quotation punctuation be generated or not from the
markup, then you could still enhance properly punctuated text for
enhanced search capabilities and rendering red letter editions without
messing up the punctuation, even if the punctuation used is not NIV
English standard. In fact, the same rules would work on NIV English
standard text, too. Take your pick. Either works, with no loss of
capabilities either way. Only when you deviate from NIV English rules
do the advantages of the total separation of punctuation generation
from quotation markup become clear. You actually proposed an
acceptable solution (using n=""), but you keep trying to tell me that
it is bad to do, for reasons that are not convincing or even logical.
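The proposal can be sketched as a renderer rule, here in Python. This is a sketch under stated assumptions, not the behavior of any existing OSIS renderer: the default marks, the osisID "Example.1.1", and the sample sentence are made up, the OSIS namespace is omitted for brevity, and open-quote reminders at paragraph starts are not handled. The rule is: if a <q> milestone carries an n attribute, emit exactly that value (so n="" emits nothing); otherwise generate a default mark.

```python
import xml.etree.ElementTree as ET

# Assumed renderer defaults (English-style curly double quotes).
DEFAULT_OPEN, DEFAULT_CLOSE = "\u201c", "\u201d"


def render(verse_xml):
    """Render one verse, generating quote marks only where the markup allows."""
    verse = ET.fromstring(verse_xml)
    parts = [verse.text or ""]
    for child in verse:
        if child.tag == "q":
            n = child.get("n")
            if n is not None:
                # The encoder specified the mark; n="" means "insert nothing".
                parts.append(n)
            elif child.get("sID") is not None:
                parts.append(DEFAULT_OPEN)
            elif child.get("eID") is not None:
                parts.append(DEFAULT_CLOSE)
        # Text following the milestone belongs to the verse either way.
        parts.append(child.tail or "")
    return "".join(parts)


# Punctuation kept in the text, generation switched off with n="".
explicit = ('<verse osisID="Example.1.1">He said, <q n="" sID="q1"/>'
            '\u00abCome\u00bb<q n="" eID="q1"/> and left.</verse>')

# Punctuation left to the renderer.
generated = ('<verse osisID="Example.1.1">He said, <q sID="q1"/>'
             'Come<q eID="q1"/> and left.</verse>')

print(render(explicit))
print(render(generated))
```

In both cases the reading process still learns where the quotation starts and ends. Only the first case guarantees that the marks in the output are the ones in the source.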
>If I were to encode a Bible at the character level as follows:
><verse osisID="Gen.1.1"><c value='I'/><c value='n'/><space/><c
>value='t'/><c value='h'/><c value='e'/>...</verse>
>
>vs
>
><verse osisID="Gen.1.1">In the...</verse>
>
>Are the characters "In the" any more or less a part of the encoding
>either way?
No, but the first encoding is exceedingly inefficient and ugly. It
reminds me of HTML email messages generated by spammers. Nevertheless,
such ugly and inefficient encoding can be lossless.
>By using XML you MUST entities for some characters (<, >, /, ...).
>These are not plain text but rather a place holder for those
>characters.
Fine. Those encodings are lossless. They are not a problem.
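As a quick check of that claim, here is a small Python sketch using the standard library's xml.sax.saxutils helpers. The helper name round_trips and the sample string are illustrative assumptions. Escaping the XML special characters and unescaping them again returns the original string.

```python
from xml.sax.saxutils import escape, unescape


def round_trips(text):
    """Return True if entity-escaping and unescaping reproduces the text."""
    encoded = escape(text)      # & -> &amp;, < -> &lt;, > -> &gt;
    decoded = unescape(encoded)
    return decoded == text


sample = 'The <<fox>> & the "dog"'
print(escape(sample))
print(round_trips(sample))
```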
>Most encoders are satisfied to logically represent the start and end
>quote marks with the <q> element it self and let the rendering
>process choose the glyph to be rendered.
I am not "most encoders," but I am content to let them do what they
want. Let them trust the rendering process to insert the correct
punctuation if they are using NIV English rules.
> The point you bring is that there are
>cases where this is not sufficient, because not all the information
>the translator intended can be represented with this more simplistic
>model.
Correct.
>What I suggested with the use of the "n" attribute was that rather
>than simply encoding a <q> element that records the start and end of
>a quote (and having that character to render be up to the rendering
>process), we could also allow the option for the encoder to specify
>that a specific character should be used rather than leaving it up
>to the rendering process.
That is a small step in the long journey, but a step in the right
direction. You still haven't dealt with open quote reminders within a
quote. To do that unambiguously, you would have to insert additional
markup at the points of insertion, and then you would be back to
something that looks kind of like your lame example of encoding one
letter per XML element.
>The thing that is troubling with <q n="" sID="uniqueID"/>"text
>text"<q n="" eID="uniqueID"/> is that you have said that there is a
>quote that has no punctuation to delimit and that within that quote
>there is a character ["] that is simply a character and DOES NOT
>carry the meaning that a quote is starting or ending but rather that
>there is a word ["text] at the first of the quote and another word
>[text"] at the end of the quote.
I interpret the same example slightly differently than you, and I see
no contradiction or troubling features to it at all. The <q n=""
sID="uniqueID"/> tells the reading process that this is the beginning
of a quotation and that it is not permitted to insert any punctuation
because of this opening of a quote-- not here, and not at the
beginning of any paragraph within the quote. The opening and closing
quotation marks surrounding "text text" are not for the computer's
benefit. They are for the benefit of the people reading the text in
their own language. The correct punctuation may well not be the double
quotes used, or the typographic versions of the same, but may be
Unicode U+00AB and U+00BB or some other marks, but they are there for
the benefit of the human reader, not the computer. If the computer
process thinks that the punctuation is part of a word next to it, that
doesn't matter if all it needs to know about words is that it must not
break words in two at line boundaries. Of course, a more intelligent
reading process could recognize that it is punctuation from a Unicode
database, and separate it from the word, so that it can accurately
compile a concordance of the words in the current Bible translation.
This is not just something that has to do with quotation marks, but
all punctuation. Of course, the <q n="" eID="uniqueID"/> tells the
reading process that the quote has ended, and that it must not insert
any punctuation in honor of this event. In this case, the whole point of
the <q ...> markup is to "tell" the computer, not the people, where a
quote is. In this particular case, the <q ...> markup is probably
pointless unless you add who="whoever" parameters and use this feature
for searching the Scriptures by speaker, or in the case of Jesus'
direct quotes, rendering a "red letter" edition.
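The "more intelligent reading process" mentioned above can be sketched in Python with the standard unicodedata module. The function names are illustrative assumptions, and splitting on whitespace is a simplification that does not suit every writing system. Any leading or trailing character whose Unicode general category starts with "P" (punctuation) is trimmed from a word before it goes into the concordance, whichever quotation marks the translation uses.

```python
import unicodedata


def is_punctuation(ch):
    # Unicode general categories Pc, Pd, Ps, Pe, Pi, Pf, Po are punctuation.
    return unicodedata.category(ch).startswith("P")


def strip_punctuation(token):
    """Trim punctuation from both ends of a token, leaving the inside alone."""
    start, end = 0, len(token)
    while start < end and is_punctuation(token[start]):
        start += 1
    while end > start and is_punctuation(token[end - 1]):
        end -= 1
    return token[start:end]


def concordance_words(text):
    """Split on whitespace and drop tokens that were punctuation only."""
    stripped = (strip_punctuation(t) for t in text.split())
    return [w for w in stripped if w]


print(concordance_words('"text text"'))
print(concordance_words("\u00abIn the beginning,\u00bb he said."))
```

Because only the ends of each token are trimmed, an apostrophe inside a word such as "dog's" stays with the word.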
In short, if you fully support lossless encoding of less common cases
(many of which are in my bookshelf and in archives in SFM files where
I work) without denigrating or deprecating that solution in any way,
then I will be able to continue to support and use OSIS. If not, I
have alternatives that I will use instead.
Did I mention that I want lossless encoding for any Bible translation
for any living, written language on earth? I do. I will not accept
anything less. Not one jot or tittle may go missing or be inserted
where it does not belong.
Your friend and adversary in this iron-sharpening contest,
Michael
former OSIS supporter
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (MingW32)
Comment: http://eBible.org/mpj/gpg.htm
iD8DBQFAW+I1RI/gxxfXR7sRAvvqAJ9trjkbCKeVl7WwdSmiTVGux2xyEQCgxBY6
KnjW9Hq53UJt8vTO0OCH5EM=
=vBud
-----END PGP SIGNATURE-----