[sword-devel] Sword OSIS quotation mark handling question
DM Smith
dmsmith555 at yahoo.com
Mon Apr 30 15:02:54 MST 2007
Kahunapule Michael Johnson wrote:
> DM Smith wrote:
>> Kahunapule Michael Johnson wrote:
>>
>>> How does the Sword project handle display of OSIS text quotations when:
>>> 1. the <q> or <speech> element is used without a marker attribute,
>>>
>>>
>> The speech element is not handled, except to process its content. It is
>> as if the element were not in the text at all. I think the speech
>> element is to indicate the speaker, not that what's said is a quote. I
>> won't mention the element <speech> below.
>>
> OK. I have no need to generate the <speech> element, as there is no
> USFM equivalent, so I'll ignore it, too. :-)
>> Assuming that the module's conf does not have osisQToTick=false (i.e. it
>> defaults to true when not present), then the level attribute determines
>> the quotation mark that will be used, alternating double quote and then
>> single quote. If no level attribute is present, then it uses a double quote.
>>
>> It will use the same mark when it gets to </q>.
>>
> In that case, would open quote reminders be inserted at paragraph and
> stanza beginnings automatically, or would that require a cQuote
> milestone to make happen? (I'm just curious. Normally, I'm interested
> in just making sure this doesn't happen, since the quotation
> punctuation is already fully specified, and it may not conform to
> current English usage. However, in the hypothetical case where someone
> wanted this to happen, I'm curious how it would be done.)
It requires a cQuote milestone to make it happen.
>> The same holds true when milestoned versions of <q> are used, except
>> that <q eID="xxx"/> elements will not cause the code to look at the
>> opening <q sID="xxx"/> for a marker attribute. Instead, it will use the
>> end element's own marker attribute, or its lack, to determine what to output.
>>
> So in the milestone elements, markers may vary. That is actually good,
> since sometimes quotes are introduced with an em dash and close with a
> newline, or some other asymmetrical case.
They can be anything you want them to be, even being more than a single
character.
>> However, if osisQToTick=false, no quotation mark is used.
>>
> So osisQToTick=false is essentially equivalent to putting a marker=""
> attribute on all <q> elements?
Yes.
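
The rules discussed so far can be summarized in a minimal Python sketch. This is not the Sword engine's code; the function name and parameters are hypothetical. It assumes that odd levels map to a double quote and even levels to a single quote, and that an explicit marker attribute takes precedence over osisQToTick (the thread does not say which one wins).

```python
from typing import Optional


def quote_mark(level: Optional[int] = None,
               marker: Optional[str] = None,
               osis_q_to_tick: bool = True) -> str:
    """Return the text to emit for one <q> start or end element.

    Hypothetical helper: each element (container tag or sID/eID
    milestone) is looked at on its own, as described in the thread.
    """
    # A marker attribute, even an empty one, is used as is.
    # Assumption: this holds regardless of osisQToTick.
    if marker is not None:
        return marker
    # osisQToTick=false: no quotation mark is generated.
    if not osis_q_to_tick:
        return ""
    # No level attribute: double quote.
    if level is None:
        return '"'
    # Assumption: level 1, 3, ... is double; level 2, 4, ... is single.
    return '"' if level % 2 == 1 else "'"
```

Because the end element is evaluated independently of the start element, an asymmetric pair simply means calling the function with a different marker for the sID and eID milestones.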
>>> 2. the <q> or <speech> element is used with a marker attribute,
>>>
>>>
>>
>> When the marker attribute is present, it is used.
>>
> Good. :-)
>>> 3. no <q> or <speech> elements appear, or
>>>
>>>
>> Then, as far as Sword is concerned, the text is not in a quote.
>>
> OK... what, exactly, does that mean? Does that make a difference for
> anything besides the option of rendering Words of Jesus in red (or
> some other alternate color) for display? Normally, the point of
> knowing if something is in a quote or not is to display the quotation
> marks correctly, but if there are no quotation marks to display (or
> they are already in the text in whatever way is appropriate for that
> language), then Sword doesn't actually need to "know" when something
> is a quote or not, does it? Or is there some search feature or
> function that I'm not aware of that would use such knowledge?
Maybe I didn't understand 3. If a verse or passage is rendered and a
<q>, <q/> or </q> is not found, then there is no way of knowing that
verse or passage is in a quote.
The Sword engine does not know and the front-ends don't try to figure it
out.
>>> 4. quotation punctuation (“, ‘, ’, ”, «, », —, newline, etc.) appears
>>> outside of <q> or <speech> elements (i. e., not in a marker attribute)?
>>>
>>>
>> Any punctuation in the text is produced as is.
>>
> This is good. Very good. :-)
Some frontends might not be able to handle it.
>> Another feature of OSIS is <milestone type="cQuote" marker="xxxx"/>
>> This is used for a continuation quote. (substitute xxxx with the
>> appropriate quote mark)
>>
> This is good to know. I regard this (or something like it) as an
> essential feature if all quotation marks are going to be put in markup.
>> Words of Christ (WoC) can be indicated by adding who="Jesus" to the <q>
>> container element or to both the milestone elements. In the KJV, ESV and
>> upcoming NASB modules, the WoC are marked on a per-verse basis, using the
>> container form of <q>, with marker="".
>>
> This is an interesting concept-- and one that is helpful to me. You
> see, I thought that marking WoC per verse was bad OSIS the way I read
> the documentation, but it sure makes conversion from USFM (which
> actually demands that sort of markup) easier (because I don't have to
> discard adjacent end + start pairs with no actual text in between,
> just a verse marker), and it also makes display on a
> verse-by-verse basis (like Sword does) easier if you are working from
> raw OSIS. The same technique would be useful for translating the USFM
> \qt ...\qt* markup (which is marked verse-by-verse to indicate OT
> quotes in the NT) to <q marker="" who="OT" sID="somethingunique"/>...<q
> marker="" who="OT" eID="somethingunique"/>. If you regard this as
> acceptable, then I'll just embrace it quickly before anyone objects. :-)
We have had our discussions about this. There are front-end problems
with marking WoC only at the start and end of the whole speech:
For systems that put each verse in an HTML table cell (as swordweb does
in parallel view), a verse that has a WoC end quote but no begin
quote will not display properly.
For Matt 5-7, chapter 6 contains neither the start nor the end of the
quote, so displaying it in any frontend will not show it in "red".
>
> OSIS is very flexible, and there seem to be many reasonable ways to
> interpret how Scriptures should be encoded. At this point, there are
> so many ideas out there, I would like to just start with one goal:
> encoding OSIS texts from USFM in such a way that Sword displays them
> properly. If that works, then there is a good chance the resulting
> OSIS will be of use to others, as well.
>
> Would it be too weird to separate q elements intended for replacing
> punctuation (with marker specified) from those used for what is
> essentially a character style (i. e. WoC)? Like <q marker="“"
> sID="aoeu"/><q marker="" sID="qjkx" who="Jesus"/> (actual quotation)
> <q marker="" eID="qjkx" who="Jesus"/><q marker="”" eID="aoeu"/>, where
> the actual quotation may span several verses, and the inside set of
> markers may be ended and restarted with each verse?
I'm not sure I understand. The important thing is to test it in a Sword
application to see if it does what you want it to. If it does not, it
might require a change to the Sword engine or it might be simpler to
change your transformer.
You can take a look at the XML for the KJV here:
http://www.crosswire.org/~dmsmith/kjv2006/sword/kjvxml.zip
It is a good example of how to do WoC so that all Sword frontends can
handle it.
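
As an illustration of the per-verse approach, here is a rough Python sketch that turns USFM \wj ...\wj* spans inside a single verse into container <q> elements with who="Jesus" and marker="". The function name is hypothetical, and it assumes the verse text has already been split out of the USFM and contains no other markers that need handling.

```python
import re
from xml.sax.saxutils import escape

# \wj, one whitespace character, the words, then \wj* (non-greedy).
WJ_SPAN = re.compile(r"\\wj\s(.*?)\\wj\*", re.DOTALL)


def wj_to_osis(verse_text: str) -> str:
    """Wrap each \\wj ...\\wj* span of one verse in a container <q>.

    Quotation punctuation in the text is passed through untouched;
    marker="" keeps Sword from generating any marks of its own.
    """
    out = []
    pos = 0
    for match in WJ_SPAN.finditer(verse_text):
        # Text before the span, escaped for XML.
        out.append(escape(verse_text[pos:match.start()]))
        out.append('<q who="Jesus" marker="">'
                   + escape(match.group(1)) + "</q>")
        pos = match.end()
    # Remaining text after the last span.
    out.append(escape(verse_text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    print(wj_to_osis("Jesus said, \\wj \u201cFollow me.\u201d\\wj*"))
```

Since every verse gets its own complete <q>...</q> pair, a frontend that renders one verse at a time never sees an end without a start.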
>>> I want to (1) ensure that Bible texts are displayed correctly, and (2)
>>> minimize the amount of manual labor necessary to make #1 happen.
>>>
>>> It should not be necessary to do any manual editing of Bible source
>>> texts in well-formed Unicode USFM to create a valid Sword module. (USFM
>>> or something close to it is the format in which a very large number of
>>> minority-language Bibles exist.) In USFM, quotation punctuation, if any,
>>> is in the text of the document, with no special markup. In an informal
>>> extension to USFM, sometimes << is used for “, < for ‘, etc. (A space is
>>> required to disambiguate “‘ and ‘“.) Speaking of ambiguity, apostrophe,
>>> closing single quote, and (in some languages) glottal stop all use the
>>> same character. This ambiguity, coupled with language and style
>>> considerations, seems to be a serious problem in automatically
>>> converting from either GBF or USFM to OSIS, in general.
>>>
>>>
>> I have recently written a quote recognizer in C++. I did find that an
>> apostrophe is potentially ambiguous, but in the source I was working with,
>> it was not an issue.
>>
>> Fortunately, my input uses ` for a single quote start and ' for an end
>> quote. This made disambiguation significantly easier.
>>
>> If you wish, I can send you the routine.
>>
> I already have some LGPL C# code that does a reasonably accurate job
> of recognizing quotation marks in English text that I use for checking
> quotation-mark balancing. It doesn't work very well for other
> languages, because it uses some English-specific rules to disambiguate
> apostrophes and closing single quotes, and doesn't even handle the
> case where the same marker is used for glottal stop. (The latter is
> bad practice in Unicode, but some people do it anyway.) Does your
> quote recognizer work for non-English Bibles with different writing
> systems and different punctuation rules?
I don't know. It does work for Spanish. I'll send you my code and you
can decide.
>>> I'm wondering if I should target OSIS or GBF as a target format for a
>>> converter I'm writing, and also working on updating the dialect of OSIS
>>> that the World English Bible and HNV are distributed in. While I'm not
>>> in favor of dropping support for GBF, yet, I'm not very thrilled about
>>> the idea of putting any new work into supporting it, either. However, if
>>> I can't make an OSIS module without a lot of manual labor, any
>>> reasonable alternative is worth considering.
>>>
>>>
>> Remembering your earlier posts about OSIS's lack of quotation support, I
>> think I can now say that it provides you the level of control that you
>> wish. Having done three modules myself, I think that OSIS 2.1.1 is
>> sufficient for Bible texts.
>>
>> So, I'd suggest OSIS.
>>
> Indeed, it looks like I have at least two ways to get the level of
> quotation support I want: (1) always put quotation punctuation in
> marker attributes of q elements or cQuote milestone elements and
> specify empty marker attributes when using q just for WoC, or (2) [pause
> to don body armor and start running] always put quotation punctuation
> in the text and use q elements with empty marker attributes just for
> translating USFM \wj ...\wj* and \qt ...\qt* markup on a per-verse
> basis. Option #1 has the major disadvantage of requiring finding all
> of the quotation punctuation in text I may not be able to read, let
> alone understand the grammar of, for conversion purposes. Option #2
> has the disadvantage of potentially offending certain people who have,
> at least so far, held the deep religious conviction that all quotation
> punctuation should live in markup, not the text of the Bible, but it
> has the major advantage of the simplest, fastest conversion possible
> from USFM to OSIS, with no manual labor required for each translation
> (other than making sure the source text is really in Unicode USFM).
> Although option #2 seems like it would work just fine, at least
> functionally if not idealistically, I'm concerned that someone might
> think such texts weren't pure enough OSIS, and not use them. If that
> is the case, then perhaps I really would be better off going back to
> GBF... or just punting on this whole converter and moving on to
> improving my converters to other formats for other Bible study software.
My personal opinion: It is more important to have excellent modules than
to quibble over this.
>
> In the case where the translators have made use of the <<, <, >, >>
> quotation markup option in their SFM, which is actually a fair number
> of them, I would like to convert those to the appropriate q elements
> with markup specifying the normal equivalent of those markings. I'm
> loath to mess with apostrophe/ending single quote disambiguation for
> non-English texts, though. I don't see any benefit to doing so,
> really, but maybe I'm missing something?
I think that it requires a lot of analysis for each language to
determine whether the apostrophe disambiguation worked or not. It may
not be worth the effort.
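
For the <<, <, >, >> case, no apostrophe disambiguation is needed, so the conversion can be mechanical. A minimal Python sketch follows, assuming the quotes are balanced within the text passed in, that the translators used a space to separate adjacent marks, and that "q1", "q2", ... are acceptable IDs (the ID scheme and function name are made up for illustration).

```python
import re
from xml.sax.saxutils import escape

# Try the two-character tokens first so "<<" is not read as "<" "<".
TOKEN = re.compile(r"<<|>>|<|>")
OPEN = {"<<": "\u201c", "<": "\u2018"}   # left double, left single
CLOSE = {">>": "\u201d", ">": "\u2019"}  # right double, right single


def angle_quotes_to_osis(text: str, prefix: str = "q") -> str:
    """Replace <<, <, >, >> with milestone <q> elements.

    Each opening token gets a fresh sID; the matching closing token
    reuses it as eID. The marker attribute carries the punctuation.
    """
    out, stack = [], []
    counter = 0
    pos = 0
    for match in TOKEN.finditer(text):
        out.append(escape(text[pos:match.start()]))
        token = match.group()
        if token in OPEN:
            counter += 1
            qid = f"{prefix}{counter}"
            stack.append(qid)
            out.append(f'<q marker="{OPEN[token]}" sID="{qid}"/>')
        else:
            if not stack:
                raise ValueError("closing quote without an opening quote")
            qid = stack.pop()
            out.append(f'<q marker="{CLOSE[token]}" eID="{qid}"/>')
        pos = match.end()
    out.append(escape(text[pos:]))
    if stack:
        raise ValueError("unclosed quote(s): " + ", ".join(stack))
    return "".join(out)
```

Apostrophes written as ' stay in the text as they are, which sidesteps the per-language analysis. A quote that runs across several calls (for example one call per paragraph) would need the stack kept between calls, which this sketch does not do.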
>
> What do you think?
>
> Michael