<!DOCTYPE html>

<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    On 2023-12-19 04:26, Matěj Cepl wrote:<br>

    <blockquote type="cite"

      cite="mid:CXS7BHXBHYXL.10L3J4OHX6PRS@cepl.eu">

      <pre class="moz-quote-pre" wrap="">On Tue Dec 19, 2023 at 2:17 AM CET, Timothy Allen wrote:

</pre>


      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">2. Apostrophes

In English, the apostrophe used for possession (“the boy’s train”) and 

omission (“don’t let’s start”) is traditionally set with the same 

character used as the closing single quote, so in any non-trivial 

document there will almost certainly be more "closing single quotes" 

than opening single quotes, so it's not worth reporting on.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">Yes, I am aware of it, and I feel very blessed that I don’t

have this problem in Czech. I have no idea what to do with

this without proper syntactic analysis, which is out of the

question. Perhaps, running `re.sub(r'’s\b', '@#s', whole_text)`

and then back, but it seems like a recipe for disaster.</pre>

    </blockquote>

    <p>I think a better solution would be to make the script itself
      aware of whether a closing single quote is acting as a closing
      quote or as an apostrophe. If the closing single quote is
      immediately followed by an alphabetic character (the script should
      be able to test Unicode character classes for this), then it
      should be treated as an apostrophe instead. I don’t know whether
      biblical texts generally use contractions, but your regular
      expression doesn’t handle contractions in general (it only
      matches ’s). Also, I only know English, so I am quite possibly
      missing some edge cases. Some examples:</p>

    <ul>

      <li>This isn’t a closing quote. (‘t’ is an alphabetic character)<br>

      </li>

      <li>“I said, ‘This is a closing quote within a double-quoted
        phrase’”. (‘”’ isn’t an alphabetic character)</li>

    </ul>
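    <p>A minimal Python sketch of the test described above. The
      function name <code>is_apostrophe</code> is made up for
      illustration and is not part of your script; it relies on
      <code>str.isalpha()</code>, which is Unicode-aware in Python 3.</p>
<pre>
```python
def is_apostrophe(text, i):
    """Guess whether the closing single quote at text[i] is an apostrophe.

    Heuristic only: the mark counts as an apostrophe when the very next
    character is alphabetic. str.isalpha() is Unicode-aware, and it
    returns False for the empty string, so end-of-text is handled too.
    """
    next_char = text[i + 1:i + 2]  # empty string when i is the last index
    return next_char.isalpha()


if __name__ == "__main__":
    for sample in ("This isn’t a closing quote.", "phrase’”."):
        print(sample, is_apostrophe(sample, sample.index("’")))
```
</pre>
    <p>One edge case this rule would not catch is the plural
      possessive (“the boys’ train”), where the apostrophe is followed
      by a space and so would still be counted as a closing quote.</p>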

    <blockquote type="cite"

      cite="mid:CXS7BHXBHYXL.10L3J4OHX6PRS@cepl.eu">

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">3. Nested quotations

In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell 

other people that she was Abraham’s sister. In the BSB (and NIV, and 

ESV, and NASB) this results in a triple-nested quotation. In English 

typesetting conventions the outermost quotation gets double-quotes, the 

second level gets single-quotes, and the third level gets double quotes 

again. This causes the script to report an error:

I couldn't immediately think of a way to get around this.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">Me neither. We should probably make an effort at error recovery, so

that the script would continue even after reporting a problem,

but I am not sure how to do that either.</pre>

    </blockquote>

    The other approach would be to check what the counts are upon
    reaching the end of a section. As mentioned below, in English, all
    quotes are implicitly closed by the end of a paragraph, so any
    nonzero counts at the end of a paragraph are OK. But when you
    encounter a closing quote, you can make sure that the most recent
    opening quote is of the same type.<span style="white-space: pre-wrap"> If you store the type of each opening quote on a stack, pop it whenever you encounter a matching closing quote, report an error upon trying to pop an empty stack or upon encountering a mismatched quote, and clear the stack upon reaching a paragraph end, that would provide something useful for English.
</span>
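    <p>Here is a rough Python sketch of that stack idea, not a
      drop-in patch. The names <code>check_quotes</code>,
      <code>OPENERS</code> and <code>CLOSERS</code> are invented for
      illustration, and I am assuming paragraphs are separated by blank
      lines in plain text, which will not be true of the real XML. It
      also skips a closing single quote that is followed by a letter,
      per the apostrophe rule above.</p>
<pre>
```python
# Hypothetical names; maps each opening quote to the closer it expects.
OPENERS = {"“": "”", "‘": "’"}
CLOSERS = set(OPENERS.values())


def check_quotes(text):
    """Return a list of (paragraph_number, offset, message) tuples."""
    errors = []
    # Assumption: a blank line marks a paragraph boundary.
    for para_no, para in enumerate(text.split("\n\n"), start=1):
        stack = []  # a fresh stack per paragraph: open quotes close implicitly
        for pos, ch in enumerate(para):
            if ch in OPENERS:
                stack.append(ch)
            elif ch in CLOSERS:
                # A single closing quote followed by a letter is
                # treated as an apostrophe and ignored.
                if ch == "’" and para[pos + 1:pos + 2].isalpha():
                    continue
                if not stack:
                    errors.append((para_no, pos, "closing quote with nothing open"))
                elif OPENERS[stack[-1]] != ch:
                    # Leave the stack untouched and keep scanning, so
                    # one error does not stop the whole run.
                    errors.append((para_no, pos, "mismatched closing quote"))
                else:
                    stack.pop()
    return errors


if __name__ == "__main__":
    nested = "“He said, ‘Tell them, “He is my brother.”’”"
    print(check_quotes(nested))
    print(check_quotes("“Oops’ he said."))
```
</pre>
    <p>Because only the most recent opening quote is compared, the
      triple-nested case passes without needing to know the nesting
      depth, and a multi-paragraph speech with no closing quote on the
      earlier paragraphs is not reported.</p>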

    <blockquote type="cite"

      cite="mid:CXS7BHXBHYXL.10L3J4OHX6PRS@cepl.eu">

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">Another quirk that occurs to me is that in English typesetting, if one 

person speaks multiple paragraphs (for example, the Sermon on the Mount) 

then each paragraph gets an opening double-quote, but no closing 

double-quote. That's going to play havoc with this kind of 

quote-checking tool, too.

</pre>

      </blockquote>

      <pre class="moz-quote-pre" wrap="">Yes, we don’t do this in Czech, but it is typographically

possible to just use paragraph indentation instead

of quoting and of course we don’t have anything like

indentation in the pure XML. I have just added quotes in

the appropriate places and plan to send the patch to the

Czech Biblical Society (after David reviews my fixes in

<a class="moz-txt-link-freetext" href="https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2">https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2</a>)

with some other clear bugs I have found.</pre>

    </blockquote>

    <p>See above.</p>

    <p><span style="white-space: pre-wrap">Unfortunately, it sounds like English speakers would want the script to apply different rules per language, which definitely complicates things. But that would make it more useful for automatically identifying likely transcription errors.

</span></p>

  </body>

</html>