[sword-devel] Python script for checking pairwise characters (PROOF-OF-CONCEPT)
Nathan Phillip Brink
ohnobinki at ohnopublishing.net
Tue Dec 19 10:20:25 EST 2023
On 2023-12-19 04:26, Matěj Cepl wrote:
> On Tue Dec 19, 2023 at 2:17 AM CET, Timothy Allen wrote:
>> 2. Apostrophes
>>
>> In English, the apostrophe used for possession (“the boy’s train”) and
>> omission (“don’t let’s start”) is traditionally set with the same
>> character used as the closing single quote, so in any non-trivial
>> document there will almost certainly be more "closing single quotes"
>> than opening single quotes; the imbalance is not worth reporting.
> Yes, I am aware of it, and I feel very blessed that I don’t
> have this problem in Czech. I have no idea what to do with
> this without proper syntactic analysis, which is out of the
> question. Perhaps, running `re.sub(r'’s\b', '@#s', whole_text)`
> and then back, but it seems like a recipe for disaster.
I think a better solution would be to make the script itself aware of
whether a closing single quote is acting as a closing quote or not. If the
closing single quote is followed by an alphabetic character (it should
be able to test Unicode character classes for this), then it should be
treated as an apostrophe instead. I don’t know if biblical texts
generally use contractions, but your regular expression doesn’t handle
contractions generally. Also, I only know English and I am quite
possibly missing some edge cases. Some examples:
* This isn’t a closing quote. (‘t’ is an alphabetic character)
* “I said, ‘This is a closing quote within a double-quoted phrase’”.
(‘”’ isn’t an alphabetic character)
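
A rough sketch of that rule in Python, in case it helps. The function name
is made up for illustration and is not part of Matěj's script. It relies on
`str.isalpha()`, which tests Unicode letter categories, so it is not limited
to ASCII:

```python
RIGHT_SINGLE = "\u2019"  # ’ (closing single quote / apostrophe)


def classify_right_singles(text):
    """Classify every ’ in text as an apostrophe or a closing quote.

    Hypothetical helper: a ’ directly followed by an alphabetic
    character is treated as an apostrophe, anything else as a
    closing quote.
    """
    results = []
    for i, ch in enumerate(text):
        if ch != RIGHT_SINGLE:
            continue
        following = text[i + 1 : i + 2]  # empty string at end of text
        kind = "apostrophe" if following.isalpha() else "closing"
        results.append((i, kind))
    return results
```

One edge case I already know this misses: a plural possessive such as
“the boys’ train” has the ’ followed by a space, so it would be counted as a
closing quote.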
>> 3. Nested quotations
>>
>> In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell
>> other people that she was Abraham’s sister. In the BSB (and NIV, and
>> ESV, and NASB) this results in a triple-nested quotation. In English
>> typesetting conventions the outermost quotation gets double-quotes, the
>> second level gets single-quotes, and the third level gets double quotes
>> again. This causes the script to report an error:
>>
>> I couldn't immediately think of a way to get around this.
> Me neither. We should probably make effort for error recovery, so
> that the script would continue even after reporting a problem,
> but I am not sure how to do that either.
The other approach would be checking what the counts are upon reaching a
terminating section. As mentioned below, in English, all quotes are
implicitly closed by the end of a paragraph. So any nonzero counts at
the end of a paragraph are OK. But when you encounter a closing quote,
you can make sure that the last opening quote is the same type of
quote. If you store the opening quote type in a stack, pop whenever you
encounter a closing quote while confirming a match, and report an error
upon trying to pop an empty stack or encountering a mismatched quote,
and clear the stack upon reaching a paragraph end, that would provide
something useful for English.
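
Here is a minimal sketch of that stack idea, under some assumptions of mine:
paragraphs are separated by blank lines, only “ ” and ‘ ’ are checked, and a
’ followed by a letter is skipped as an apostrophe. The function name and
the error tuples are invented for illustration. Errors are collected rather
than raised, so checking continues after the first problem:

```python
# Map each opening quote to the closing quote it expects.
PAIRS = {"\u201c": "\u201d", "\u2018": "\u2019"}  # “ ”, ‘ ’
CLOSERS = set(PAIRS.values())


def check_quotes(text):
    """Return a list of (paragraph_number, offset, message) tuples."""
    errors = []
    # Assumption: a blank line ends a paragraph.
    for para_no, para in enumerate(text.split("\n\n"), start=1):
        stack = []  # opening quotes still open in this paragraph
        for pos, ch in enumerate(para):
            if ch in PAIRS:
                stack.append(ch)
            elif ch in CLOSERS:
                # A ’ followed by a letter is treated as an apostrophe.
                if ch == "\u2019" and para[pos + 1 : pos + 2].isalpha():
                    continue
                if not stack:
                    errors.append((para_no, pos, "closing quote with nothing open"))
                    continue
                opener = stack.pop()
                if PAIRS[opener] != ch:
                    errors.append((para_no, pos, "closing quote does not match opener"))
        # Quotes still open here are allowed: the stack is simply
        # discarded when the paragraph ends.
    return errors
```

Because only the most recent opener is compared, the double/single/double
nesting from Genesis 20 passes, and a paragraph that opens a quote without
closing it is not reported. Popping on a mismatch is just one possible
recovery choice; leaving the stack untouched would be another.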
>> Another quirk that occurs to me is that in English typesetting, if one
>> person speaks multiple paragraphs (for example, the Sermon on the Mount)
>> then each paragraph gets an opening double-quote, but no closing
>> double-quote. That's going to play havoc with this kind of
>> quote-checking tool, too.
> Yes, we don’t do this in Czech, but it is typographically
> possible to just use paragraph indentation instead
> of quoting and of course we don’t have anything like
> indentation in the pure XML. I have just added quotes in
> the appropriate places and plan sending the patch to the
> Czech Biblical Society (after David reviews my fixes in
> https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2)
> with some other clear bugs I have found.
See above.
Unfortunately, it sounds like English speakers would want the script to
be aware of different rules per-language, which definitely complicates
things. But that would increase the utility in automatically identifying
likely transcription errors.