[sword-devel] Python script for checking pairwise characters (PROOF-OF-CONCEPT)

Matěj Cepl mcepl at cepl.eu
Tue Dec 19 04:26:42 EST 2023


On Tue Dec 19, 2023 at 2:17 AM CET, Timothy Allen wrote:
> I tried running it over my BSB module, and I hit problems fairly 
> quickly, some of which are more easily solved than others.
>
> 1. No support for language “en”
>
> This was easy enough to handle, there's a configuration variable near 
> the top of the file that lets you configure which quotes are used for 
> which languages.

Patch sent to my email would be welcome.
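
For illustration only, such a per-language table could look roughly like
the sketch below. The names `QUOTES` and `quote_pairs` are hypothetical
and the actual variable in the script may be named and shaped
differently; only the idea of mapping a language code to its quote
characters comes from the description above.

```python
# Hypothetical configuration: the real script's variable name and layout may differ.
QUOTES = {
    # language code -> (opening, closing) pairs, outermost level first
    "cs": [("„", "“"), ("‚", "‘")],
    "en": [("“", "”"), ("‘", "’")],
}


def quote_pairs(lang):
    """Return the (opening, closing) pairs configured for a language."""
    try:
        return QUOTES[lang]
    except KeyError:
        # Fail with a readable message instead of a bare KeyError.
        raise ValueError(f"no quote configuration for language {lang!r}") from None
```

Adding a language is then a one-line change to the table.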

> 2. Apostrophes
>
> In English, the apostrophe used for possession (“the boy’s train”) and 
> omission (“don’t let’s start”) is traditionally set with the same 
> character used as the closing single quote, so in any non-trivial 
> document there will almost certainly be more "closing single quotes" 
> than opening single quotes, it's not worth reporting on.

Yes, I am aware of it, and I feel very blessed that I don’t
have this problem in Czech. I have no idea what to do with
this without proper syntactic analysis, which is out of the
question. Perhaps running `re.sub(r'’s\b', '@#s', whole_text)`
before the check and substituting the apostrophes back afterwards
would work, but it seems like a recipe for disaster.
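
A slightly broader variant of that substitution, as a hedged sketch:
mask every ’ that sits between two word characters (“don’t”, “boy’s”)
with a placeholder, run the check, then restore. The placeholder
character and the function names are assumptions, and the sketch still
misses plural possessives (“boys’ train”) and leading elisions
(“’tis”), so it reduces the noise rather than removing it.

```python
import re

# Private-use code point; assumed not to occur in the module text.
PLACEHOLDER = "\ue000"

# A right single quote with a word character on both sides is taken to be
# an apostrophe, not a closing quotation mark.
INNER_APOSTROPHE = re.compile(r"(?<=\w)’(?=\w)")


def mask_apostrophes(text):
    """Replace word-internal apostrophes with the placeholder."""
    return INNER_APOSTROPHE.sub(PLACEHOLDER, text)


def unmask_apostrophes(text):
    """Undo mask_apostrophes()."""
    return text.replace(PLACEHOLDER, "’")
```

The quote check would run on the masked text, and any output shown to
the user would go through `unmask_apostrophes()` first.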

> 3. Nested quotations
>
> In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell 
> other people that she was Abraham’s sister. In the BSB (and NIV, and 
> ESV, and NASB) this results in a triple-nested quotation. In English 
> typesetting conventions the outermost quotation gets double-quotes, the 
> second level gets single-quotes, and the third level gets double quotes 
> again. This causes the script to report an error:
>
> I couldn't immediately think of a way to get around this.

Me neither. We should probably make an effort at error recovery, so
that the script continues even after reporting a problem,
but I am not sure how to do that either.
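
One general technique that might cover both points is a stack of
currently open quotations: each closing character is compared with the
most recent opener, so alternating double/single/double nesting is
accepted, and a mismatch is recorded without stopping the scan. This is
a sketch of that technique, not a description of how the script works
today; `PAIRS` and `check_quotes` are hypothetical names, and it assumes
apostrophes have already been dealt with.

```python
# opening character -> matching closing character (English-style quotes)
PAIRS = {"“": "”", "‘": "’"}
CLOSERS = set(PAIRS.values())


def check_quotes(text):
    """Return all (offset, message) problems instead of stopping at the first."""
    problems = []
    stack = []  # (offset, opening character) for each quotation still open
    for offset, char in enumerate(text):
        if char in PAIRS:
            stack.append((offset, char))
        elif char in CLOSERS:
            if stack and PAIRS[stack[-1][1]] == char:
                stack.pop()
            else:
                # Recovery: note the problem, keep the stack as it is, carry on.
                problems.append((offset, f"unexpected {char}"))
    # Anything left on the stack was opened but never closed.
    for offset, char in stack:
        problems.append((offset, f"unclosed {char}"))
    return sorted(problems)
```

Leaving the stack untouched on a mismatch is only one possible recovery
policy; it can produce follow-on reports, but the scan always reaches
the end of the text.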

> Another quirk that occurs to me is that in English typesetting, if one 
> person speaks multiple paragraphs (for example, the Sermon on the Mount) 
> then each paragraph gets an opening double-quote, but no closing 
> double-quote. That's going to play havoc with this kind of 
> quote-checking tool, too.

Yes, we don’t do this in Czech, but it is typographically
possible to just use paragraph indentation instead
of quoting, and of course we don’t have anything like
indentation in the pure XML. I have just added quotes in
the appropriate places and plan to send the patch to the
Czech Biblical Society (after David reviews my fixes in
https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2)
with some other clear bugs I have found.
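
For texts that do follow the English multi-paragraph convention quoted
above, a checker could perhaps tolerate it by treating an opening quote
at the very start of a paragraph as a continuation mark whenever a
quotation is still open from the previous paragraph. The sketch below
assumes the text is already split into paragraphs and only looks at one
quote level; `check_paragraphs` is a hypothetical helper, and a genuine
nested quotation that happens to begin a paragraph would be misread as
a continuation.

```python
def check_paragraphs(paragraphs, opening="“", closing="”"):
    """Return indexes of paragraphs with an unmatched quote character."""
    flagged = []
    open_in = []  # paragraph index of each quotation that is still open
    for index, paragraph in enumerate(paragraphs):
        text = paragraph.lstrip()
        if open_in and text.startswith(opening):
            # Continuation mark: the speaker carries on, nothing new is opened.
            text = text[len(opening):]
        for char in text:
            if char == opening:
                open_in.append(index)
            elif char == closing:
                if open_in:
                    open_in.pop()
                else:
                    flagged.append(index)  # closing quote with nothing open
    flagged.extend(open_in)  # quotations that were never closed
    return sorted(set(flagged))
```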

> Perhaps this kind of tool just isn't suited to checking English text... 
> but I'm sure there's other languages with more sensible conventions that 
> it could help with. Good luck with it!

With https://gitlab.com/crosswire-bible-society/CzeCEP/-/merge_requests/4/diffs
I have managed to make CzeCEP behave. Now I will try other Czech modules.

Blessings,

Matěj

-- 
http://matej.ceplovi.cz/blog/, @mcepl at floss.social
GPG Finger: 3C76 A027 CA45 AD70 98B5  BC1D 7920 5802 880B C9D8
 
Power tends to corrupt and absolute power corrupts
absolutely. Great men are almost always bad men, […]
  -- Lord Acton (including the more important part of the often
     misquoted statement)