[sword-devel] Python script for checking pairwise characters (PROOF-OF-CONCEPT)

Timothy Allen thristian at gmail.com
Mon Dec 18 20:17:23 EST 2023


On 19/12/23 00:06, Matěj Cepl wrote:
> I have decided not to rely on very kind help by David
> with his Windows tools and I have written (hopefully)
> completely platform neutral pure Python 3 script for checking
> pairwise-characters. So, far it was used only for fixing
> https://gitlab.com/crosswire-bible-society/CzeCEP/-/issues/2  and
> I am quite sure it is pretty buggy, but it could be proven useful
> for somebody.

Thank you for doing this work! This seems like it could be a useful tool 
for validating texts of all kinds.

I tried running it over my BSB module, and I hit problems fairly 
quickly, some of which are more easily solved than others.

1. No support for language “en”

This was easy enough to handle: there's a configuration variable near
the top of the file that lets you configure which quotes are used for
which languages.
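
As a rough illustration only, a per-language quote table of that kind might look like the sketch below. The name QUOTE_PAIRS, the helper pairs_for and the dictionary layout are assumptions made for this example, not necessarily what the script really uses. The Czech entry follows the usual „…“ and ‚…‘ convention.

```python
# Hypothetical per-language table of (opening, closing) quote pairs.
# The variable name and layout are assumptions, not the script's real config.
QUOTE_PAIRS = {
    "cs": [("\u201e", "\u201c"), ("\u201a", "\u2018")],  # „ “ and ‚ ‘
    "en": [("\u201c", "\u201d"), ("\u2018", "\u2019")],  # “ ” and ‘ ’
}


def pairs_for(lang):
    """Return the quote pairs for a language code, or fail with a clear message."""
    try:
        return QUOTE_PAIRS[lang]
    except KeyError:
        raise ValueError(f"No quote configuration for language {lang!r}") from None
```

With a layout like this, supporting a new language is a one-line addition to the table.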

2. Apostrophes

In English, the apostrophe used for possession (“the boy’s train”) and
omission (“don’t let’s start”) is traditionally set with the same
character used as the closing single quote. In any non-trivial
document there will therefore almost certainly be more “closing single
quotes” than opening single quotes, so the imbalance is not worth
reporting.

I got around this by just deleting single quotes from the configuration.
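
To make the imbalance concrete, here is a small sketch that counts both characters in a made-up sentence. It also shows one possible heuristic that is an alternative to deleting single quotes outright: ignore a ’ that sits between two word characters. The function names are hypothetical, and the heuristic is only an assumption about what might work. It would still miss plural possessives (“the boys’ train”) and leading elisions (“’tis”).

```python
import re

# A right single quote with a word character on both sides is almost
# certainly an apostrophe (don’t, boy’s) rather than a closing quote.
WORD_APOSTROPHE = re.compile(r"(?<=\w)\u2019(?=\w)")


def single_quote_counts(text):
    """Return (opening, closing) counts of ‘ and ’ in text."""
    return text.count("\u2018"), text.count("\u2019")


def drop_word_apostrophes(text):
    """Replace word-internal ’ with a plain ASCII apostrophe."""
    return WORD_APOSTROPHE.sub("'", text)


sample = "\u2018Don\u2019t touch the boy\u2019s train,\u2019 she said."
naive = single_quote_counts(sample)
filtered = single_quote_counts(drop_word_apostrophes(sample))
```

For this sample, the naive count is one opening against three closing characters, while the filtered count is one against one.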

3. Nested quotations

In Genesis 20:11-13, Abraham tells Abimelech that he told Sarah to tell
other people that he was her brother. In the BSB (and NIV, and
ESV, and NASB) this results in a triple-nested quotation. In English
typesetting conventions the outermost quotation gets double quotes, the
second level gets single quotes, and the third level gets double quotes
again. This causes the script to report an error:

> Balance for  character “ is over one in Gen.20.13

I couldn't immediately think of a way to get around this.
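
One direction that might be worth exploring is to track nesting with a stack instead of keeping a per-character balance, so that a second “ inside a ‘…’ is accepted as long as everything closes in the right order. The sketch below is only that, a sketch: the function name check_nesting is hypothetical, it is not how the script currently works, and it leans on the assumption that a ’ between two word characters is an apostrophe.

```python
import re

OPENERS = {"\u201c": "\u201d", "\u2018": "\u2019"}  # “ -> ”, ‘ -> ’
CLOSERS = set(OPENERS.values())
WORD_APOSTROPHE = re.compile(r"(?<=\w)\u2019(?=\w)")


def check_nesting(text):
    """Return a list of problems; an empty list means the quotes nest cleanly."""
    problems = []
    stack = []  # closing characters still expected, innermost last
    # Neutralise word-internal apostrophes (one-for-one, so offsets are kept).
    text = WORD_APOSTROPHE.sub("'", text)
    for pos, ch in enumerate(text):
        if ch in OPENERS:
            stack.append(OPENERS[ch])
        elif ch in CLOSERS:
            if stack and stack[-1] == ch:
                stack.pop()
            else:
                problems.append(f"unexpected {ch} at offset {pos}")
    if stack:
        problems.append(f"{len(stack)} quotation(s) left open")
    return problems
```

A made-up sentence with the same three-level shape as the verse, such as “I said, ‘Say of me, “He is my brother.”’”, passes this check, while a quotation that is never closed is still reported.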

Another quirk that occurs to me is that in English typesetting, if one 
person speaks multiple paragraphs (for example, the Sermon on the Mount) 
then each paragraph gets an opening double-quote, but no closing 
double-quote. That's going to play havoc with this kind of 
quote-checking tool, too.
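
A checker could perhaps allow for this by treating an opening double quote at the very start of a paragraph as a continuation mark whenever a quotation is already open. The sketch below assumes the text has already been split into paragraphs, which may not match how the script reads a module. The function name is hypothetical, and only double quotes are handled.

```python
def open_quotes_after(paragraphs):
    """Return how many double quotations are still open after the last paragraph."""
    depth = 0
    for para in paragraphs:
        body = para.lstrip()
        # An opening quote at the start of a paragraph, while a quotation is
        # already open, is read as a continuation mark, not a new quotation.
        if depth > 0 and body.startswith("\u201c"):
            body = body[1:]
        for ch in body:
            if ch == "\u201c":
                depth += 1
            elif ch == "\u201d":
                depth -= 1
    return depth
```

A three-paragraph speech with three opening quotes and one closing quote then comes out balanced, although a genuinely unclosed quotation followed by a new one at the start of the next paragraph would be masked by the same rule.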

Perhaps this kind of tool just isn't suited to checking English text...
but I'm sure there are other languages with more sensible conventions that
it could help with. Good luck with it!


Timothy.