[sword-devel] Soft hyphens
David Haslam
dfhmch at googlemail.com
Thu Nov 2 14:16:32 MST 2017
Regexp `([ [:punct:]]\xAD|\xAD[ [:punct:]])` is a reasonable definition for a
"useless soft hyphen",
unless in the language there is a punctuation mark that is used as part of a
word.
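As a rough illustration of that definition, here is a minimal Python sketch. Python's `re` module has no `[[:punct:]]` class, so the sketch assumes that "punctuation" means any character whose Unicode general category starts with `P`, which is not exactly the POSIX class (ASCII symbols such as `+` or `$` are excluded). The function name and the `word_punct` parameter are hypothetical; the parameter lists punctuation marks that a given language uses inside words, so that they do not count as a boundary.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"


def is_space_or_punct(ch: str) -> bool:
    """Approximate [ [:punct:]] as a space or any Unicode P* category."""
    return ch == " " or unicodedata.category(ch).startswith("P")


def strip_useless_soft_hyphens(text: str, word_punct=frozenset()) -> str:
    """Drop soft hyphens that sit next to a space or punctuation mark.

    Characters listed in word_punct (an assumption of this sketch) are
    treated as letters, e.g. U+2019 in a language that uses it in words.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == SOFT_HYPHEN:
            prev_ch = text[i - 1] if i > 0 else ""
            next_ch = text[i + 1] if i + 1 < len(text) else ""
            neighbours = [c for c in (prev_ch, next_ch)
                          if c and c not in word_punct]
            if any(is_space_or_punct(c) for c in neighbours):
                continue  # useless soft hyphen: skip it
        out.append(ch)
    return "".join(out)
```

Passing `word_punct={"\u2019"}` would keep a soft hyphen that touches a right single quotation mark, which is the Lingala case discussed in this thread.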
The inventors of some alphabets chose more wisely than others by allocating
for the glottal stop the character called "modifier letter turned comma"
rather than the simple apostrophe or "right single quotation mark",
e.g. Hawaiian and Tongan.
But I take your point.
And yes, Lingala does use the "right single quotation mark" as part of a
word.
U+00AB « 5,959 LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
U+00BB » 5,956 RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
U+2018 ‘ 694 LEFT SINGLE QUOTATION MARK
U+2019 ’ 8,072 RIGHT SINGLE QUOTATION MARK
U+201A ‚ 3 SINGLE LOW-9 QUOTATION MARK
U+201C “ 34 LEFT DOUBLE QUOTATION MARK
U+201D ” 21 RIGHT DOUBLE QUOTATION MARK
That there are more than 8000 instances of U+2019 is evidence of this use.
Some of the left ones may be typos, or there may be some real use as a third
level of quotation mark.
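A tally like the one above could be produced along these lines. This is only a sketch: the function name is hypothetical, and it is not necessarily how the counts quoted here were obtained.

```python
from collections import Counter
import unicodedata

# The seven quotation marks listed in the tally.
QUOTES = "\u00ab\u00bb\u2018\u2019\u201a\u201c\u201d"


def count_quotation_marks(text: str):
    """Return (code point, character, count, Unicode name) per mark found."""
    counts = Counter(ch for ch in text if ch in QUOTES)
    return [
        (f"U+{ord(ch):04X}", ch, counts[ch], unicodedata.name(ch))
        for ch in QUOTES
        if counts[ch]
    ]
```

Running it over the decoded text of the OSIS file would give one row per quotation mark that actually occurs.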
Anyway, I just checked the OSIS XML file from Cyrille from 5 days ago.
There were no occurrences of either `\xAD\x{2019}` or `\x{2019}\xAD`.
So in that sense it was a safe thing to do when I removed the "useless"
ones.
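The same check can be sketched in Python, where the two patterns become `\u00ad\u2019` and `\u2019\u00ad`. The function name is hypothetical, and the text is assumed to have been read and decoded as UTF-8 beforehand.

```python
import re

# Soft hyphen directly before or after a right single quotation mark.
ADJACENT = re.compile("\u00ad\u2019|\u2019\u00ad")


def count_shy_next_to_rsquo(text: str) -> int:
    """Count non-overlapping soft hyphen / U+2019 adjacencies in text."""
    return len(ADJACENT.findall(text))
```

A result of 0 for the whole file would correspond to the "no occurrences" finding reported here.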
Yet given the overall purpose of the soft hyphen, it seems to me now that
it's really a question far better addressed during text development
than during module build or within SWORD filtering.
Fr Cyrille has agreed that we can postprocess the generated OSIS file by
removing them.
The only unsettled question concerns the USFM files themselves.
Best regards,
David
--
Sent from: http://sword-dev.350566.n4.nabble.com/