[sword-devel] Fw: Repurposing U+2019 RIGHT SINGLE QUOTATION MARK as a Lexical Word Divider for the SE Asian scripts that have NO SPACE BETWEEN WORDS

David Haslam dfhdfh at protonmail.com
Thu May 29 10:46:57 EDT 2025


NB. I have cancelled the earlier email because the attachment was too large for sword-devel.
It had been in the queue for moderator approval.

The eXperimental module KhmerNTx.zip may now be downloaded from this [link](https://app.box.com/s/e613wf1qdxbjmvux9gbb6vmes33d2rol) on my box.net account.

Please see below for the significant details.

Best regards,

David

Sent with [Proton Mail](https://pr.tn/ref/SWXT9A5YZ67G) secure email.

------- Forwarded Message -------
From: David Haslam <dfhdfh at protonmail.com>
Date: On Thursday, May 29th, 2025 at 9:26 AM
Subject: Repurposing U+2019 RIGHT SINGLE QUOTATION MARK as a Lexical Word Divider for the SE Asian scripts that have NO SPACE BETWEEN WORDS
To: sword-devel mailing list <sword-devel at crosswire.org>
CC: steve.antioch at gmail.com <steve.antioch at gmail.com>, Modules Issues <modules at crosswire.org>

> Dear SWORD Developers (and our Modules Team),
>
> While watching the [livestream funeral](https://www.youtube.com/live/zC4hXOgqBak?si=JZ7JiM7j_fHW-sQl) of OT Scholar the late Gordon D Wenham yesterday (St Mary's Church, Charlton Kings), I had a bright idea.
>
> I'd been working recently on potential improvements for the KhmerNT module relating to marking the Lexical Word Divisions.
> Khmer is one of the languages of SE Asia whose Writing System (aka Script) largely has NO SPACE BETWEEN WORDS.
> Others include: Lao, Thai, Myanmar (aka Burmese), together with other languages in the region that employ one of these scripts (e.g. Isaan).
>
> Until now, the KhmerNT module has used the ZWSP (U+200B ZERO WIDTH SPACE) to mark lexical word boundaries.
> This helps with SWORD search for whole words, because even though the divisions between words are invisible to human eyes, they are accessible to computer software.
>
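To illustrate why invisible dividers help whole-word search, here is a minimal Python sketch. It is not SWORD's search code: the function names `split_words` and `contains_whole_word` are hypothetical, and the sample string is a short phrase assembled for the example, with ZWSP inserted between words.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE, invisible when rendered


def split_words(text: str) -> list[str]:
    """Split text on ordinary whitespace and on ZWSP, dropping empty pieces."""
    words = []
    for chunk in text.split():  # str.split() does not treat ZWSP as whitespace
        words.extend(w for w in chunk.split(ZWSP) if w)
    return words


def contains_whole_word(text: str, word: str) -> bool:
    """True only if `word` occurs as a complete lexical word in `text`."""
    return word in split_words(text)


# Hypothetical sample: Khmer words separated by ZWSP, phrases by a space.
sample = "ស្រុក" + ZWSP + "អាស៊ី" + " " + "និង" + ZWSP + "ស្រុក"
words = split_words(sample)
```

A plain substring search cannot tell a word from a fragment of a longer word; matching against the split list can.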
> Wouldn't it be nice if ... (cue to sing the melody by the Beach Boys) 🎶
>
> - We could instead use a visible Unicode character
> - That character could be hidden by means of an existing SWORD filter
>
> There is such a character!!!
>
> - U+2019 is one of the codepoints hidden (or changed) by the filter UTF8GreekAccents.
>
>> U+2019 (RIGHT SINGLE QUOTATION MARK) is commonly used in digital editions of the Greek NT as the apostrophe, not as a quotation mark.
>>
>> In NT Greek, it appears in:
>>
>> - Elisions: When a vowel at the end of a word is dropped (e.g., δι’ instead of διά before a vowel).
>> - Contractions or abbreviations: e.g., ἐπ’ for ἐπί, καθ’ for κατά.
>>
>> While U+2019 is typographically correct for apostrophes in modern typesetting, some older or simpler digital texts may use U+0027 (straight apostrophe). However, U+2019 is the preferred character in high-quality, properly typeset Greek texts.
>
> I then set about testing my idea by making a further update to an already eXperimental version of the module, provisionally named KhmerNTx.
>
> It "worked like a dream". 😎
>
> With Greek accents hidden, the text looks like this:
>
>> ខ្ញុំពេត្រុស ជាសាវករបស់ព្រះយេស៊ូគ្រិស្ដ ជូនចំពោះពួកអ្នកដែលព្រះជាម្ចាស់បានជ្រើសរើស ហើយដែលបានបែកខ្ញែកគ្នាទៅស្នាក់នៅបណ្ដោះអាសន្ននៅស្រុកប៉ុនតុស ស្រុកកាឡាទី ស្រុកកាប៉ាដូគា ស្រុកអាស៊ី និងស្រុកប៉ីធូនា (I Peter 1:1 [KhmerNTx])
>
> With Greek accents displayed, the text looks like this:
>
>> ខ្ញុំ’ពេត្រុស ជា’សាវក’របស់’ព្រះ’យេស៊ូ’គ្រិស្ដ ជូន’ចំពោះ’ពួកអ្នក’ដែល’ព្រះជាម្ចាស់’បាន’ជ្រើសរើស ហើយ’ដែល’បាន’បែកខ្ញែក’គ្នា’ទៅ’ស្នាក់’នៅ’បណ្ដោះអាសន្ន’នៅ’ស្រុក’ប៉ុនតុស ស្រុក’កាឡាទី ស្រុក’កាប៉ាដូគា ស្រុក’អាស៊ី និង’ស្រុក’ប៉ីធូនា (I Peter 1:1 [KhmerNTx])
>
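The two views above can be approximated with a short Python sketch. This is only an illustration of the idea, not the actual UTF8GreekAccents filter, which handles more codepoints; the function names are hypothetical, and the sample is the tail of the I Peter 1:1 text quoted above.

```python
ZWSP = "\u200b"     # U+200B ZERO WIDTH SPACE (the old, invisible divider)
DIVIDER = "\u2019"  # U+2019 RIGHT SINGLE QUOTATION MARK (the visible divider)


def zwsp_to_divider(text: str) -> str:
    """Module preparation step: replace each ZWSP with the visible divider."""
    return text.replace(ZWSP, DIVIDER)


def hide_dividers(text: str) -> str:
    """Display step: strip the divider, as a filter set to 'hidden' might."""
    return text.replace(DIVIDER, "")


# End of I Peter 1:1 as displayed with the dividers visible.
shown = "ស្រុក\u2019អាស៊ី និង\u2019ស្រុក\u2019ប៉ីធូនា"
hidden = hide_dividers(shown)
```

One caveat under this assumption: stripping every U+2019 would also remove any genuine apostrophes or closing quotes encoded with the same codepoint, so the approach suits texts that do not otherwise use it.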
> I have attached the compressed module for any of you to explore & play with further.
>
> Aside: The previous update already made use of the OSIS XML w element to enclose each lexical Khmer word. That remains the case.
> In this way, the module source text is ready to be adapted for further enhancements, such as adding Strong's numbers, to make a Study Edition.
>
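As a rough sketch of what enclosing each lexical word in an OSIS `w` element could look like, the Python below wraps divider-separated words. The function name `wrap_words` is hypothetical, and keeping the U+2019 divider between adjacent `w` elements is an assumption for illustration; the actual module source may be arranged differently.

```python
from xml.sax.saxutils import escape

DIVIDER = "\u2019"  # U+2019, assumed here to separate lexical words


def wrap_words(verse_text: str) -> str:
    """Wrap each lexical word in <w>...</w>, keeping spaces and dividers."""
    phrases = []
    for phrase in verse_text.split(" "):
        words = [w for w in phrase.split(DIVIDER) if w]
        # Assumption: the divider stays between the w elements.
        phrases.append(DIVIDER.join(f"<w>{escape(w)}</w>" for w in words))
    return " ".join(phrases)


markup = wrap_words("និង\u2019ស្រុក\u2019ប៉ីធូនា")
```

With every word in its own element, attributes such as a lemma reference could later be added per word without re-segmenting the text.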
> Steve Hyde and the translators in Cambodia are currently preparing to publish the complete Khmer Bible.
> He has requested my assistance in improving the actual word divisions for the 39 OT books.
> I've already been sent the source text, exported from their database.
>
> Since early May, I have been exploring how the Grok AI engine can make a positive contribution to the success of this challenging task.
> More on that subject later.
>
> Best regards,
>
> David