[sword-devel] Bible in Myanmar
David Haslam
dfhdfh at protonmail.com
Thu May 16 01:13:47 MST 2019
Cyrille writes,
"Do you know which mark?"
I've yet to do the more detailed analysis, but these 3 are initial candidates:
U+1038 း 7,959 MYANMAR SIGN VISARGA
U+104A ၊ 601 MYANMAR SIGN LITTLE SECTION
U+104B ။ 1,489 MYANMAR SIGN SECTION
But as I observed before, determining where each verse ends requires more than a simple "blanket" rule.
Compare: the Visarga sign occurs far more often than at verse end, just as the KJV has far more commas than those that end a verse.
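One way to check this is to count each candidate mark twice: once over the whole text and once only where it is the last character on a line. The Python sketch below does that. It assumes, hypothetically, that the text has already been split so that each line holds one verse or paragraph; `count_marks` and the sample strings are illustrative only and are not taken from the actual file.

```python
from collections import Counter

# Candidate verse-final marks named in the list above.
CANDIDATES = {
    "\u1038": "MYANMAR SIGN VISARGA",
    "\u104A": "MYANMAR SIGN LITTLE SECTION",
    "\u104B": "MYANMAR SIGN SECTION",
}


def count_marks(lines):
    """Return (total, at_line_end) Counters for the candidate marks.

    Assumption: each line is one verse or paragraph, so "end of line"
    stands in for "end of verse".
    """
    total = Counter()
    at_end = Counter()
    for line in lines:
        text = line.rstrip()  # ignore trailing blanks and the newline
        for ch in text:
            if ch in CANDIDATES:
                total[ch] += 1
        if text and text[-1] in CANDIDATES:
            at_end[text[-1]] += 1
    return total, at_end


if __name__ == "__main__":
    # Made-up sample lines, written with escapes so the marks are explicit.
    sample = [
        "\u1000\u1038 \u1001\u104A \u1002\u104B\n",
        "\u1000\u1038\u1001\u1038\n",
    ]
    total, at_end = count_marks(sample)
    for ch, name in CANDIDATES.items():
        print(f"U+{ord(ch):04X} {name}: total={total[ch]} line-final={at_end[ch]}")
```

A large gap between the two counts for a mark suggests that it cannot serve as a verse delimiter on its own.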
Observations: (continued)
Still within the scope of contents.pp.txt derived from Mat_utf8.odt
7. I just found an anomalous 'S' that looks like a further font conversion bug.
ဒါဝိဒ်မင်းကြီးတွင် ဥရိယ၏ဇနီးဖြစ်ခဲ့ဖူးသည့်မိန်းမမှ ဖွားမြင်သောသား ဆောလမွန်၊- ဆောလမွန်၏သား ရေဟိုးဘိုအမ်၊ ရေဟိုးဘိုအမ်၏သား အာဘီဂျ၊ အာဘီဂျ ၏သား အာဆ၊- အာဆ၏သား ဂျေဟိုးရှဖတ်၊ ဂျေဟိုး Sရှဖတ်၏သား ဂျော်ရမ်၊
and also
- သင်တို့သည် အဘယ်ကြောင့် အဝတ်အထည်အဖို့ စိုးရိမ်ကြောင့်ကြနေကြသနည်း။ လယ်ကွင်းပြင်ရ dS
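Stray letters like these can be located mechanically. The following Python sketch flags any letter outside the Myanmar Unicode blocks on lines that do contain Myanmar text. The function name `stray_letters` is made up for illustration, and limiting the check to the three Myanmar blocks is an assumption about what counts as expected text.

```python
import unicodedata

# Myanmar, Myanmar Extended-B and Myanmar Extended-A blocks.
MYANMAR_RANGES = ((0x1000, 0x109F), (0xA9E0, 0xA9FF), (0xAA60, 0xAA7F))


def is_myanmar(ch):
    """True if the character lies in one of the Myanmar blocks."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in MYANMAR_RANGES)


def stray_letters(line):
    """Return (index, char, name) for each letter outside the Myanmar blocks.

    Lines with no Myanmar characters at all (for example an English
    title) are skipped, since they are unlikely to be conversion residue.
    Spaces, digits and punctuation are never reported.
    """
    if not any(is_myanmar(ch) for ch in line):
        return []
    hits = []
    for i, ch in enumerate(line):
        if unicodedata.category(ch).startswith("L") and not is_myanmar(ch):
            hits.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


if __name__ == "__main__":
    # Made-up sample: three Myanmar characters, a space, a stray "S".
    for hit in stray_letters("\u1002\u103B\u1031 S\u101B"):
        print(hit)
```

Running this over every line and printing the line number with each hit gives a worklist of places to compare against the legacy-font original.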
Best regards,
David
Sent with [ProtonMail](https://protonmail.com) Secure Email.
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, May 15, 2019 7:43 PM, Cyrille <lafricain79 at gmail.com> wrote:
> On 15/05/2019 19:18, David Haslam wrote:
>
>> Each of the last 1 or 2 characters of each verse is a regular Myanmar punctuation mark.
>
> Do you know which mark?
>
>> We need to be careful how we apply this. There may well be some exceptions.
>>
>> Windows users should install BabelPad. This free Unicode text editor is highly recommended.
>>
>> http://www.babelstone.co.uk/Software/BabelPad.html
>>
>> It will help in all sorts of ways, not least in analysis.
>>
>> David
>>
>> Sent from ProtonMail Mobile
>>
>> On Wed, May 15, 2019 at 18:08, Cyrille <lafricain79 at gmail.com> wrote:
>>
>>> I haven't understood everything yet, but I trust you. If you have the patience to explain it to me, I want to learn :)
>>> What I don't understand is how you can find the marker for each verse and chapter in the UTF-8 text. What is the marker in question?
>>>
>>> On 15/05/2019 19:03, David Haslam wrote:
>>>
>>>> Michael’s description matches how I imagined the method during my waking moments this morning. :)
>>>>
>>>> David
>>>>
>>>> On Wed, May 15, 2019 at 17:33, Michael H <cmahte at gmail.com> wrote:
>>>>
>>>>> I've been working long hours and emailing in my break time. David has the basics of converting to VPL.
>>>>>
>>>>> I would then make the entire work a column in a spreadsheet.
>>>>>
>>>>> Then in other columns insert a list of Book/chapter/verse in order.
>>>>>
>>>>> The BCV and versetext columns should align and can be verified, and adjusted where things don't match perfectly, like maybe 3 John has 15 instead of 14 verses.
>>>>>
>>>>> Once the columns align, you can merge them into another column via concatenation operations (&). This last column becomes your output.
>>>>>
>>>>> The output needs to consider that section titles and section ranges belong in front of the verse marker. That is a bit more complex search and replace, but can be done successfully.
>>>>>
>>>>> On Wed, May 15, 2019 at 11:12 AM David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>
>>>>>> The attachment contains a counted list of Myanmar words containing a font conversion error.
>>>>>> NB. We need to match these words with what they are in the legacy font.
>>>>>>
>>>>>> This issue should be discussed with the current maintainer of the SIL TECkit converter, whoever that may be.
>>>>>>
>>>>>> It may be worthwhile asking our friends at the SIL Writing Systems Technology team. See
>>>>>> https://scripts.sil.org/default
>>>>>>
>>>>>> Aside: My friend Martin Hosken of SIL knew the late Keith Stribley - the former webmaster of ThanLwinSoft.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> David
>>>>>>
>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>> On Wednesday, May 15, 2019 4:41 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>>
>>>>>>> Observations: (continued)
>>>>>>>
>>>>>>> 5. The string "Kd;" also looks anomalous. It's found only once in
>>>>>>> ကိုယ်တော်၏ဦးခေါင်းတော်အပေါ်၌ လည်း ဤသူသည်ကား ဂျူးလူမျ Kd;တို့၏ဘုရင်၊
>>>>>>>
>>>>>>> 6. It's evident from the PDF file that the text is paragraphed with indented first lines. See
>>>>>>> https://www.dropbox.com/s/do5e675i19xfomf/Screenshot%202019-05-15%2016.29.10.png?dl=0
>>>>>>>
>>>>>>> My hunch is that these leading paragraph indents may have been coded within contents.xml as the self-closing element <text:tab/>. There are 372 matches to this.
>>>>>>>
>>>>>>> So not only do we need to provide chapter and verse tags (plus section headings & parallel passage titles, etc), we also need to reconstruct all the paragraph tags.
>>>>>>>
>>>>>>> NB. All structural XML indents were removed by the "Remove blanks at SOL" step of my simple TextPipe filter, which output the file contents.pp.txt. So they are quite a different matter from the paragraph indents.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>> On Wednesday, May 15, 2019 2:22 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>>>
>>>>>>>> Observations: (continued)
>>>>>>>>
>>>>>>>> 4. In addition to the reported instances of the anomalous 3 characters (È,Ø,ò) found after the font conversion,
>>>>>>>> there are 6 instances of the string "m;" that are also probably due to bugs in the converter.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>>> On Wednesday, May 15, 2019 12:41 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yep - sure - later I can do that.
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> On Wed, May 15, 2019 at 11:26, Cyrille <lafricain79 at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> David, I have no Box account, and I don't want to create one. Can you upload it to https://framadrop.org/ instead? It's totally free and secure (and private).
>>>>>>>>>> Thank you.
>>>>>>>>>>
>>>>>>>>>> On 15/05/2019 11:46, David Haslam wrote:
>>>>>>>>>>
>>>>>>>>>>> Interim progress report.
>>>>>>>>>>>
>>>>>>>>>>> I downloaded the file Mat_utf8.zip from Cyrille's link and unzipped the contents to Mat_utf8-odt
>>>>>>>>>>>
>>>>>>>>>>> I opened the .odt file using 7-Zip from the Windows Explorer context menu, and extracted the file contents.xml
>>>>>>>>>>>
>>>>>>>>>>> I used Notepad++ plug-in XMLTools to pretty print the XML file and saved it as contents.pp.xml
>>>>>>>>>>> This is simply a layout change that's easier to read.
>>>>>>>>>>>
>>>>>>>>>>> I viewed the .pp.xml file in BabelPad, which confirmed that the non-XML text was (mostly) Myanmar Unicode.
>>>>>>>>>>>
>>>>>>>>>>> I used a TextPipe filter to remove all XML tags, blanks from SOL & EOL and all blank lines.
>>>>>>>>>>> The output file is now contents.pp.txt
>>>>>>>>>>>
>>>>>>>>>>> This is now something that's readable content in Myanmar Unicode, with some English text such as "The Gospel according Matthew" near the start.
>>>>>>>>>>>
>>>>>>>>>>> The file is best viewed using BabelPad with the option Display Colours | Colour Code by Script.
>>>>>>>>>>> This shows Myanmar characters in light green, and non-Myanmar characters in other colours.
>>>>>>>>>>>
>>>>>>>>>>> Observations:
>>>>>>>>>>> 1. The font conversion to Unicode left a few scattered characters unconverted. :(
>>>>>>>>>>>
>>>>>>>>>>> 0000C8 È 18 LATIN CAPITAL LETTER E WITH GRAVE
>>>>>>>>>>> 0000D8 Ø 20 LATIN CAPITAL LETTER O WITH STROKE
>>>>>>>>>>> 0000F2 ò 3 LATIN SMALL LETTER O WITH GRAVE
>>>>>>>>>>>
>>>>>>>>>>> The complete character frequency analysis is attached.
>>>>>>>>>>>
>>>>>>>>>>> 2. A few verse numbers (?) are still present here and there.
>>>>>>>>>>> 3. The content contains section headings and parallel passage headings as well as verse text.
>>>>>>>>>>>
>>>>>>>>>>> I have just uploaded the file contents.pp.zip to a new folder in my Box account and added Cyrille & Michael as viewers.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>>>>>> On Monday, May 13, 2019 9:19 AM, Cyrille <lafricain79 at gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>> I recently received a modern Myanmar translation of the NT, Psalms and
>>>>>>>>>>>> Proverbs, with permission to create a new module.
>>>>>>>>>>>> But there are many problems... The first is getting the text.
>>>>>>>>>>>> I tried different ways, but it was typeset with PageMaker!
>>>>>>>>>>>> I can get the text, but I don't have the verse numbers,
>>>>>>>>>>>> because they are in an adjacent parallel column, and when I copy I get
>>>>>>>>>>>> only the biblical text.
>>>>>>>>>>>> I also have a PDF, but when I convert it to text (with pdftotext) the
>>>>>>>>>>>> columns are mixed.
>>>>>>>>>>>> Can someone help me with any ideas?
>>>>>>>>>>>> The next problem is Unicode... The text is not typed in Unicode but uses
>>>>>>>>>>>> a special font.
>>>>>>>>>>>> I can send everything you need or push it to git.crosswire.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for help.
>>>>>>>>>>>>
>>>>>>>>>>>> sword-devel mailing list:
>>>>>>>>>>>> sword-devel at crosswire.org
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>>>>>>> Instructions to unsubscribe/change your settings at above page
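For anyone without 7-Zip, Notepad++ or TextPipe, the extraction steps from the interim progress report quoted above can be approximated in Python. This is only a sketch under stated assumptions: the function names are invented for illustration, tags are stripped with a simple regular expression rather than a real XML parser, and a line break is forced at the end of each `text:p` and `text:h` element so that paragraphs stay separate even if the XML is not pretty-printed. ODF packages store the document body in a member named content.xml, so that is the default here; pass another name if the file differs.

```python
import html
import re
import zipfile

TAG = re.compile(r"<[^>]+>")
PARA_END = re.compile(r"</text:(?:p|h)>")


def strip_markup(xml_text):
    """Drop XML tags, trim blanks at start and end of line, skip blank lines."""
    xml_text = PARA_END.sub("\n", xml_text)  # keep paragraphs on separate lines
    lines = []
    for raw in xml_text.splitlines():
        text = TAG.sub("", raw).strip()
        if text:
            # Decode entities such as &amp; only after the tags are gone.
            lines.append(html.unescape(text))
    return lines


def count_tabs(xml_text):
    """Count self-closing <text:tab/> elements (possible paragraph indents)."""
    return xml_text.count("<text:tab/>")


def odt_text_lines(odt_path, member="content.xml"):
    """Read one member of an .odt package (a zip file) and return its text lines."""
    with zipfile.ZipFile(odt_path) as zf:
        xml_text = zf.read(member).decode("utf-8")
    return strip_markup(xml_text)
```

Usage would be along the lines of `lines = odt_text_lines("Mat_utf8.odt")`, followed by writing `lines` to a text file for inspection in BabelPad.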
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: content.pp.fce.count.txt
URL: <http://www.crosswire.org/pipermail/sword-devel/attachments/20190516/73404bd5/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: content.pp.character.frequency.txt
URL: <http://www.crosswire.org/pipermail/sword-devel/attachments/20190516/73404bd5/attachment-0003.txt>
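A listing in the same layout as the character frequency rows quoted in the thread (code point, character, count, Unicode name) can be produced with the Python standard library alone. This is a hedged sketch: `frequency_rows` is a hypothetical helper, and skipping whitespace is an assumption rather than a description of how the attached file was generated.

```python
import unicodedata
from collections import Counter


def frequency_rows(text):
    """Return one 'CODEPOINT CHAR COUNT NAME' row per distinct character.

    Rows are sorted by code point. Whitespace is skipped (an assumption).
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    rows = []
    for ch, n in sorted(counts.items()):
        name = unicodedata.name(ch, "<unnamed>")
        rows.append(f"{ord(ch):06X} {ch} {n} {name}")
    return rows


if __name__ == "__main__":
    # Made-up sample text containing the three unconverted characters.
    for row in frequency_rows("\u00C8 \u00D8\u00D8\n\u00F2"):
        print(row)
```

Rows whose names do not start with MYANMAR are the ones worth checking against the legacy font.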