[sword-devel] Bible in Myanmar

David Haslam dfhdfh at protonmail.com
Wed May 15 10:18:47 MST 2019


The last one or two characters of each verse are regular Myanmar punctuation marks.

We need to be careful how we apply this.  There may well be some exceptions.
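
As a rough way to look for such exceptions, here is a minimal Python sketch. It assumes a file with one verse per line, and it assumes the punctuation marks in question are U+104A (၊) and U+104B (။). Neither assumption comes from the source files, and the helper names are hypothetical. It lists every non-blank line that does not end in one or two of those marks.

```python
# Assumption: these two code points are the "regular" Myanmar punctuation marks.
MYANMAR_PUNCT = {"\u104A", "\u104B"}  # LITTLE SECTION, SECTION


def trailing_punct(line: str) -> str:
    """Return the run of Myanmar punctuation at the end of a line,
    ignoring any trailing whitespace."""
    text = line.rstrip()
    end = len(text)
    while end > 0 and text[end - 1] in MYANMAR_PUNCT:
        end -= 1
    return text[end:]


def find_exceptions(lines):
    """Return (line_number, line) pairs for non-blank lines that do not
    end in exactly one or two punctuation marks."""
    exceptions = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped but still counted
        if not 1 <= len(trailing_punct(line)) <= 2:
            exceptions.append((number, line))
    return exceptions
```

Anything this reports would still need checking by eye, since a verse may legitimately end some other way.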

Windows users should install BabelPad. This free Unicode text editor is highly recommended.

http://www.babelstone.co.uk/Software/BabelPad.html

It will help in all sorts of ways, not least in analysis.
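
For anyone without Windows, a similar character frequency listing can be approximated in Python. This is only a sketch, not a description of how BabelPad works. The function names are hypothetical, and it treats just the main Myanmar block (U+1000 to U+109F) as "Myanmar", which is an assumption. It lists every other non-space character with its code point, count and Unicode name.

```python
from collections import Counter
import unicodedata


def is_myanmar(ch: str) -> bool:
    # Assumption: only the main Myanmar block counts; extended blocks are ignored.
    return 0x1000 <= ord(ch) <= 0x109F


def non_myanmar_frequencies(text: str):
    """Return (code point, character, count, name) rows for every
    non-space character outside the Myanmar block, in code point order."""
    counts = Counter(
        ch for ch in text if not ch.isspace() and not is_myanmar(ch)
    )
    rows = []
    for ch, n in sorted(counts.items(), key=lambda item: ord(item[0])):
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"{ord(ch):06X}", ch, n, name))
    return rows
```

Running this over the extracted text should make stray Latin characters left behind by a font conversion stand out.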

David

Sent from ProtonMail Mobile

On Wed, May 15, 2019 at 18:08, Cyrille <lafricain79 at gmail.com> wrote:

> I have not understood everything yet, but I trust you. If you are willing to explain it to me, I would like to learn :)
> What I don't understand is how you can find the marker for each verse and chapter in the UTF-8 text. What is the marker in question?
>
> On 15/05/2019 19:03, David Haslam wrote:
>
>> Michael’s description matches how I imagined the method during my waking moments this morning. :)
>>
>> David
>>
>> On Wed, May 15, 2019 at 17:33, Michael H <cmahte at gmail.com> wrote:
>>
>>> I've been working long hours and emailing in my break time.  David has the basics of converting to VPL.
>>>
>>> I would then make the entire work a column in a spreadsheet.
>>>
>>> Then in other columns insert a list of Book/chapter/verse references in order.
>>>
>>> The BCV and verse-text columns should align; this can be verified and adjusted where things don't match perfectly, for example if 3 John has 15 verses instead of 14.
>>>
>>> Once the columns align, you can merge them into another column via concatenation operations (&).  This last column becomes your output.
>>>
>>> The output needs to take into account that section titles and section ranges belong in front of the verse marker. That takes a somewhat more complex search and replace, but it can be done successfully.
>>>
>>> On Wed, May 15, 2019 at 11:12 AM David Haslam <dfhdfh at protonmail.com> wrote:
>>>
>>>> The attachment contains a counted list of Myanmar words containing a font conversion error.
>>>> NB. We need to match these words with what they are in the legacy font.
>>>>
>>>> This issue should be discussed with the current maintainer of the SIL TECkit converter, whoever that may be.
>>>>
>>>> It may be worthwhile asking our friends at the SIL Writing Systems Technology team. See https://scripts.sil.org/default
>>>>
>>>> Aside: My friend Martin Hosken of SIL knew the late Keith Stribley - the former webmaster of ThanLwinSoft.
>>>>
>>>> Best regards,
>>>>
>>>> David
>>>>
>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> On Wednesday, May 15, 2019 4:41 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>
>>>>> Observations: (continued)
>>>>>
>>>>> 5. The string "Kd;" also looks anomalous. It's found only once in
>>>>> ကိုယ်တော်၏ဦးခေါင်းတော်အပေါ်၌ လည်း ဤသူသည်ကား ဂျူးလူမျ Kd;တို့၏ဘုရင်၊
>>>>>
>>>>> 6. It's evident from the PDF file that the text is paragraphed with indented first lines. See
>>>>> https://www.dropbox.com/s/do5e675i19xfomf/Screenshot%202019-05-15%2016.29.10.png?dl=0
>>>>>
>>>>> My hunch is that these leading paragraph indents may have been coded within contents.xml as the self-closing element <text:tab/>. There are 372 matches to this.
>>>>>
>>>>> So not only do we need to provide chapter and verse tags (plus section headings & parallel passage titles, etc), we also need to reconstruct all the paragraph tags.
>>>>>
>>>>> NB. All structural XML indents were removed by the filter "Remove blanks at SOL" in the file contents.pp.txt that was output by my simple TextPipe filter. So that's quite a different matter.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> David
>>>>>
>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>> On Wednesday, May 15, 2019 2:22 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>
>>>>>> Observations: (continued)
>>>>>>
>>>>>> 4. In addition to the reported instances of the three anomalous characters (È, Ø, ò) found after the font conversion,
>>>>>> there are 6 instances of the string "m;" that are also probably due to bugs in the converter.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> David
>>>>>>
>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>> On Wednesday, May 15, 2019 12:41 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>>
>>>>>>> Yep - sure - later I can do that.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Wed, May 15, 2019 at 11:26, Cyrille <lafricain79 at gmail.com> wrote:
>>>>>>>
>>>>>>>> David, I don't have a Box account, and I don't want to create one. Can you upload it to https://framadrop.org/ instead? It's totally free, secure and private.
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>> On 15/05/2019 11:46, David Haslam wrote:
>>>>>>>>
>>>>>>>>> Interim progress report.
>>>>>>>>>
>>>>>>>>> I downloaded the file Mat_utf8.zip from Cyrille's link and unzipped the contents to Mat_utf8-odt
>>>>>>>>>
>>>>>>>>> I opened the .odt file using 7-Zip from the Windows Explorer context menu, and extracted the file contents.xml
>>>>>>>>>
>>>>>>>>> I used Notepad++ plug-in XMLTools to pretty print the XML file and saved it as contents.pp.xml
>>>>>>>>> This is simply a layout change that's easier to read.
>>>>>>>>>
>>>>>>>>> I viewed the .pp.xml file in BabelPad, which confirmed that the non-XML text was (mostly) Myanmar Unicode.
>>>>>>>>>
>>>>>>>>> I used a TextPipe filter to remove all XML tags, blanks from SOL & EOL and all blank lines.
>>>>>>>>> The output file is now contents.pp.txt
>>>>>>>>>
>>>>>>>>> This is now something that's readable content in Myanmar Unicode, with some English text such as "The Gospel according Matthew" near the start.
>>>>>>>>>
>>>>>>>>> The file is best viewed using BabelPad with the option Display Colours | Colour Code by Script.
>>>>>>>>> This shows Myanmar characters in light green, and non-Myanmar characters in other colours.
>>>>>>>>>
>>>>>>>>> Observations:
>>>>>>>>> 1. The font conversion to Unicode left a few scattered characters unconverted. :(
>>>>>>>>>
>>>>>>>>> 0000C8	È	18	LATIN CAPITAL LETTER E WITH GRAVE
>>>>>>>>> 0000D8	Ø	20	LATIN CAPITAL LETTER O WITH STROKE
>>>>>>>>> 0000F2	ò	3	LATIN SMALL LETTER O WITH GRAVE
>>>>>>>>>
>>>>>>>>> The complete character frequency analysis is attached.
>>>>>>>>>
>>>>>>>>> 2. A few verse numbers? are still present here and there.
>>>>>>>>> 3. The content contains section headings and parallel passage headings as well as verse text.
>>>>>>>>>
>>>>>>>>> I have just uploaded the file contents.pp.zip to a new folder in my Box account and added Cyrille & Michael as viewers.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>>>> On Monday, May 13, 2019 9:19 AM, Cyrille <lafricain79 at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> I recently received a modern Myanmar translation of the NT, Psalms and
>>>>>>>>>> Proverbs, with permission to create a new module.
>>>>>>>>>> But there are many problems. The first is getting the text.
>>>>>>>>>> I tried different ways, but it was done with PageMaker!
>>>>>>>>>> I can get the text, but I don't have the verse numbers,
>>>>>>>>>> because they are in a parallel column next to it, and when I copy it I get
>>>>>>>>>> only the biblical text.
>>>>>>>>>> I also have a PDF, but when I convert it to text (with pdftotext) the
>>>>>>>>>> columns are mixed up.
>>>>>>>>>> Can someone help me with any ideas?
>>>>>>>>>> The next problem is Unicode. The text is not typed in Unicode but uses
>>>>>>>>>> a special font.
>>>>>>>>>> I can send everything you need, or push it to git.crosswire.
>>>>>>>>>>
>>>>>>>>>> Thanks for your help.
>>>>>>>>>>
>>>>>>>>>> sword-devel mailing list:
>>>>>>>>>> sword-devel at crosswire.org
>>>>>>>>>>
>>>>>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>>>>> Instructions to unsubscribe/change your settings at above page