[sword-devel] Bible in Myanmar

Wed May 15 11:43:16 MST 2019


On 15/05/2019 19:18, David Haslam wrote:
> Each of the last 1 or 2 characters of each verse is a regular Myanmar
> punctuation mark.
>
Do you know which mark?
> We need to be careful how we apply this.  There may well be
> some exceptions.
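
A minimal sketch of how the line endings could be checked, assuming the marks in question are U+104A (MYANMAR SIGN LITTLE SECTION) and U+104B (MYANMAR SIGN SECTION); that is a guess, not something confirmed in this thread. The function names are made up for illustration.

```python
from collections import Counter

# Assumption: the "regular Myanmar punctuation marks" are these two.
# U+104A MYANMAR SIGN LITTLE SECTION, U+104B MYANMAR SIGN SECTION
MYANMAR_PUNCT = {"\u104A", "\u104B"}


def trailing_marks(line, max_len=2):
    """Return the run of Myanmar punctuation (at most max_len chars) ending the line."""
    text = line.rstrip()
    end = len(text)
    start = end
    while start > 0 and end - start < max_len and text[start - 1] in MYANMAR_PUNCT:
        start -= 1
    return text[start:end]


def tally_endings(lines):
    """Count each trailing-punctuation pattern; the '' key collects the exceptions."""
    return Counter(trailing_marks(line) for line in lines if line.strip())
```

Running tally_endings over the lines of the text file would show which marks actually end the lines, and how many lines end with no mark at all and need a manual look.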
>
> Windows users should install BabelPad. This free Unicode text editor
> is highly recommended.
>
> http://www.babelstone.co.uk/Software/BabelPad.html
>
> It will help in all sorts of ways, not least in analysis.
>
> David
>
> Sent from ProtonMail Mobile
>
>
> On Wed, May 15, 2019 at 18:08, Cyrille <lafricain79 at gmail.com> wrote:
>> I have not understood everything yet, but I trust you. If you have
>> the patience to explain it to me, I would like to learn :)
>> What I don't understand is how you can find the marker for each verse
>> and chapter in the UTF-8 text. What is the marker in question?
>>
>> On 15/05/2019 19:03, David Haslam wrote:
>>> Michael’s description matches how I imagined the method
>>> during my waking moments this morning. :)
>>>
>>> David
>>>
>>>
>>> On Wed, May 15, 2019 at 17:33, Michael H <cmahte at gmail.com> wrote:
>>>> I've been working long hours and emailing in my break time.  David
>>>> has the basics of converting to VPL.  
>>>>
>>>> I would then make the entire work a column in a spreadsheet. 
>>>>
>>>> Then in other columns insert a list of Book/chapter/verse in order.
>>>>
>>>> The BCV and verse-text columns should align and can be verified,
>>>> and adjusted where things don't match perfectly, like maybe 3 John
>>>> has 15 instead of 14 verses.
>>>>
>>>> Once the columns align, you can merge them into another column via
>>>> concatenation operations (&).  This last column becomes your output. 
>>>>
>>>> The output needs to consider that section titles and section ranges
>>>> belong in front of the verse marker. That is a bit more complex
>>>> search and replace, but can be done successfully. 
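
The spreadsheet steps above could also be sketched in a few lines of Python. This is only an illustration under assumptions: the verse counts per chapter are supplied by hand (they are not taken from any versification file), the reference format "Book C:V" is arbitrary, and the function names are hypothetical.

```python
def bcv_references(book, verses_per_chapter):
    """Build the ordered list of references for one book.

    verses_per_chapter is a list such as [14] for a one-chapter book
    with 14 verses; these counts are an input you provide.
    """
    return [
        f"{book} {chapter}:{verse}"
        for chapter, count in enumerate(verses_per_chapter, start=1)
        for verse in range(1, count + 1)
    ]


def merge_vpl(references, verse_lines):
    """Join each reference with its verse text, like the '&' concatenation column."""
    if len(references) != len(verse_lines):
        # The two columns do not align; fix the counts or the text by hand first.
        raise ValueError(
            f"{len(references)} references but {len(verse_lines)} verse lines"
        )
    return [f"{ref} {text}" for ref, text in zip(references, verse_lines)]
```

The length check plays the role of the visual alignment check in the spreadsheet: a mismatch such as 14 versus 15 verses in 3 John is reported instead of silently shifting every later verse.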
>>>>
>>>>
>>>>
>>>> On Wed, May 15, 2019 at 11:12 AM David Haslam <dfhdfh at protonmail.com> wrote:
>>>>
>>>>     The attachment contains a counted list of Myanmar words
>>>>     containing a font conversion error.
>>>>     /NB. We need to match these words with what they are in the
>>>>     legacy font./
>>>>
>>>>     This issue should be discussed with the current maintainer of
>>>>     the SIL *TECkit* converter, whoever that may be.
>>>>
>>>>     It may be worthwhile asking our friends at the SIL *Writing
>>>>     Systems Technology* team. See
>>>>     https://scripts.sil.org/default
>>>>
>>>>     /Aside: My friend Martin Hosken of SIL knew the late Keith
>>>>     Stribley - the former webmaster of ThanLwinSoft./
>>>>
>>>>     Best regards,
>>>>
>>>>     David
>>>>
>>>>     Sent with ProtonMail <https://protonmail.com> Secure Email.
>>>>
>>>>     ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>     On Wednesday, May 15, 2019 4:41 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>
>>>>>     _*Observations*: (continued)_
>>>>>
>>>>>     5. The string "*Kd;*" also looks anomalous. It's found only
>>>>>     once in 
>>>>>     ကိုယ်တော်၏ဦးခေါင်းတော်အပေါ်၌ လည်း ဤသူသည်ကား ဂျူးလူမျ Kd;တို့၏ဘုရင်၊
>>>>>
>>>>>     6. It's evident from the PDF file that the text is paragraphed
>>>>>     with indented first lines. See 
>>>>>     https://www.dropbox.com/s/do5e675i19xfomf/Screenshot%202019-05-15%2016.29.10.png?dl=0
>>>>>
>>>>>     My hunch is that these leading paragraph indents may have been
>>>>>     coded within content.xml as the self-closing
>>>>>     element *<text:tab/>*. There are 372 matches to this.
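
A small sketch of how that hunch could be tested: count every <text:tab/> and, separately, only those that come straight after a paragraph's opening tag. The regular expression is a rough assumption about how the paragraphs are written in the XML, not a full ODF parser, and the function names are invented for the example.

```python
import re

TAB = "<text:tab/>"
# A <text:p ...> opening tag followed directly (ignoring whitespace) by a tab.
INDENTED_PARA = re.compile(r"<text:p\b[^>]*>\s*<text:tab/>")


def count_tabs(xml_text):
    """Count every occurrence of the self-closing tab element."""
    return xml_text.count(TAB)


def count_indented_paragraphs(xml_text):
    """Count paragraphs whose first child is a tab, i.e. likely first-line indents."""
    return len(INDENTED_PARA.findall(xml_text))
```

If the two counts are close, most tabs are paragraph indents; a large gap would mean the tabs are also used elsewhere inside the text.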
>>>>>
>>>>>     So not only do we need to provide chapter and verse tags (plus
>>>>>     section headings & parallel passage titles, etc), we also need
>>>>>     to reconstruct all the paragraph tags.
>>>>>
>>>>>     /NB. All structural XML indents were removed by the filter
>>>>>     "Remove blanks at SOL" in the file *contents.pp.txt* that
>>>>>     was output by my simple TextPipe filter. So that's quite a
>>>>>     different matter./
>>>>>
>>>>>     Best regards,
>>>>>
>>>>>     David
>>>>>
>>>>>     ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>     On Wednesday, May 15, 2019 2:22 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>
>>>>>>     _*Observations*: (continued)_
>>>>>>
>>>>>>     4. In addition to the reported instances of the anomalous 3
>>>>>>     characters (*È,Ø,ò*) found after the font conversion,
>>>>>>     there are 6 instances of the string "*m;*" that are
>>>>>>     also probably due to bugs in the converter.
>>>>>>
>>>>>>     Best regards,
>>>>>>
>>>>>>     David
>>>>>>
>>>>>>     ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>     On Wednesday, May 15, 2019 12:41 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>>>
>>>>>>>     Yep - sure - later I can do that. 
>>>>>>>
>>>>>>>     David
>>>>>>>
>>>>>>>
>>>>>>>     On Wed, May 15, 2019 at 11:26, Cyrille <lafricain79 at gmail.com> wrote:
>>>>>>>>     David, I have no Box account, and I do not want to create one.
>>>>>>>>     Could you upload it to https://framadrop.org/ instead? It's totally
>>>>>>>>     free, secure, and private.
>>>>>>>>     Thank you.
>>>>>>>>
>>>>>>>>
>>>>>>>>     On 15/05/2019 11:46, David Haslam wrote:
>>>>>>>>>     Interim progress report.
>>>>>>>>>
>>>>>>>>>     I downloaded the file Mat_utf8.zip from Cyrille's link and unzipped the contents to Mat_utf8-odt
>>>>>>>>>
>>>>>>>>>     I opened the .odt file using 7-Zip from the Windows Explorer context menu, and extracted the file content.xml
>>>>>>>>>
>>>>>>>>>     I used Notepad++ plug-in XMLTools to pretty print the XML file and saved it as contents.pp.xml
>>>>>>>>>     This is simply a layout change that's easier to read.
>>>>>>>>>
>>>>>>>>>     I viewed the .pp.xml file in BabelPad, which confirmed that the non-XML text was (mostly) Myanmar Unicode.
>>>>>>>>>
>>>>>>>>>     I used a TextPipe filter to remove all XML tags, blanks from SOL & EOL and all blank lines.
>>>>>>>>>     The output file is now contents.pp.txt
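
For anyone without 7-Zip, Notepad++ and TextPipe, the steps above can be approximated in Python. This is a rough sketch, not a reproduction of the TextPipe filter: it reads content.xml (the standard name of the body file inside an ODF package), and it assumes that each closing paragraph or heading tag should end a line. The function names are hypothetical.

```python
import re
import zipfile
from xml.sax.saxutils import unescape


def read_content_xml(odt_file):
    """Read content.xml from an .odt file (a path or a binary file object)."""
    with zipfile.ZipFile(odt_file) as archive:
        return archive.read("content.xml").decode("utf-8")


def strip_to_text(xml_text):
    """Remove all XML tags, blanks at start and end of line, and blank lines."""
    # Assumption: end each paragraph or heading with a newline so that
    # neighbouring paragraphs do not run together once the tags are gone.
    xml_text = re.sub(r"</text:(?:p|h)>", "\n", xml_text)
    no_tags = re.sub(r"<[^>]+>", "", xml_text)
    # Turn &amp;, &lt; and &gt; back into plain characters.
    plain = unescape(no_tags)
    lines = (line.strip() for line in plain.splitlines())
    return "\n".join(line for line in lines if line)
```

Typical use would be strip_to_text(read_content_xml("some-file.odt")), written out with encoding="utf-8".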
>>>>>>>>>
>>>>>>>>>     This is now something that's readable content in Myanmar Unicode, with some English text such as "The Gospel according Matthew" near the start.
>>>>>>>>>
>>>>>>>>>     The file is best viewed using BabelPad with the option Display Colours | Colour Code by Script.
>>>>>>>>>     This shows Myanmar characters in light green, and non-Myanmar characters in other colours.
>>>>>>>>>
>>>>>>>>>     Observations:
>>>>>>>>>     1. The font conversion to Unicode left a few scattered characters unconverted. :(
>>>>>>>>>
>>>>>>>>>     0000C8	È	18	LATIN CAPITAL LETTER E WITH GRAVE
>>>>>>>>>     0000D8	Ø	20	LATIN CAPITAL LETTER O WITH STROKE
>>>>>>>>>     0000F2	ò	3	LATIN SMALL LETTER O WITH GRAVE
>>>>>>>>>
>>>>>>>>>     The complete character frequency analysis is attached.
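
A frequency listing in the same column layout as the three lines above could be produced with a short script like this one. It is a sketch under assumptions: "Myanmar" here means only the basic block U+1000 to U+109F (the Extended-A and Extended-B blocks are not covered), whitespace is ignored, and the function names are made up. It is not how the attached analysis was generated.

```python
import unicodedata
from collections import Counter


def is_myanmar(ch):
    """True for characters in the basic Myanmar block (U+1000 to U+109F)."""
    return "\u1000" <= ch <= "\u109F"


def stray_character_report(text):
    """List every non-Myanmar, non-space character as: code point, char, count, name."""
    counts = Counter(
        ch for ch in text if not is_myanmar(ch) and not ch.isspace()
    )
    return [
        f"{ord(ch):06X}\t{ch}\t{n}\t{unicodedata.name(ch, 'UNKNOWN')}"
        for ch, n in sorted(counts.items())
    ]
```

The report also lists ordinary ASCII letters, digits and punctuation, so strings like "Kd;" or "m;" would show up through their individual characters.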
>>>>>>>>>
>>>>>>>>>     2. A few verse numbers? are still present here and there.
>>>>>>>>>     3. The content contains section headings and parallel passage headings as well as verse text.
>>>>>>>>>
>>>>>>>>>     I have just uploaded the file contents.pp.zip to a new folder in my Box account and added Cyrille & Michael as viewers.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     Best regards,
>>>>>>>>>
>>>>>>>>>     David
>>>>>>>>>
>>>>>>>>>     ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>>>>     On Monday, May 13, 2019 9:19 AM, Cyrille <lafricain79 at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>     Hello,
>>>>>>>>>>     I recently received a modern Myanmar translation of the NT, Psalms and
>>>>>>>>>>     Proverbs, with permission to create a new module.
>>>>>>>>>>     But there are many problems... The first is getting the text.
>>>>>>>>>>     I tried different ways, but it was typeset with PageMaker!
>>>>>>>>>>     I can get the text, but I don't get the verse numbers,
>>>>>>>>>>     because they sit in a parallel column, and when I copy it I get
>>>>>>>>>>     only the biblical text.
>>>>>>>>>>     I also have a PDF, but when I convert it to text (with pdftotext) the
>>>>>>>>>>     columns are mixed up.
>>>>>>>>>>     Can someone help me with any ideas?
>>>>>>>>>>     The next problem is Unicode... The text is not typed in Unicode but uses
>>>>>>>>>>     a special font.
>>>>>>>>>>     I can send everything you need or push it to git.crosswire.
>>>>>>>>>>
>>>>>>>>>>     Thanks for help.
>>>>>>>>>>
>>>>>>>>>>     sword-devel mailing list: sword-devel at crosswire.org
>>>>>>>>>>     http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>>>>>     Instructions to unsubscribe/change your settings at above page
