[sword-devel] Bible in Myanmar
Cyrille
lafricain79 at gmail.com
Wed May 15 11:43:16 MST 2019
On 15/05/2019 19:18, David Haslam wrote:
> Each of the last 1 or 2 characters of each verse is a regular Myanmar
> punctuation mark.
>
Do you know which mark?
> We need to be careful how we apply this. There may well be
> some exceptions.
>
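One way to see which marks are involved, and how many exceptions there are, is to tally the last one or two characters of every line. The Python below is only a minimal sketch, assuming the text has already been split into one verse per line; the names `tally_line_endings` and `describe` and the sample strings are made up for illustration. The marks one would expect to see are U+104A (MYANMAR SIGN LITTLE SECTION) and U+104B (MYANMAR SIGN SECTION), but that is an expectation to check against the data, not a finding from this text.

```python
import unicodedata
from collections import Counter


def tally_line_endings(lines, width=2):
    """Count the trailing punctuation (up to `width` chars) of each non-empty line."""
    tally = Counter()
    for line in lines:
        text = line.rstrip()
        if not text:
            continue
        # Walk back from the end while the characters are punctuation (category P*).
        tail = ""
        for ch in reversed(text[-width:]):
            if unicodedata.category(ch).startswith("P"):
                tail = ch + tail
            else:
                break
        tally[tail or "<no punctuation>"] += 1
    return tally


def describe(tail):
    """Spell out each character of an ending as 'U+XXXX NAME'."""
    return ", ".join(f"U+{ord(c):04X} {unicodedata.name(c, '?')}" for c in tail)


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the real text.
    sample = ["\u1000\u102D\u102F\u104B", "\u1000\u102C\u104A", "Matthew"]
    for tail, count in tally_line_endings(sample).most_common():
        label = tail if tail.startswith("<") else describe(tail)
        print(count, label)
```

Lines that land in the `<no punctuation>` bucket, or that end in an unexpected mark, are the exceptions to look at by hand.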
> Windows users should install BabelPad. This free Unicode text editor
> is highly recommended.
>
> http://www.babelstone.co.uk/Software/BabelPad.html
>
> It will help in all sorts of ways, not least in analysis.
>
> David
>
> Sent from ProtonMail Mobile
>
>
> On Wed, May 15, 2019 at 18:08, Cyrille <lafricain79 at gmail.com
> <mailto:lafricain79 at gmail.com>> wrote:
>> I haven't understood everything yet, but I trust you. If you have the
>> patience to explain it to me, I'd like to learn :)
>> What I don't understand is how you find the marker for each verse and
>> chapter in the UTF-8 text. What is the marker in question?
>>
>> On 15/05/2019 19:03, David Haslam wrote:
>>> Michael’s description matches how I imagined the method
>>> during my waking moments this morning. :)
>>>
>>> David
>>>
>>>
>>> On Wed, May 15, 2019 at 17:33, Michael H <cmahte at gmail.com
>>> <mailto:cmahte at gmail.com>> wrote:
>>>> I've been working long hours and emailing in my break time. David
>>>> has the basics of converting to VPL.
>>>>
>>>> I would then make the entire work a column in a spreadsheet.
>>>>
>>>> Then in other columns insert a list of Book/chapter/verse in order.
>>>>
>>>> The BCV and versetext columns should align and can be verified,
>>>> and adjusted where things don't match perfectly, like maybe 3 John
>>>> has 15 instead of 14 verses.
>>>>
>>>> Once the columns align, you can merge them into another column via
>>>> concatenation operations (&). This last column becomes your output.
>>>>
>>>> The output needs to consider that section titles and section ranges
>>>> belong in front of the verse marker. That is a bit more complex
>>>> search and replace, but can be done successfully.
>>>>
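The spreadsheet steps above (a column of references, a column of verse text, then concatenation) can also be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `chapter_refs` and `build_vpl`, the reference format `3John 1:1`, and the sample verse texts are all hypothetical, and section titles are not handled at all.

```python
def chapter_refs(book, verses_per_chapter):
    """Expand e.g. ("3John", [15]) into ["3John 1:1", ..., "3John 1:15"]."""
    return [
        f"{book} {chapter}:{verse}"
        for chapter, count in enumerate(verses_per_chapter, start=1)
        for verse in range(1, count + 1)
    ]


def build_vpl(refs, verse_texts):
    """Join each reference to its verse text, refusing to guess on a mismatch."""
    if len(refs) != len(verse_texts):
        raise ValueError(
            f"{len(refs)} references but {len(verse_texts)} text lines; "
            "fix the alignment before merging"
        )
    return [f"{ref} {text.strip()}" for ref, text in zip(refs, verse_texts)]


if __name__ == "__main__":
    refs = chapter_refs("3John", [3])  # shortened verse count, just for the demo
    texts = ["first verse", "second verse", "third verse"]
    for line in build_vpl(refs, texts):
        print(line)
```

The length check plays the role of eyeballing the two columns: a book that has 15 verses in this translation but 14 in the reference list fails loudly instead of shifting every later verse by one.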
>>>>
>>>>
>>>> On Wed, May 15, 2019 at 11:12 AM David Haslam
>>>> <dfhdfh at protonmail.com <mailto:dfhdfh at protonmail.com>> wrote:
>>>>
>>>> The attachment contains a counted list of Myanmar words
>>>> containing a font conversion error.
>>>> /NB. We need to match these words with what they are in the
>>>> legacy font./
>>>>
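A counted list like the one in the attachment could be rebuilt with something along these lines. It is a rough sketch, not the method actually used: the function name `suspect_words` and the choice of "suspect" characters (ASCII letters, Latin-1 letters and the semicolon) are assumptions. "Word" here just means a whitespace-delimited chunk, which is a loose notion for Myanmar text.

```python
import re
from collections import Counter

# Assumption: leftovers from the legacy font show up as ASCII letters,
# Latin-1 characters (U+00C0..U+00FF) or ';' inside otherwise Myanmar text.
SUSPECT = re.compile(r"[A-Za-z\u00C0-\u00FF;]")


def has_myanmar(word):
    """True if the word contains a character from the Myanmar block U+1000..U+109F."""
    return any("\u1000" <= ch <= "\u109F" for ch in word)


def suspect_words(text):
    """Count whitespace-delimited chunks mixing Myanmar with suspect characters."""
    counts = Counter()
    for word in text.split():
        if has_myanmar(word) and SUSPECT.search(word):
            counts[word] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; pure-English words such as "Matthew" are ignored.
    sample = "\u1000\u1001 Kd;\u1010\u102D \u1000\u00C8\u1001 Kd;\u1010\u102D Matthew"
    for word, count in suspect_words(sample).most_common():
        print(count, word)
```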
>>>> This issue should be discussed with the current maintainer of
>>>> the SIL *TECkit* converter, whoever that may be.
>>>>
>>>> It may be worthwhile asking our friends at the SIL *Writing
>>>> Systems Technology* team. See
>>>> https://scripts.sil.org/default
>>>>
>>>> /Aside: My friend Martin Hosken of SIL knew the late Keith
>>>> Stribley - the former webmaster of ThanLwinSoft./
>>>>
>>>> Best regards,
>>>>
>>>> David
>>>>
>>>> Sent with ProtonMail <https://protonmail.com> Secure Email.
>>>>
>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> On Wednesday, May 15, 2019 4:41 PM, David Haslam
>>>> <dfhdfh at protonmail.com <mailto:dfhdfh at protonmail.com>> wrote:
>>>>
>>>>> _*Observations*: (continued)_
>>>>>
>>>>> 5. The string "*Kd;*" also looks anomalous. It's found only
>>>>> once in
>>>>> ကိုယ်တော်၏ဦးခေါင်းတော်အပေါ်၌ လည်း ဤသူသည်ကား ဂျူးလူမျ Kd;တို့၏ဘုရင်၊
>>>>>
>>>>> 6. It's evident from the PDF file that the text is paragraphed
>>>>> with indented first lines. See
>>>>> https://www.dropbox.com/s/do5e675i19xfomf/Screenshot%202019-05-15%2016.29.10.png?dl=0
>>>>>
>>>>> My hunch is that these leading paragraph indents may have been
>>>>> coded within contents.xml as the self-closing
>>>>> element *<text:tab/>*. There are 372 matches to this.
>>>>>
>>>>> So not only do we need to provide chapter and verse tags (plus
>>>>> section headings & parallel passage titles, etc), we also need
>>>>> to reconstruct all the paragraph tags.
>>>>>
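The hunch about `<text:tab/>` can be checked by counting how many of the tabs sit at the very start of a paragraph. A minimal Python sketch follows, assuming the standard ODF text namespace; the function name `count_tabs` and the sample XML are made up for illustration, and a tab wrapped inside a `text:span` would not be counted as leading by this simple test.

```python
import xml.etree.ElementTree as ET

# Standard OpenDocument namespace for text:p, text:tab, etc.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
TAB = f"{{{TEXT_NS}}}tab"
PARA = f"{{{TEXT_NS}}}p"

# Hypothetical fragment standing in for the real XML file.
SAMPLE = (
    '<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    'xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    "<text:p><text:tab/>Indented paragraph.</text:p>"
    "<text:p>No indent<text:tab/>here.</text:p>"
    "</office:text>"
)


def count_tabs(xml_source):
    """Return (all text:tab elements, paragraphs that begin with a text:tab)."""
    root = ET.fromstring(xml_source)
    total = sum(1 for _ in root.iter(TAB))
    leading = 0
    for para in root.iter(PARA):
        children = list(para)
        # "Leading" means: first child is a tab and no text comes before it.
        if children and children[0].tag == TAB and not (para.text or "").strip():
            leading += 1
    return total, leading


if __name__ == "__main__":
    total, leading = count_tabs(SAMPLE)
    print(f"{total} tabs, {leading} at the start of a paragraph")
```

If the two numbers come out close to each other on the real file, the tabs are a reasonable basis for reconstructing the paragraph tags.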
>>>>> /NB. All structural XML indents were removed by the filter
>>>>> "Remove blanks at SOL" in the file *contents.pp.txt* that
>>>>> was output by my simple TextPipe filter. So that's quite a
>>>>> different matter./
>>>>>
>>>>> Best regards,
>>>>>
>>>>> David
>>>>>
>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>> On Wednesday, May 15, 2019 2:22 PM, David Haslam
>>>>> <dfhdfh at protonmail.com <mailto:dfhdfh at protonmail.com>> wrote:
>>>>>
>>>>>> _*Observations*: (continued)_
>>>>>>
>>>>>> 4. In addition to the reported instances of the anomalous 3
>>>>>> characters (*È,Ø,ò*) found after the font conversion,
>>>>>> there are 6 instances of the string "*m;*" that are
>>>>>> also probably due to bugs in the converter.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> David
>>>>>>
>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>> On Wednesday, May 15, 2019 12:41 PM, David Haslam
>>>>>> <dfhdfh at protonmail.com <mailto:dfhdfh at protonmail.com>> wrote:
>>>>>>
>>>>>>> Yep - sure - later I can do that.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 15, 2019 at 11:26, Cyrille
>>>>>>> <lafricain79 at gmail.com <mailto:lafricain79 at gmail.com>> wrote:
>>>>>>>> David, I have no Box account, and I don't want to create one.
>>>>>>>> Can you upload it to https://framadrop.org/ instead? It's totally
>>>>>>>> free, secure, and private.
>>>>>>>> Thank you.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 15/05/2019 11:46, David Haslam wrote:
>>>>>>>>> Interim progress report.
>>>>>>>>>
>>>>>>>>> I downloaded the file Mat_utf8.zip from Cyrille's link and unzipped the contents to Mat_utf8-odt
>>>>>>>>>
>>>>>>>>> I opened the .odt file using 7-Zip from the Windows Explorer context menu, and extracted the file contents.xml
>>>>>>>>>
>>>>>>>>> I used Notepad++ plug-in XMLTools to pretty print the XML file and saved it as contents.pp.xml
>>>>>>>>> This is simply a layout change that's easier to read.
>>>>>>>>>
>>>>>>>>> I viewed the .pp.xml file in BabelPad, which confirmed that the non-XML text was (mostly) Myanmar Unicode.
>>>>>>>>>
>>>>>>>>> I used a TextPipe filter to remove all XML tags, blanks from SOL & EOL and all blank lines.
>>>>>>>>> The output file is now contents.pp.txt
>>>>>>>>>
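For anyone without 7-Zip, Notepad++ and TextPipe, roughly the same result can be had from a short Python script. This is a sketch of a different technique, not a reproduction of the steps above: it reads the XML straight out of the .odt package and emits one line per paragraph or heading, rather than stripping tags line by line from a pretty-printed file. It assumes the body is stored under the standard ODF name `content.xml`; the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# Paragraphs and headings are treated as the line-producing elements.
BLOCK_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def xml_to_lines(xml_source):
    """Return the text of each paragraph/heading, trimmed, skipping empty ones."""
    root = ET.fromstring(xml_source)
    lines = []
    for elem in root.iter():
        if elem.tag in BLOCK_TAGS:
            text = "".join(elem.itertext()).strip()
            if text:
                lines.append(text)
    return lines


def odt_text_lines(odt_path):
    """Open an .odt file (a zip archive) and extract its text lines."""
    with zipfile.ZipFile(odt_path) as package:
        return xml_to_lines(package.read("content.xml"))


if __name__ == "__main__":
    import sys

    # Usage: python extract.py input.odt > output.txt
    for line in odt_text_lines(sys.argv[1]):
        print(line)
```

Note that `itertext()` drops elements such as tabs silently, so any structure carried by markup has to be looked at in the XML itself.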
>>>>>>>>> This is now something that's readable content in Myanmar Unicode, with some English text such as "The Gospel according Matthew" near the start.
>>>>>>>>>
>>>>>>>>> The file is best viewed using BabelPad with the option Display Colours | Colour Code by Script.
>>>>>>>>> This shows Myanmar characters in light green, and non-Myanmar characters in other colours.
>>>>>>>>>
>>>>>>>>> Observations:
>>>>>>>>> 1. The font conversion to Unicode left a few scattered characters unconverted. :(
>>>>>>>>>
>>>>>>>>> 0000C8 È 18 LATIN CAPITAL LETTER E WITH GRAVE
>>>>>>>>> 0000D8 Ø 20 LATIN CAPITAL LETTER O WITH STROKE
>>>>>>>>> 0000F2 ò 3 LATIN SMALL LETTER O WITH GRAVE
>>>>>>>>>
>>>>>>>>> The complete character frequency analysis is attached.
>>>>>>>>>
>>>>>>>>> 2. A few verse numbers (?) are still present here and there.
>>>>>>>>> 3. The content contains section headings and parallel passage headings as well as verse text.
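A character frequency analysis in the same layout as the three lines under observation 1 (code point, character, count, Unicode name) can be produced with the Python standard library. The sketch below is an assumption about how such a report might be generated, not the tool actually used; the function names are invented for illustration.

```python
import unicodedata
from collections import Counter


def char_frequency(text):
    """Count every non-whitespace character, sorted by code point."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(counts.items(), key=lambda item: ord(item[0]))


def format_report(text):
    """One line per character: 'CODEPT char count UNICODE NAME'."""
    return [
        f"{ord(ch):06X} {ch} {count} {unicodedata.name(ch, '<unnamed>')}"
        for ch, count in char_frequency(text)
    ]


def outside_myanmar(text):
    """Only the entries outside the Myanmar block U+1000..U+109F."""
    return [
        (ch, count)
        for ch, count in char_frequency(text)
        if not "\u1000" <= ch <= "\u109F"
    ]


if __name__ == "__main__":
    # Hypothetical sample: two Myanmar letters with stray Latin-1 characters.
    sample = "\u1000\u00C8\u1001 \u00C8\u00F2"
    print("\n".join(format_report(sample)))
```

Filtering with `outside_myanmar` narrows the report to the characters that are candidates for unconverted leftovers.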
>>>>>>>>>
>>>>>>>>> I have just uploaded the file contents.pp.zip to a new folder in my Box account and added Cyrille & Michael as viewers.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>>>> On Monday, May 13, 2019 9:19 AM, Cyrille <lafricain79 at gmail.com> <mailto:lafricain79 at gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>> I recently received a modern Myanmar translation of the NT, Psalms and
>>>>>>>>>> Proverbs, with permission to create a new module.
>>>>>>>>>> But there are many problems. The first is getting the text.
>>>>>>>>>> I tried different ways, but it was made with PageMaker!
>>>>>>>>>> I can get the text, but I don't get the verse numbers,
>>>>>>>>>> because they sit in a parallel column, and when I copy it I get
>>>>>>>>>> only the biblical text.
>>>>>>>>>> I also have a PDF, but when I convert it to text (with pdftotext) the
>>>>>>>>>> columns are mixed up.
>>>>>>>>>> Can someone help me with any ideas?
>>>>>>>>>> The next problem is Unicode: the text is not typed in Unicode but uses
>>>>>>>>>> a special font.
>>>>>>>>>> I can send everything you need, or push it to git.crosswire.
>>>>>>>>>>
>>>>>>>>>> Thanks for help.
>>>>>>>>>>
>>>>>>>>>> sword-devel mailing list: sword-devel at crosswire.org <mailto:sword-devel at crosswire.org>
>>>>>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>>>>> Instructions to unsubscribe/change your settings at above page
>>>>>>>>>>