[sword-devel] Bible in Myanmar
David Haslam
dfhdfh at protonmail.com
Wed May 15 10:03:47 MST 2019
Michael’s description matches how I imagined the method during my waking moments this morning. :)
David
Sent from ProtonMail Mobile
On Wed, May 15, 2019 at 17:33, Michael H <cmahte at gmail.com> wrote:
> I've been working long hours and emailing in my break time. David has the basics of converting to VPL.
>
> I would then make the entire work a column in a spreadsheet.
>
> Then in other columns insert a list of Book/chapter/verse in order.
>
> The BCV and versetext columns should align and can be verified, and adjusted where things don't match perfectly, like maybe 3 John has 15 instead of 14 verses.
>
> Once the columns align, you can merge them into another column via concatenation operations (&). This last column becomes your output.
>
> The output needs to consider that section titles and section ranges belong in front of the verse marker. That is a bit more complex search and replace, but can be done successfully.
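
The spreadsheet method described above can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `merge_vpl`, the sample references, and the placeholder verse strings are invented here and are not part of the Myanmar text or of any existing tool. It pairs an ordered list of references with the verse lines and refuses to merge when the two columns have different lengths, which mirrors the manual alignment check.

```python
from itertools import zip_longest


def merge_vpl(references, verses):
    """Join parallel lists of references and verse text into VPL lines.

    Hypothetical helper: the equivalent of the spreadsheet formula A1 & " " & B1.
    """
    if len(references) != len(verses):
        # Report the first row where one column runs out, so it can be fixed by hand.
        for row, (ref, text) in enumerate(zip_longest(references, verses), start=1):
            if ref is None or text is None:
                raise ValueError(
                    f"columns diverge at row {row}: ref={ref!r}, text={text!r}"
                )
    return [f"{ref} {text}" for ref, text in zip(references, verses)]


# Placeholder data; real input would be the BCV list and the extracted verse lines.
refs = ["Matt 1:1", "Matt 1:2", "Matt 1:3"]
texts = ["verse text one", "verse text two", "verse text three"]
for line in merge_vpl(refs, texts):
    print(line)
```

A length check only catches a missing or extra row; a shifted verse in the middle of a chapter still has to be found by reading. Section titles and section ranges are not handled in this sketch and would have to be emitted in front of the verse marker.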
>
> On Wed, May 15, 2019 at 11:12 AM David Haslam <dfhdfh at protonmail.com> wrote:
>
>> The attachment contains a counted list of Myanmar words containing a font conversion error.
>> NB. We need to match these words with what they are in the legacy font.
>>
>> This issue should be discussed with the current maintainer of the SIL TECkit converter, whoever that may be.
>>
>> It may be worthwhile asking our friends at the SIL Writing Systems Technology team. See https://scripts.sil.org/default
>>
>> Aside: My friend Martin Hosken of SIL knew the late Keith Stribley - the former webmaster of ThanLwinSoft.
>>
>> Best regards,
>>
>> David
>>
>> Sent with [ProtonMail](https://protonmail.com) Secure Email.
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Wednesday, May 15, 2019 4:41 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>
>>> Observations: (continued)
>>>
>>> 5. The string "Kd;" also looks anomalous. It's found only once in
>>> ကိုယ်တော်၏ဦးခေါင်းတော်အပေါ်၌ လည်း ဤသူသည်ကား ဂျူးလူမျ Kd;တို့၏ဘုရင်၊
>>>
>>> 6. It's evident from the PDF file that the text is paragraphed with indented first lines. See
>>> https://www.dropbox.com/s/do5e675i19xfomf/Screenshot%202019-05-15%2016.29.10.png?dl=0
>>>
>>> My hunch is that these leading paragraph indents may have been coded within contents.xml as the self-closing element <text:tab/>. There are 372 matches to this.
>>>
>>> So not only do we need to provide chapter and verse tags (plus section headings & parallel passage titles, etc), we also need to reconstruct all the paragraph tags.
>>>
>>> NB. All structural XML indents were removed by the filter "Remove blanks at SOL" in the file contents.pp.txt that was output by my simple TextPipe filter. So that's quite a different matter.
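
One way to test the hunch about `<text:tab/>` is to count the elements and see how many of them open a paragraph. The sketch below assumes the standard ODF text namespace and assumes that an indent would appear as the first child of a `<text:p>` with no text before it; in a real file the tab could also sit inside a `<text:span>`, which this does not cover. The function name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Standard OpenDocument namespace for the text: prefix.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs_with_leading_tab(xml_text):
    """Return (total_tabs, indented_paragraphs) for an ODF content document."""
    root = ET.fromstring(xml_text)
    tab_tag = f"{{{TEXT_NS}}}tab"
    para_tag = f"{{{TEXT_NS}}}p"
    total_tabs = sum(1 for _ in root.iter(tab_tag))
    indented = 0
    for para in root.iter(para_tag):
        # Count a paragraph when it has no text before its first child
        # and that first child is <text:tab/>.
        children = list(para)
        if children and children[0].tag == tab_tag and not (para.text or "").strip():
            indented += 1
    return total_tabs, indented


# Invented sample: one indented paragraph and one paragraph with an inner tab.
sample = (
    f'<doc xmlns:text="{TEXT_NS}">'
    "<text:p><text:tab/>First paragraph</text:p>"
    "<text:p>Continuation<text:tab/>with an inner tab</text:p>"
    "</doc>"
)
print(paragraphs_with_leading_tab(sample))  # (2, 1)
```

If the two numbers are close for the real file, the tabs are probably paragraph indents; a large gap would suggest they serve some other purpose, such as separating columns.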
>>>
>>> Best regards,
>>>
>>> David
>>>
>>> Sent with [ProtonMail](https://protonmail.com) Secure Email.
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Wednesday, May 15, 2019 2:22 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>
>>>> Observations: (continued)
>>>>
>>>> 4. In addition to the reported instances of the anomalous 3 characters (È,Ø,ò) found after the font conversion,
>>>> there are 6 instances of the string "m;" that are also probably due to bugs in the converter.
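
Strings such as "m;" can be located with a simple scan for Latin-script runs on lines that otherwise contain Myanmar text. The following is a rough sketch, not the method actually used for the counts above: the function name, the regular expressions, and the sample lines (arbitrary Myanmar letters written as escapes) are assumptions.

```python
import re
from collections import Counter

# Any character from the main Myanmar block, U+1000..U+109F.
MYANMAR = re.compile(r"[\u1000-\u109F]")
# A run of ASCII or Latin-1 letters, optionally followed by a semicolon.
LATIN_RUN = re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+;?")


def find_residue(lines):
    """Count Latin-script fragments on lines that also contain Myanmar text."""
    counts = Counter()
    for line in lines:
        if MYANMAR.search(line):  # skip purely English lines such as titles
            counts.update(LATIN_RUN.findall(line))
    return counts


# Invented sample lines; the Myanmar letters are arbitrary placeholders.
sample_lines = [
    "\u1000\u102C Kd;\u1010",
    "The Gospel according Matthew",
    "\u1000\u00C8\u1001m;",
]
for fragment, count in sorted(find_residue(sample_lines).items()):
    print(repr(fragment), count)
```

Each fragment found this way still needs to be matched against the legacy-font text to decide what the converter should have produced.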
>>>>
>>>> Best regards,
>>>>
>>>> David
>>>>
>>>> Sent with [ProtonMail](https://protonmail.com) Secure Email.
>>>>
>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> On Wednesday, May 15, 2019 12:41 PM, David Haslam <dfhdfh at protonmail.com> wrote:
>>>>
>>>>> Yep - sure - later I can do that.
>>>>>
>>>>> David
>>>>>
>>>>> Sent from ProtonMail Mobile
>>>>>
>>>>> On Wed, May 15, 2019 at 11:26, Cyrille <lafricain79 at gmail.com> wrote:
>>>>>
>>>>>> David, I have no Box account, and I don't want to create one. Can you upload it to https://framadrop.org/ instead? It's totally free and secure (and private).
>>>>>> Thank you.
>>>>>>
>>>>>> On 15/05/2019 11:46, David Haslam wrote:
>>>>>>
>>>>>>> Interim progress report.
>>>>>>>
>>>>>>> I downloaded the file Mat_utf8.zip from Cyrille's link and unzipped the contents to Mat_utf8-odt
>>>>>>>
>>>>>>> I opened the .odt file using 7-Zip from the Windows Explorer context menu, and extracted the file contents.xml
>>>>>>>
>>>>>>> I used Notepad++ plug-in XMLTools to pretty print the XML file and saved it as contents.pp.xml
>>>>>>> This is simply a layout change that's easier to read.
>>>>>>>
>>>>>>> I viewed the .pp.xml file in BabelPad, which confirmed that the non-XML text was (mostly) Myanmar Unicode.
>>>>>>>
>>>>>>> I used a TextPipe filter to remove all XML tags, blanks from SOL & EOL and all blank lines.
>>>>>>> The output file is now contents.pp.txt
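
The manual steps so far (unpack the .odt, strip the XML tags, trim blanks at SOL and EOL, drop blank lines) can be approximated in Python with the standard library alone. This is a sketch under assumptions, not a replacement for the TextPipe filter: the function name is invented, the tag removal is a crude regular expression, and instead of pretty printing it breaks lines at closing `text:p` and `text:h` tags. Note that an ODF package normally stores the document body under the name `content.xml`, so that is the default member name here.

```python
import io
import re
import zipfile
from html import unescape


def odt_to_plain_text(odt_bytes, member="content.xml"):
    """Extract the body XML from an .odt package and reduce it to plain text lines."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        xml_text = package.read(member).decode("utf-8")
    # Start a new line at the end of every paragraph or heading,
    # so that paragraphs do not run together once the tags are gone.
    xml_text = re.sub(r"</text:(p|h)>", "\n", xml_text)
    text = unescape(re.sub(r"<[^>]+>", "", xml_text))  # crude tag removal
    lines = (line.strip() for line in text.splitlines())  # blanks at SOL and EOL
    return "\n".join(line for line in lines if line)  # drop blank lines


# Usage with a file on disk (the file name here is hypothetical):
# with open("Mat_utf8.odt", "rb") as handle:
#     print(odt_to_plain_text(handle.read()))
```

Because the tags are removed wholesale, verse numbers, headings, and body text all end up as undifferentiated lines, which is the same limitation noted in the observations that follow.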
>>>>>>>
>>>>>>> This is now something that's readable content in Myanmar Unicode, with some English text such as "The Gospel according Matthew" near the start.
>>>>>>>
>>>>>>> The file is best viewed using BabelPad with the option Display Colours | Colour Code by Script.
>>>>>>> This shows Myanmar characters in light green, and non-Myanmar characters in other colours.
>>>>>>>
>>>>>>> Observations:
>>>>>>> 1. The font conversion to Unicode left a few scattered characters unconverted. :(
>>>>>>>
>>>>>>> 0000C8 È 18 LATIN CAPITAL LETTER E WITH GRAVE
>>>>>>> 0000D8 Ø 20 LATIN CAPITAL LETTER O WITH STROKE
>>>>>>> 0000F2 ò 3 LATIN SMALL LETTER O WITH GRAVE
>>>>>>>
>>>>>>> The complete character frequency analysis is attached.
>>>>>>>
>>>>>>> 2. A few verse numbers? are still present here and there.
>>>>>>> 3. The content contains section headings and parallel passage headings as well as verse text.
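
The character frequency report in observation 1 can be reproduced approximately with `unicodedata` and a `Counter`. This is a sketch with assumed names and an assumed output layout (code point, character, count, Unicode name); it only treats the main Myanmar block U+1000..U+109F as Myanmar and ignores ASCII and whitespace, so it is narrower than a complete analysis.

```python
import unicodedata
from collections import Counter


def non_myanmar_frequencies(text):
    """Count non-ASCII characters that fall outside the main Myanmar block."""
    return Counter(
        ch
        for ch in text
        if not ch.isspace() and ord(ch) > 0x7F and not 0x1000 <= ord(ch) <= 0x109F
    )


def format_report(counts):
    """One line per character: code point, character, count, Unicode name."""
    return [
        f"{ord(ch):06X} {ch} {n} {unicodedata.name(ch, 'UNNAMED')}"
        for ch, n in sorted(counts.items())
    ]


# Invented sample: two Myanmar letters mixed with unconverted Latin-1 characters.
sample_text = "\u1000\u00C8\u1001\u00C8 \u00F2 abc"
for row in format_report(non_myanmar_frequencies(sample_text)):
    print(row)
```

Run over the extracted text file, a report like this lists every stray character with its count, which gives a quick measure of how much the font converter missed.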
>>>>>>>
>>>>>>> I have just uploaded the file contents.pp.zip to a new folder in my Box account and added Cyrille & Michael as viewers.
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> Sent with ProtonMail Secure Email.
>>>>>>>
>>>>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>>>>> On Monday, May 13, 2019 9:19 AM, Cyrille <lafricain79 at gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>> I recently received a modern Myanmar translation of the NT, Psalms and
>>>>>>>> Proverbs, with permission to create a new module.
>>>>>>>> But the problems are many... The first is getting the text.
>>>>>>>> I tested different ways, but it was made with PageMaker!
>>>>>>>> I can get the text, but the problem is that I don't get the verse numbers,
>>>>>>>> because they sit in a parallel column, and when I copy it I get
>>>>>>>> only the biblical text.
>>>>>>>> I also have a PDF, but when I convert it to text (with pdftotext) the
>>>>>>>> columns are mixed.
>>>>>>>> Can someone help me with any ideas?
>>>>>>>> The next problem is Unicode... The text is not typed in Unicode but uses
>>>>>>>> a special font.
>>>>>>>> I can send everything you need or push it to git.crosswire.
>>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>>
>>>>>>>> sword-devel mailing list:
>>>>>>>> sword-devel at crosswire.org
>>>>>>>>
>>>>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>>> Instructions to unsubscribe/change your settings at above page
>>>>>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.crosswire.org/pipermail/sword-devel/attachments/20190515/9c53ad05/attachment-0001.html>