[sword-devel] Bible in Myanmar

Michael H cmahte at gmail.com
Wed May 15 09:33:10 MST 2019


I've been working long hours and emailing in my break time.  David has
covered the basics of converting to VPL (verse-per-line).

I would then put the entire work into a single spreadsheet column.

Then, in other columns, insert a list of book/chapter/verse (BCV)
references in order.

The BCV and verse-text columns should align; this can be verified, and
adjusted where things don't match perfectly, for example where 3 John has
15 verses instead of 14.

Once the columns align, you can merge them into another column via
concatenation operations (&).  This last column becomes your output.
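
For anyone who prefers a script to a spreadsheet, the same align-and-merge
idea can be sketched in Python. This is only an illustration under stated
assumptions: the book label "3John", the "Book C:V text" output layout, and
the helper names make_refs and merge_bcv are all hypothetical, not a format
required by any SWORD tool.

```python
def make_refs(book, verses_per_chapter):
    """Build 'Book C:V' labels from a list of verse counts, one per chapter."""
    return [
        f"{book} {chapter}:{verse}"
        for chapter, count in enumerate(verses_per_chapter, start=1)
        for verse in range(1, count + 1)
    ]


def merge_bcv(refs, verses):
    """Join each reference to its verse text, like '&' in a spreadsheet.

    A length mismatch is reported rather than silently truncated, because
    it means the two columns are not aligned yet.
    """
    if len(refs) != len(verses):
        raise ValueError(
            f"{len(refs)} references but {len(verses)} verse lines; align them first"
        )
    return [f"{ref} {text.strip()}" for ref, text in zip(refs, verses)]


# Hypothetical data: 3 John as one chapter of 14 verses (some editions have 15).
refs = make_refs("3John", [14])
verses = [f"verse text {i}" for i in range(1, 15)]
vpl = merge_bcv(refs, verses)
print(vpl[0])  # 3John 1:1 verse text 1
```

If the source text turns out to have 15 verses in 3 John, merge_bcv raises
an error until the verse count passed to make_refs is adjusted, which is the
scripted equivalent of eyeballing the two columns.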

The output needs to take into account that section titles and section
ranges belong in front of the verse marker. That takes a somewhat more
complex search and replace, but it can be done successfully.



On Wed, May 15, 2019 at 11:12 AM David Haslam <dfhdfh at protonmail.com> wrote:

> The attachment contains a counted list of Myanmar words containing a font
> conversion error.
> *NB. We need to match these words with what they are in the legacy font.*
>
> This issue should be discussed with the current maintainer of the SIL
> *TECkit* converter, whoever that may be.
>
> It may be worthwhile asking our friends at the SIL *Writing Systems
> Technology* team. See
> https://scripts.sil.org/default
>
> *Aside: My friend Martin Hosken of SIL knew the late Keith Stribley - the
> former webmaster of ThanLwinSoft.*
>
> Best regards,
>
> David
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, May 15, 2019 4:41 PM, David Haslam <dfhdfh at protonmail.com>
> wrote:
>
> *Observations: (continued)*
>
> 5. The string "*Kd;*" also looks anomalous. It's found only once in
> ကိုယ်တော်၏ဦးခေါင်းတော်အပေါ်၌ လည်း ဤသူသည်ကား ဂျူးလူမျ Kd;တို့၏ဘုရင်၊
>
> 6. It's evident from the PDF file that the text is paragraphed with
> indented first lines. See
>
> https://www.dropbox.com/s/do5e675i19xfomf/Screenshot%202019-05-15%2016.29.10.png?dl=0
>
> My hunch is that these leading paragraph indents may have been coded
> within contents.xml as the self-closing element *<text:tab/>*. There are
> 372 matches to this.
>
> So not only do we need to provide chapter and verse tags (plus section
> headings & parallel passage titles, etc), we also need to reconstruct all
> the paragraph tags.
>
> *NB. All structural XML indents were removed by the filter "Remove blanks
> at SOL" in the file **contents.pp.txt** that was output by my simple
> TextPipe filter. So that's quite a different matter.*
>
> Best regards,
>
> David
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, May 15, 2019 2:22 PM, David Haslam <dfhdfh at protonmail.com>
> wrote:
>
> *Observations: (continued)*
>
> 4. In addition to the reported instances of the anomalous 3 characters (
> *È,Ø,ò*) found after the font conversion,
> there are 6 instances of the string "*m;*" that are also probably due to
> bugs in the converter.
>
> Best regards,
>
> David
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, May 15, 2019 12:41 PM, David Haslam <dfhdfh at protonmail.com>
> wrote:
>
> Yep - sure - later I can do that.
>
> David
>
> Sent from ProtonMail Mobile
>
>
> On Wed, May 15, 2019 at 11:26, Cyrille <lafricain79 at gmail.com> wrote:
>
> David, I have no Box account, and I don't want to create one. Can you
> upload it to https://framadrop.org/ instead? It's totally free, secure
> and private. Thank you.
>
>
> On 15/05/2019 11:46, David Haslam wrote:
>
> Interim progress report.
>
> I downloaded the file Mat_utf8.zip from Cyrille's link and unzipped the contents to Mat_utf8-odt
>
> I opened the .odt file using 7-Zip from the Windows Explorer context menu, and extracted the file contents.xml
>
> I used Notepad++ plug-in XMLTools to pretty print the XML file and saved it as contents.pp.xml
> This is simply a layout change that's easier to read.
>
> I viewed the .pp.xml file in BabelPad, which confirmed that the non-XML text was (mostly) Myanmar Unicode.
>
> I used a TextPipe filter to remove all XML tags, blanks from SOL & EOL and all blank lines.
> The output file is now contents.pp.txt
>
> This is now something that's readable content in Myanmar Unicode, with some English text such as "The Gospel according Matthew" near the start.
>
> The file is best viewed using BabelPad with the option Display Colours | Colour Code by Script.
> This shows Myanmar characters in light green, and non-Myanmar characters in other colours.
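
The extraction steps above (unzip the .odt, take the XML, strip tags,
trim and drop blank lines) could also be scripted. The following is a
rough sketch, not a replacement for the 7-Zip and TextPipe workflow
described. Assumptions: the function name odt_to_lines is hypothetical;
the package member is taken to be "content.xml", which is the usual
name in an ODF package (the thread calls it contents.xml), so it is
left as a parameter; and one output line per text:p paragraph is a
choice made here for illustration.

```python
import io
import re
import zipfile
from html import unescape

TAG = re.compile(r"<[^>]+>")  # any XML tag, including self-closing ones


def odt_to_lines(odt_bytes, member="content.xml"):
    """Return the non-empty text lines of an ODT package, one per paragraph."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        xml = package.read(member).decode("utf-8")
    # Start a new line after every paragraph so tags can be stripped per line.
    xml = xml.replace("</text:p>", "</text:p>\n")
    stripped = (unescape(TAG.sub("", line)).strip() for line in xml.splitlines())
    return [line for line in stripped if line]


def _demo_package():
    """Build a tiny in-memory stand-in for an .odt file (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr(
            "content.xml",
            "<office:text><text:p>The Gospel</text:p>"
            "<text:p><text:tab/>ယေရှု</text:p><text:p/></office:text>",
        )
    return buffer.getvalue()


lines = odt_to_lines(_demo_package())
print(lines)
```

A regex tag stripper is crude but matches what the TextPipe filter is
described as doing; a real XML parser would be the safer choice if the
paragraph structure needs to be preserved.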
>
> Observations:
> 1. The font conversion to Unicode left a few scattered characters unconverted. :(
>
> 0000C8	È	18	LATIN CAPITAL LETTER E WITH GRAVE
> 0000D8	Ø	20	LATIN CAPITAL LETTER O WITH STROKE
> 0000F2	ò	3	LATIN SMALL LETTER O WITH GRAVE
>
> The complete character frequency analysis is attached.
>
> 2. A few verse numbers? are still present here and there.
> 3. The content contains section headings and parallel passage headings as well as verse text.
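
The tally of unconverted characters in observation 1 could be
reproduced roughly as follows. This is a minimal sketch, assuming that
"unconverted" means any non-ASCII character outside the Myanmar Unicode
blocks; the function names and the sample string are hypothetical. Note
that ASCII leftovers are deliberately skipped here, so this would not
catch purely ASCII debris.

```python
import unicodedata
from collections import Counter

# Myanmar, Myanmar Extended-B and Myanmar Extended-A Unicode blocks.
MYANMAR_RANGES = ((0x1000, 0x109F), (0xA9E0, 0xA9FF), (0xAA60, 0xAA7F))


def is_myanmar(ch):
    """True if the character lies in one of the Myanmar blocks."""
    return any(lo <= ord(ch) <= hi for lo, hi in MYANMAR_RANGES)


def stray_characters(text):
    """Count non-ASCII, non-Myanmar characters, sorted by code point."""
    counts = Counter(
        ch for ch in text if ord(ch) > 0x7F and not is_myanmar(ch)
    )
    return sorted(counts.items())


def report(text):
    """Format the tally as: code point, character, count, Unicode name."""
    return [
        f"{ord(ch):06X}\t{ch}\t{n}\t{unicodedata.name(ch, 'UNKNOWN')}"
        for ch, n in stray_characters(text)
    ]


sample = "ဂျူးÈ လူÈမျိုးØ"  # hypothetical text with leftover Latin-1 characters
for row in report(sample):
    print(row)
```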
>
> I have just uploaded the file contents.pp.zip to a new folder in my Box account and added Cyrille & Michael as viewers.
>
>
> Best regards,
>
> David
>
> Sent with ProtonMail Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, May 13, 2019 9:19 AM, Cyrille <lafricain79 at gmail.com> wrote:
>
>
> Hello,
> I recently received a modern Myanmar translation of the NT, Psalms and
> Proverbs, with permission to create a new module.
> But there are many problems. The first is getting the text.
> I tried different ways, but it was made with PageMaker!
> I can get the text, but I don't have the verse numbers, because they
> sit in a parallel column, and when I copy I get only the biblical
> text.
> I also have a PDF, but when I convert it to text (with pdftotext) the
> columns are mixed.
> Can someone help me with any ideas?
> The next problem is Unicode: the text is not typed in Unicode but uses
> a special font.
> I can send everything you need, or push it to git.crosswire.
>
> Thanks for help.
>
> sword-devel mailing list: sword-devel at crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page