[sword-devel] Bible in Myanmar

Tue May 14 06:08:10 MST 2019

I am sending my message again because the first one was too big.

The conversion to UTF-8 is 99% solved! I used an online converter:
https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
or:
http://burglish.my-mm.org/latest/trunk/web/fontconv.htm

See the result here
<https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.

Now the only problem is how to get the verse and chapter number...

On 14/05/2019 13:53, Michael H wrote:
> Cyrille, (Peter), 
>
> Maybe further discussion on this belongs in Gitlab as issues.  Can I
> get added to this project? 
>
> Here are the first few lines of Matthew copied from the PDF: 
> ------
> &Sifrmaw;OD; {0Ha*vdusrf;
> The Gospel According to Matthew
> ed'gef;
> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d tmvaf
> z;O;D \om;jzp\f / (rmu k2;14)
> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf wG U Ny;D
>
> -----
> And here are the first few lines of Matthew copied from the Pagemaker
> file: 
> -----
> Sifrmaw;OD; {0Ha*vdusrf;
> The Gospel According to Matthew
> ed'gef;
> usrf;�yyk*�dKvf  &Sifrmaw;OD;\b0rSwfwrf;  
> usrf;�yyk*�dKvf  &Sifrmaw;OD;onf  *gavav;,e,frS *sL;vlrsKd;
> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf trIxrf;chJonf/
> (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD  ol\trnfrSm av0djzpf\/
> olonf  wdab;&d,tkdifteD;wGif  a,Zl;ocifESifhawGU  NyD;
>
>
> You can see that some letters have changed, and some others are in a
> different order. 
>
> The letters that change are likely those points that aren't compatible
> with unicode, and pagemaker reassigned them to ensure that the file is
> more widely viewable. Since a conversion is already planned, these
> won't matter as much, but the font embedded in the PDF is different
> than the font attached to the pagemaker file. If you do start from
> the PDF, you'll need to extract the font to get the code points. 
>
> The problem is that the PDF export from pagemaker sorts the letters
> into the order they appear on the page.  Burmese text has Indian style
> ligatures, where vowels tend to jump over or under the previous
> letters, sometimes back two or three letters. If you study the
> snippets above from the beginning of Matthew, you can see that the
> order differs and that some glyphs are modified. 
>
> So, from the PDF letters are out of order, but from Pagemaker, letters
> are encoded into control points. Fixing the control points is easy and
> happens with the unicode conversion.  Fixing the letter order is not
> easy. You'll need a first language speaker and plenty of time. 
>
> The guidance I received on another group was to use either LO Draw or
> Indesign to export the text from Pagemaker.  I'll look into LO Draw
> again, but I don't have access to an older version of Indesign (the
> pagemaker import was removed in CS6). 
>
>
> On Mon, May 13, 2019 at 10:40 AM Michael H <cmahte at gmail.com
> <mailto:cmahte at gmail.com>> wrote:
>
>     I unzipped the pagemaker file, and when I open
>     NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I can 'find' all
>     of the book names, and see the text there.  
>
>     To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip and
>     open it with a zip archive program.  The text is in the Pagemaker
>     file at the top level of the archive, but encoded with a lot of
>     extraneous information.  (The English text "Matthew" appears at
>     hex location 7A76972). 
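
A minimal sketch of the same search without a hex editor: it reads the file as raw bytes and reports every offset at which an ASCII string occurs. The function name find_offsets is hypothetical, and the sketch assumes the book names are stored as plain ASCII bytes, as the hex editor view suggests.

```python
def find_offsets(data: bytes, needle: bytes) -> list[int]:
    """Return every byte offset at which needle occurs in data."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        # Continue searching one byte past the previous hit.
        start = data.find(needle, start + 1)
    return offsets


# Hypothetical usage against the file named in this thread:
#
#     with open("NT_Proverb.pmd", "rb") as handle:
#         data = handle.read()
#     for offset in find_offsets(data, b"Matthew"):
#         print(hex(offset))
```

Printing the offsets in hex makes them directly comparable with what a hex editor shows.
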
>
>     When I open the fonts with fontforge, Fontforge suggests the fonts
>     are encoded as unicode (but the glyphs are obviously not in the
>     right spot.) 
>     However when I copy the text (I copied from LO Draw) and paste it
>     into jedit and save that as unicode: Reopening the file has a
>     warning 'not unicode, text may be missing'. 
>
>     So, what this means is that there are some glyphs encoded into
>     locations that unicode treats as control or non-printing codes.
>     The text needs to be dealt with as a specific encoding that
>     matches whatever the original font actually uses. I haven't
>     figured out what the original text files were encoded with.
>     Without that knowledge, I'm not sure my system clipboard or editor
>     (jedit) will properly respect the glyphs in unusual locations
>     until the conversion to unicode, and I don't trust myself to be
>     able to detect if it is or is not properly converted. 
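
One way to check this without relying on the editor is to list the characters in the pasted text that Unicode classifies as control, unassigned, private-use or surrogate code points. This is only a sketch: the function name suspicious_chars is hypothetical, and it only flags candidates; it cannot tell which legacy glyph each one was meant to be.

```python
import unicodedata

# General categories that should not appear in ordinary running text:
# Cc = control, Cn = unassigned, Co = private use, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}


def suspicious_chars(text: str) -> dict[str, int]:
    """Count characters whose Unicode category suggests a legacy-font glyph slot."""
    counts: dict[str, int] = {}
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary line breaks and tabs are expected
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            counts[ch] = counts.get(ch, 0) + 1
    return counts
```

An empty result does not prove the text is correctly converted, but a non-empty one shows which code points need a mapping before conversion.
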
>
>     On Mon, May 13, 2019 at 10:11 AM Cyrille <lafricain79 at gmail.com
>     <mailto:lafricain79 at gmail.com>> wrote:
>
>         David,
>         You are probably right about TECkit
>         <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>;
>         once we get the text, it will help us convert it to Unicode.
>         As for how to get the text, your method is beyond my skills :)
>         If you succeed, please let me know.
>
>         On 13/05/2019 16:21, David Haslam wrote:
>>         Given the insights from Michael Hart, it may be feasible to
>>         temporarily rearrange the main text stream as follows :
>>
>>         1. Replace every EOL by a horizontal tab. 
>>         2. Insert an EOL after each verse end character. 
>>
>>         Observe that the above two steps are wholly reversible such
>>         that the original text stream can be restored later. 
>>
>>         In effect the text stream is now in verse per line (VPL)
>>         layout, albeit without verse tags. Some adjustments may be
>>         necessary if there are any section headings, etc. 
>>
>>         3. Add line numbers with the first number being reset to 1 at
>>         the start of each chapter, numbers incrementing by 1 for each
>>         line. 
>>         4. Add a left margin USFM verse tag \v_
>>
>>         Steps 3&4 can be implemented in various ways. For my part,
>>         I’d use a bespoke TextPipe filter. 
>>
>>         Another method to consider might be to use Excel formulae. I
>>         recall resorting to such a method in the early days of Go Bible. 
>>
>>         Now restore the original layout by reverting steps 2 & 1, if
>>         this is really necessary. That is, if the original text
>>         layout appeared to be paragraphed. 
>>
>>         5. Decide how & where to insert paragraph tags. 
>>
>>         6. Add chapter tags, book ID and main title tags, etc. 
>>
>>         Hope this gives some useful suggestions that point towards a
>>         practical solution. 
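
Steps 1 to 4 above can be sketched in a few lines of Python. The function names are hypothetical, the verse-end mark is passed in as a parameter because it still has to be confirmed, and the sketch assumes the source text contains no tab characters and covers a single chapter (numbering would have to restart at each chapter boundary).

```python
def to_vpl(text: str, verse_end: str) -> str:
    """Steps 1-2: turn every EOL into a tab, then break after each verse-end mark."""
    flat = text.replace("\n", "\t")
    return flat.replace(verse_end, verse_end + "\n")


def from_vpl(vpl: str, verse_end: str) -> str:
    """Revert steps 2 and 1, restoring the original line layout."""
    return vpl.replace(verse_end + "\n", verse_end).replace("\t", "\n")


def tag_verses(vpl: str) -> list[str]:
    """Steps 3-4: number the non-empty lines from 1 and prefix a USFM \\v tag."""
    lines = [line.lstrip() for line in vpl.split("\n") if line.strip()]
    return [f"\\v {number} {line}" for number, line in enumerate(lines, start=1)]
```

Because to_vpl only adds line breaks directly after a verse-end mark, from_vpl can undo it exactly, which matches the point that the first two steps are reversible.
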
>>
>>         Best regards 
>>
>>         David
>>
>>
>>         Sent from ProtonMail Mobile
>>
>>
>>         On Mon, May 13, 2019 at 14:57, Michael H <cmahte at gmail.com
>>         <mailto:cmahte at gmail.com>> wrote:
>>>         Cyrille
>>>
>>>         LibreOffice Draw attempts to open the pagemaker file, with
>>>         limited success. But it confirms that even in the pagemaker
>>>         source, the verse numbers are a separate text stream. With
>>>         this source, there is no way to copy the text with verse
>>>         numbers intact. Each book appears to be stored in its own
>>>         text stream in the Pagemaker file. LO Draw isn't rendering
>>>         all of the pages, only the first 10, so I've only explored
>>>         Matthew further. 
>>>
>>>         Based on Matthew only, the verses seem to all end with the
>>>         character "-" or ";/", which should aid in the
>>>         reconstruction. I've looked through the PDF and this seems
>>>         to be the case for all books visually as well. However, this
>>>         isn't perfect: I find 1107 of these characters in Matthew,
>>>         instead of the expected 1071 verses.  But since the text
>>>         stream has a book introduction, this is likely easily
>>>         explained. Hopefully this gets you well down the path to
>>>         creating a stream with verses. 
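
The count can be reproduced with a short script, which also makes it easy to repeat for other books once their text is extracted. This is a sketch only: count_verse_ends is a hypothetical name and the terminator strings are supplied by the caller. For Matthew, 1107 found against 1071 expected leaves a surplus of 36 to account for.

```python
import re


def count_verse_ends(text: str, terminators: list[str]) -> int:
    """Count occurrences of any terminator, preferring longer ones on overlap."""
    ordered = sorted(terminators, key=len, reverse=True)
    pattern = "|".join(re.escape(item) for item in ordered)
    return len(re.findall(pattern, text))


def surplus(found: int, expected: int) -> int:
    """How many more terminators were found than verses expected."""
    return found - expected
```
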
>>>
>>>         I would NOT start from the PDF file, but from the pagemaker
>>>         file.  The PDF almost certainly has a lot of text
>>>         rearranging and extra characters like page numbers and
>>>         running heads.  Pagemaker has the book text in a single
>>>         stream, in a form that will convert to unicode relatively
>>>         easily. 
>>>
>>
>>
>>
>>         _______________________________________________
>>         sword-devel mailing list: sword-devel at crosswire.org <mailto:sword-devel at crosswire.org>
>>         http://www.crosswire.org/mailman/listinfo/sword-devel
>>         Instructions to unsubscribe/change your settings at above page
>
>
>
