[sword-devel] Bible in Myanmar

Cyrille lafricain79 at gmail.com
Tue May 14 13:42:07 MST 2019



On 14/05/2019 22:26, David Haslam wrote:
> If Michael’s observations are anything to go by, then maybe I can
> script the recovery of chapter & verse tags. 
>
> We shall see ....
>
> Even if I’m not immediately successful - valuable lessons can be
> learned in the attempt.
Very well, I'll wait for you ;)
>
> David
>
> Sent from ProtonMail Mobile
>
>
> On Tue, May 14, 2019 at 21:21, Cyrille <lafricain79 at gmail.com
> <mailto:lafricain79 at gmail.com>> wrote:
>> Ok, thank you!  I already have all the text in Unicode, but without
>> the verse and chapter numbers... I have begun adding them manually...
>>
>> On 14/05/2019 22:17, David Haslam wrote:
>>> Hi Cyrille 
>>>
>>> If I can find the time tomorrow or later, I’ll have a look at what
>>> might be feasible. 
>>>
>>> Thanks for all these useful links. 
>>>
>>> David
>>>
>>> Sent from ProtonMail Mobile
>>>
>>>
>>> On Tue, May 14, 2019 at 14:08, Cyrille <lafricain79 at gmail.com
>>> <mailto:lafricain79 at gmail.com>> wrote:
>>>> I am sending my message again because it was too big.
>>>>
>>>> The conversion to UTF-8 is 99% solved!! I used an online converter:
>>>> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>>>> or:
>>>> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>>>
>>>> See the result here
>>>> <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.
>>>>
>>>> Now the only problem is how to get the verse and chapter numbers...
>>>>
>>>>
>>>> On 14/05/2019 13:53, Michael H wrote:
>>>>> Cyrille, (Peter), 
>>>>>
>>>>> Maybe further discussion on this belongs in Gitlab as issues.  Can
>>>>> I get added to this project? 
>>>>>
>>>>> Here are the first few lines of Matthew copied from the PDF: 
>>>>> ------
>>>>> &Sifrmaw;OD; {0Ha*vdusrf;
>>>>> The Gospel According to Matthew
>>>>> ed'gef;
>>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d
>>>>> tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>>>> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>>>> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>>>> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf
>>>>> wG U Ny;D
>>>>>
>>>>> -----
>>>>> And here are the first few lines of Matthew copied from the
>>>>> Pagemaker file: 
>>>>> -----
>>>>> Sifrmaw;OD; {0Ha*vdusrf;
>>>>> The Gospel According to Matthew
>>>>> ed'gef;
>>>>> usrf;�yyk*�dKvf  &Sifrmaw;OD;\b0rSwfwrf;  
>>>>> usrf;�yyk*�dKvf  &Sifrmaw;OD;onf  *gavav;,e,frS *sL;vlrsKd;
>>>>> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf
>>>>> trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD 
>>>>> ol\trnfrSm av0djzpf\/ olonf  wdab;&d,tkdifteD;wGif 
>>>>> a,Zl;ocifESifhawGU  NyD;
>>>>>
>>>>>
>>>>> You can see that some letters have changed, and some others are in
>>>>> a different order. 
>>>>>
>>>>> The letters that change are likely those code points that aren't
>>>>> compatible with Unicode, and PageMaker reassigned them to ensure
>>>>> that the file is more widely viewable. Since a conversion is
>>>>> already planned, these won't matter as much. However, the font
>>>>> embedded in the PDF is different from the font attached to the
>>>>> PageMaker file, so if you do start from the PDF, you'll need to
>>>>> extract the font to get the code points. 
>>>>>
>>>>> The problem is that the PDF export from PageMaker sorts the
>>>>> letters into the order they appear on the page.  Burmese text has
>>>>> Indic-style ligatures, where vowels tend to jump over or under
>>>>> the previous letters, sometimes back two or three letters. If you
>>>>> study the snippets above from the beginning of Matthew, you can
>>>>> see that the order differs and that some glyphs are modified. 
>>>>>
>>>>> So, from the PDF the letters are out of order, but from PageMaker
>>>>> some letters are encoded at control code points. Fixing the
>>>>> control code points is easy and happens with the Unicode
>>>>> conversion.  Fixing the letter order is not easy. You'll need a
>>>>> first-language speaker and plenty of time. 
>>>>>
>>>>> The guidance I received on another group was to use either LO Draw
>>>>> or InDesign to export the text from PageMaker.  I'll look into LO
>>>>> Draw again, but I don't have access to an older version of
>>>>> InDesign (the PageMaker import was removed in CS6). 
>>>>>
>>>>>
>>>>> On Mon, May 13, 2019 at 10:40 AM Michael H <cmahte at gmail.com
>>>>> <mailto:cmahte at gmail.com>> wrote:
>>>>>
>>>>>     I unzipped the pagemaker file, and when I open
>>>>>     NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I can 'find'
>>>>>     all of the book names, and see the text there.  
>>>>>
>>>>>     To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip
>>>>>     and open it with a zip archive program.  The text is in the
>>>>>     Pagemaker file at the top level of the archive, but encoded
>>>>>     with a lot of extraneous information.  (The English text
>>>>>     "Matthew" appears at hex location 7A76972). 
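
A rough sketch of that hex-editor search, not what Michael actually ran: the Python below lists every byte offset at which an ASCII marker such as "Matthew" occurs in a file's raw bytes. The function name find_offsets is made up for illustration, and the commented usage assumes the file is called NT_Proverb.pmd as in the message above. Whether the file really is a zip container can be checked first with the standard library's zipfile.is_zipfile.

```python
from pathlib import Path
import zipfile


def find_offsets(data: bytes, needle: bytes) -> list[int]:
    """Return every byte offset at which needle occurs in data."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        # Continue one byte further on so overlapping hits are found too.
        start = data.find(needle, start + 1)
    return offsets


# Hypothetical usage (the path is an assumption, adjust as needed):
#
#   path = Path("NT_Proverb.pmd")
#   print("zip container:", zipfile.is_zipfile(path))
#   for off in find_offsets(path.read_bytes(), b"Matthew"):
#       print(f"{off:X}")   # offsets in hex, as a hex editor shows them
```

This only locates plain ASCII strings. The Burmese text itself is in a legacy font encoding, so it still has to be extracted and converted separately.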
>>>>>
>>>>>     When I open the fonts with FontForge, it suggests the fonts
>>>>>     are encoded as Unicode (but the glyphs are obviously not in
>>>>>     the right spots.) 
>>>>>     However, when I copy the text (I copied from LO Draw), paste
>>>>>     it into jEdit and save that as Unicode, reopening the file
>>>>>     gives a warning: 'not unicode, text may be missing'. 
>>>>>
>>>>>     So, what this means is that there are some glyphs encoded into
>>>>>     locations that unicode treats as control or non-printing
>>>>>     codes. The text needs to be dealt with as a specific encoding
>>>>>     that matches whatever the original font actually uses. I
>>>>>     haven't figured out what the original text files were encoded
>>>>>     with. Without that knowledge, I'm not sure my system clipboard
>>>>>     or editor (jedit) will properly respect the glyphs in unusual
>>>>>     locations until the conversion to unicode, and I don't trust
>>>>>     myself to be able to detect if it is or is not properly
>>>>>     converted. 
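
One way to see which characters land in those control or non-printing locations is to tally them by Unicode general category. This is a minimal sketch, assuming the copied text has already been read into a Python string; the function name suspicious_code_points and the choice of categories (Cc control, Cf format, Co private use, Cn unassigned) are my own assumptions, not something taken from the fonts.

```python
import unicodedata
from collections import Counter

# General categories that ordinary printable text should not contain.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def suspicious_code_points(text: str) -> Counter:
    """Count characters whose Unicode category marks them as non-printing."""
    counts = Counter()
    for ch in text:
        if ch in "\n\r\t":
            continue  # normal line and tab structure is not of interest
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            counts[f"U+{ord(ch):04X}"] += 1
    return counts


# Hypothetical usage on a pasted sample:
#   for code_point, n in suspicious_code_points(sample).most_common():
#       print(code_point, n)
```

An empty result does not prove the text survived the clipboard intact; it only shows that nothing was left sitting on a control code point.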
>>>>>
>>>>>     On Mon, May 13, 2019 at 10:11 AM Cyrille
>>>>>     <lafricain79 at gmail.com <mailto:lafricain79 at gmail.com>> wrote:
>>>>>
>>>>>         David,
>>>>>         You are probably right about TECkit
>>>>>         <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>,
>>>>>         if we get the text it will help us to convert it to Unicode.
>>>>>         As for how to get the text, your method is beyond my skills :)
>>>>>         If you succeed, please let me know.
>>>>>
>>>>>         On 13/05/2019 16:21, David Haslam wrote:
>>>>>>         Given the insights from Michael Hart, it may be feasible
>>>>>>         to temporarily rearrange the main text stream as follows :
>>>>>>
>>>>>>         1. Replace every EOL by a horizontal tab. 
>>>>>>         2. Insert an EOL after each verse end character. 
>>>>>>
>>>>>>         Observe that the above two steps are wholly reversible
>>>>>>         such that the original text stream can be restored later. 
>>>>>>
>>>>>>         In effect the text stream is now in verse per line (VPL)
>>>>>>         layout, albeit without verse tags. Some adjustments may
>>>>>>         be necessary if there are any section headings, etc. 
>>>>>>
>>>>>>         3. Add line numbers with the first number being reset to
>>>>>>         1 at the start of each chapter, numbers incrementing by 1
>>>>>>         for each line. 
>>>>>>         4. Add a left margin USFM verse tag \v_
>>>>>>
>>>>>>         Steps 3&4 can be implemented in various ways. For my
>>>>>>         part, I’d use a bespoke TextPipe filter. 
>>>>>>
>>>>>>         Another method to consider might be to use Excel
>>>>>>         formulae. I recall resorting to such a method in the
>>>>>>         early days of Go Bible. 
>>>>>>
>>>>>>         Now restore the original layout by reverting steps 2 & 1,
>>>>>>         if this is really necessary. That is, if the original
>>>>>>         text layout appeared to be paragraphed. 
>>>>>>
>>>>>>         5. Decide how & where to insert paragraph tags. 
>>>>>>
>>>>>>         6. Add chapter tags, book ID and main title tags, etc. 
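
Steps 1 to 4 above could be sketched in Python roughly as follows. This is not David's TextPipe filter, just an illustration under stated assumptions: the text contains no tab characters to begin with (otherwise steps 1 and 2 are not reversible), a single verse-end marker is passed in by the caller, and the text handed to tag_verses is one chapter with headings already removed. The function names are made up.

```python
def to_vpl(text: str, verse_end: str) -> str:
    """Steps 1-2: turn each EOL into a tab, then break after each verse end."""
    folded = text.replace("\n", "\t")                   # step 1
    return folded.replace(verse_end, verse_end + "\n")  # step 2


def from_vpl(vpl: str) -> str:
    """Undo steps 2 and 1, restoring the original line layout."""
    return vpl.replace("\n", "").replace("\t", "\n")


def tag_verses(vpl_chapter: str, first_verse: int = 1) -> str:
    """Steps 3-4: number the lines of one chapter and prefix USFM \\v tags."""
    lines = [ln for ln in vpl_chapter.split("\n") if ln.strip()]
    return "\n".join(
        f"\\v {n} " + ln.strip().replace("\t", " ")
        for n, ln in enumerate(lines, first_verse)
    )


# Made-up placeholder text; "/" stands in for the real verse-end marker.
sample = "aaa/ bbb\nccc/\nddd/"
vpl = to_vpl(sample, "/")
print(tag_verses(vpl))
```

Calling tag_verses once per chapter gives the reset to 1 that step 3 asks for. Finding the chapter boundaries is left open here, as it is in the steps above.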
>>>>>>
>>>>>>         Hope this gives some useful suggestions that point
>>>>>>         towards a practical solution. 
>>>>>>
>>>>>>         Best regards 
>>>>>>
>>>>>>         David
>>>>>>
>>>>>>
>>>>>>         Sent from ProtonMail Mobile
>>>>>>
>>>>>>
>>>>>>         On Mon, May 13, 2019 at 14:57, Michael H
>>>>>>         <cmahte at gmail.com <mailto:cmahte at gmail.com>> wrote:
>>>>>>>         Cyrille
>>>>>>>
>>>>>>>         LibreOffice Draw attempts to open the PageMaker file,
>>>>>>>         with limited success. But it confirms that even in the
>>>>>>>         PageMaker source, the verse numbers are a separate text
>>>>>>>         stream. With this source, there is no way to copy the
>>>>>>>         text with verse numbers intact. Each book appears to be
>>>>>>>         stored in its own text stream in the PageMaker file. LO
>>>>>>>         Draw isn't rendering all of the pages, only the first 10,
>>>>>>>         so I've only explored Matthew further. 
>>>>>>>
>>>>>>>         Based on Matthew only, the verses seem to all end with
>>>>>>>         the character "-" or ";/", which should aid in the
>>>>>>>         reconstruction. I've looked through the PDF and this
>>>>>>>         seems to be the case for all books visually as well.
>>>>>>>         However, this isn't perfect: I find 1107 of these
>>>>>>>         characters in Matthew, instead of the expected 1071
>>>>>>>         verses.  But since the text stream has a book
>>>>>>>         introduction, this is likely easily explained. Hopefully
>>>>>>>         this gets you well down the path to creating a stream
>>>>>>>         with verses. 
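
The comparison of 1107 markers against 1071 expected verses can be repeated with a simple count. A minimal sketch, assuming the Matthew text stream has been saved as a plain string and that the two markers are exactly the ones quoted above; the function names are hypothetical.

```python
EXPECTED_MATTHEW_VERSES = 1071  # figure quoted in the message above


def count_markers(text: str, markers=("-", ";/")) -> dict:
    """Count occurrences of each candidate verse-end marker."""
    return {m: text.count(m) for m in markers}


def surplus(text: str, expected: int, markers=("-", ";/")) -> int:
    """Markers found minus verses expected (positive means too many)."""
    return sum(count_markers(text, markers).values()) - expected


# Hypothetical usage:
#   print(count_markers(matthew_text))
#   print(surplus(matthew_text, EXPECTED_MATTHEW_VERSES))
```

With the figures reported above the surplus would be 1107 - 1071 = 36, which the book introduction may account for.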
>>>>>>>
>>>>>>>         I would NOT start from the PDF file, but from the
>>>>>>>         pagemaker file.  The PDF almost certainly has a lot of
>>>>>>>         text rearranging and extra characters like page numbers
>>>>>>>         and running heads.  Pagemaker has the book text in a
>>>>>>>         single stream, in a form that will convert to unicode
>>>>>>>         relatively easily. 
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>         _______________________________________________
>>>>>>         sword-devel mailing list: sword-devel at crosswire.org <mailto:sword-devel at crosswire.org>
>>>>>>         http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>>         Instructions to unsubscribe/change your settings at above page
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
