[sword-devel] Bible in Myanmar

Tue May 14 13:57:43 MST 2019


On 14/05/2019 22:48, Michael H wrote:
> You should be able to configure a regex search to find the verse
> boundaries.
>
> Once you have verse boundaries, if you configure the text into Verse
> per line it should be possible to assign each row a chapter and verse
> number from a reference. That is, the 3341st verse in the New Testament
> is usually John 20:31 (I don't have that memorized, just an example.)

I have no idea how to do this :)
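
A minimal sketch of the numbering half of that suggestion, in Python.
It assumes the text of a single book is already one verse per line, and
that the versification follows the usual chapter lengths for Matthew
(the counts below sum to the 1071 verses mentioned further down the
thread; the Myanmar edition may differ). The names MATTHEW_VERSES and
locate are made up for illustration.

```python
from bisect import bisect_left
from itertools import accumulate

# Verses per chapter for Matthew in the common (KJV-style) versification.
# Assumption: the Myanmar text uses the same chapter lengths.
MATTHEW_VERSES = [25, 23, 17, 25, 48, 34, 29, 34, 38, 42, 30, 50, 58, 36,
                  39, 28, 27, 35, 30, 34, 46, 46, 39, 51, 46, 75, 66, 20]


def locate(n, verses_per_chapter):
    """Map the n-th verse of a book (1-based) to (chapter, verse)."""
    ends = list(accumulate(verses_per_chapter))  # last ordinal of each chapter
    if not 1 <= n <= ends[-1]:
        raise ValueError(f"verse ordinal {n} is outside 1..{ends[-1]}")
    idx = bisect_left(ends, n)                   # 0-based chapter index
    start = ends[idx - 1] if idx else 0          # verses before this chapter
    return idx + 1, n - start


if __name__ == "__main__":
    for ordinal in (1, 25, 26, 1071):
        chapter, verse = locate(ordinal, MATTHEW_VERSES)
        print(f"line {ordinal} -> Matthew {chapter}:{verse}")
```

If the line count of a book does not match the table total, that is a
sign that headings or introductions are still mixed into the stream, or
that the versification differs.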
>
> On Tue, May 14, 2019 at 3:22 PM Cyrille <lafricain79 at gmail.com> wrote:
>
>     OK, thank you!  I already have all the text in Unicode, but without
>     the verse and chapter numbers... I began adding them manually...
>
>     On 14/05/2019 22:17, David Haslam wrote:
>>     Hi Cyrille 
>>
>>     If I can find the time tomorrow or later, I’ll have a look at
>>     what might be feasible. 
>>
>>     Thanks for all these useful links. 
>>
>>     David
>>
>>     Sent from ProtonMail Mobile
>>
>>
>>     On Tue, May 14, 2019 at 14:08, Cyrille <lafricain79 at gmail.com> wrote:
>>>     I am sending my message again because it was too big.
>>>
>>>     The conversion to UTF-8 is 99% solved!! I used an online converter:
>>>     https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>>>     or:
>>>     http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>>
>>>     See the result here
>>>     <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.
>>>
>>>     Now the only problem is how to get the verse and chapter number...
>>>
>>>
>>>     On 14/05/2019 13:53, Michael H wrote:
>>>>     Cyrille, (Peter), 
>>>>
>>>>     Maybe further discussion on this belongs in Gitlab as issues. 
>>>>     Can I get added to this project? 
>>>>
>>>>     Here are the first few lines of Matthew copied from the PDF: 
>>>>     ------
>>>>     &Sifrmaw;OD; {0Ha*vdusrf;
>>>>     The Gospel According to Matthew
>>>>     ed'gef;
>>>>     usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>>>     usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl
>>>>     sK;d tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>>>     olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>>>     a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>>>     av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS
>>>>     ahf wG U Ny;D
>>>>
>>>>     -----
>>>>     And here are the first few lines of Matthew copied from the
>>>>     Pagemaker file: 
>>>>     -----
>>>>     Sifrmaw;OD; {0Ha*vdusrf;
>>>>     The Gospel According to Matthew
>>>>     ed'gef;
>>>>     usrf;�yyk*�dKvf  &Sifrmaw;OD;\b0rSwfwrf;  
>>>>     usrf;�yyk*�dKvf  &Sifrmaw;OD;onf  *gavav;,e,frS *sL;vlrsKd;
>>>>     tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf
>>>>     trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD 
>>>>     ol\trnfrSm av0djzpf\/ olonf  wdab;&d,tkdifteD;wGif 
>>>>     a,Zl;ocifESifhawGU  NyD;
>>>>
>>>>
>>>>     You can see that some letters have changed, and some others are
>>>>     in a different order. 
>>>>
>>>>     The letters that change are likely those code points that aren't
>>>>     compatible with Unicode, which Pagemaker reassigned to
>>>>     ensure that the file is more widely viewable. Since a
>>>>     conversion is already planned, these won't matter as much, but
>>>>     the font embedded in the PDF is different from the font
>>>>     attached to the Pagemaker file. If you do start from the PDF,
>>>>     you'll need to extract the font to get the code points. 
>>>>
>>>>     The problem is that the PDF export from Pagemaker sorts the
>>>>     letters into the order they appear on the page.  Burmese text
>>>>     has Indian-style ligatures, where vowels tend to jump over or
>>>>     under the previous letters, sometimes back two or three letters.
>>>>     If you study the snippets above from the beginning of
>>>>     Matthew, you can see that the order differs and that
>>>>     some glyphs are modified. 
>>>>
>>>>     So, from the PDF, letters are out of order, but from Pagemaker,
>>>>     some letters are encoded at control code points. Fixing the
>>>>     control code points is easy and happens with the Unicode
>>>>     conversion.  Fixing the letter order is not easy. You'll need a
>>>>     first-language speaker and plenty of time. 
>>>>
>>>>     The guidance I received on another group was to use either LO
>>>>     Draw or Indesign to export the text from Pagemaker.  I'll look
>>>>     into LO Draw again, but I don't have access to an older version
>>>>     of Indesign (the pagemaker import was removed in CS6). 
>>>>
>>>>
>>>>     On Mon, May 13, 2019 at 10:40 AM Michael H <cmahte at gmail.com> wrote:
>>>>
>>>>         I unzipped the pagemaker file, and when I open
>>>>         NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I can
>>>>         'find' all of the book names, and see the text there.  
>>>>
>>>>         To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip
>>>>         and open it with a zip archive program.  The text is in
>>>>         the Pagemaker file at the top level of the archive, but
>>>>         encoded with a lot of extraneous information.  (The English
>>>>         text "Matthew" appears at hex location 7A76972). 
>>>>
>>>>         When I open the fonts with fontforge, Fontforge suggests
>>>>         the fonts are encoded as unicode (but the glyphs are
>>>>         obviously not in the right spot.) 
>>>>         However, when I copy the text (I copied from LO Draw),
>>>>         paste it into jEdit and save that as Unicode, reopening the
>>>>         file gives a warning: 'not unicode, text may be missing'. 
>>>>
>>>>         So, what this means is that there are some glyphs encoded
>>>>         into locations that unicode treats as control or
>>>>         non-printing codes. The text needs to be dealt with as a
>>>>         specific encoding that matches whatever the original font
>>>>         actually uses. I haven't figured out what the original text
>>>>         files were encoded with. Without that knowledge, I'm not
>>>>         sure my system clipboard or editor (jedit) will properly
>>>>         respect the glyphs in unusual locations until the
>>>>         conversion to unicode, and I don't trust myself to be able
>>>>         to detect if it is or is not properly converted. 
>>>>
>>>>         On Mon, May 13, 2019 at 10:11 AM Cyrille
>>>>         <lafricain79 at gmail.com> wrote:
>>>>
>>>>             David,
>>>>             Probably you are right about TECkit
>>>>             <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>,
>>>>             if we get the text it will help us to convert it to Unicode.
>>>>             As for how to get the text, your method is beyond my
>>>>             skills :)
>>>>             If you succeed, please let me know.
>>>>
>>>>             On 13/05/2019 16:21, David Haslam wrote:
>>>>>             Given the insights from Michael Hart, it may be
>>>>>             feasible to temporarily rearrange the main text stream
>>>>>             as follows :
>>>>>
>>>>>             1. Replace every EOL by a horizontal tab. 
>>>>>             2. Insert an EOL after each verse end character. 
>>>>>
>>>>>             Observe that the above two steps are wholly reversible
>>>>>             such that the original text stream can be restored later. 
>>>>>
>>>>>             In effect the text stream is now in verse per line
>>>>>             (VPL) layout, albeit without verse tags. Some
>>>>>             adjustments may be necessary if there are any section
>>>>>             headings, etc. 
>>>>>
>>>>>             3. Add line numbers with the first number being reset
>>>>>             to 1 at the start of each chapter, numbers
>>>>>             incrementing by 1 for each line. 
>>>>>             4. Add a left margin USFM verse tag \v_
>>>>>
>>>>>             Steps 3&4 can be implemented in various ways. For my
>>>>>             part, I’d use a bespoke TextPipe filter. 
>>>>>
>>>>>             Another method to consider might be to use Excel
>>>>>             formulae. I recall resorting to such a method in the
>>>>>             early days of Go Bible. 
>>>>>
>>>>>             Now restore the original layout by reverting steps 2 &
>>>>>             1, if this is really necessary. That is, if the
>>>>>             original text layout appeared to be paragraphed. 
>>>>>
>>>>>             5. Decide how & where to insert paragraph tags. 
>>>>>
>>>>>             6. Add chapter tags, book ID and main title tags, etc. 
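
Steps 1 to 4 above could be sketched in Python roughly as follows. This
is not the TextPipe filter mentioned above, only an illustration: the
function names, the verse-end pattern and the per-chapter verse counts
passed in are all assumptions, and the round trip only works if the
original stream contains no tab characters.

```python
import re


def to_vpl(text, verse_end):
    """Steps 1-2: flatten the stream, then break after each verse end.

    verse_end is a regular expression for the verse-end marker(s).
    """
    flat = text.replace("\n", "\t")  # step 1: every EOL becomes a tab
    # step 2: add an EOL after every match of the verse-end pattern
    return re.sub(verse_end, lambda m: m.group(0) + "\n", flat)


def from_vpl(vpl):
    """Undo steps 2 and 1 (valid only if the source had no tabs)."""
    return vpl.replace("\n", "").replace("\t", "\n")


def tag_verses(vpl, verses_per_chapter):
    """Steps 3-4: number the lines, restarting at 1 for each chapter,
    and put a USFM verse tag in the left margin."""
    lines = [" ".join(line.split()) for line in vpl.split("\n")]
    lines = [line for line in lines if line]  # drop empty remainders
    numbers = [v for count in verses_per_chapter
               for v in range(1, count + 1)]
    if len(lines) != len(numbers):
        raise ValueError(
            f"{len(lines)} lines but {len(numbers)} verses expected")
    return [f"\\v {n} {line}" for n, line in zip(numbers, lines)]


if __name__ == "__main__":
    # Toy stream: "/" stands in for the real verse-end marker.
    sample = "aaa/ bbb\nccc/ ddd/\n"
    vpl = to_vpl(sample, r"/")
    assert from_vpl(vpl) == sample          # steps 1-2 are reversible
    print("\n".join(tag_verses(vpl, [2, 1])))
```

The length check in tag_verses is deliberate: section headings or an
introduction left in the stream will show up as a mismatch rather than
as silently shifted verse numbers.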
>>>>>
>>>>>             Hope this gives some useful suggestions that point
>>>>>             towards a practical solution. 
>>>>>
>>>>>             Best regards 
>>>>>
>>>>>             David
>>>>>
>>>>>
>>>>>             Sent from ProtonMail Mobile
>>>>>
>>>>>
>>>>>             On Mon, May 13, 2019 at 14:57, Michael H
>>>>>             <cmahte at gmail.com> wrote:
>>>>>>             Cyrille
>>>>>>
>>>>>>             LibreOffice Draw attempts to open the pagemaker file,
>>>>>>             with limited success. But it confirms that even in
>>>>>>             the pagemaker source, the verse numbers are a
>>>>>>             separate text stream. With this source, there is no
>>>>>>             way to copy the text with verse numbers intact. Each
>>>>>>             book appears to be stored as its own text stream in
>>>>>>             the Pagemaker file. LO Draw isn't rendering all of the
>>>>>>             pages, only the first 10, so I've only explored
>>>>>>             Matthew further. 
>>>>>>
>>>>>>             Based on Matthew only, the verses seem to all end
>>>>>>             with the character "-" or ";/", which should aid in
>>>>>>             the reconstruction. I've looked through the PDF and
>>>>>>             this seems to be the case for all books visually as
>>>>>>             well. However, this isn't perfect: I find 1107 of
>>>>>>             these characters in Matthew, instead of the expected
>>>>>>             1071 verses.  But since the text stream has a book
>>>>>>             introduction, this is likely easily explained.
>>>>>>             Hopefully this gets you well down the path to
>>>>>>             creating a stream with verses. 
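
A rough way to repeat that count on an exported text stream, assuming
the two verse-end markers quoted above ("-" and ";/") in the legacy
encoding. The helper names are hypothetical, and 1071 is the expected
verse total for Matthew given above.

```python
import re

# Assumed verse-end markers in the legacy (pre-Unicode) text stream.
VERSE_END = r";/|-"


def count_verse_ends(text, pattern=VERSE_END):
    """Count how many verse-end markers occur in the stream."""
    return len(re.findall(pattern, text))


def surplus(text, expected, pattern=VERSE_END):
    """Markers found minus verses expected; a positive value suggests
    extra material such as a book introduction or headings."""
    return count_verse_ends(text, pattern) - expected


if __name__ == "__main__":
    toy = "a- b;/ c-"            # three markers in a toy stream
    print(count_verse_ends(toy), surplus(toy, expected=2))
```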
>>>>>>
>>>>>>             I would NOT start from the PDF file, but from the
>>>>>>             pagemaker file.  The PDF almost certainly has a lot
>>>>>>             of text rearranging and extra characters like page
>>>>>>             numbers and running heads.  Pagemaker has the book
>>>>>>             text in a single stream, in a form that will convert
>>>>>>             to unicode relatively easily. 
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>             _______________________________________________
>>>>>             sword-devel mailing list: sword-devel at crosswire.org <mailto:sword-devel at crosswire.org>
>>>>>             http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>>             Instructions to unsubscribe/change your settings at above page
