[sword-devel] Bible in Myanmar

Tue May 14 13:56:53 MST 2019


On 14/05/2019 22:55, Cyrille wrote:
>
>
> On 14/05/2019 22:45, Michael H wrote:
>> Cyrille, did you start from the PDF or the pagemaker file?
> PMaker
>> Either way, you should send a snippet to your source and validate the
>> words are still readable. As small as 30 words should be enough.
The converted text? If so, see the attached file.
>>
>> On Tue, May 14, 2019 at 8:09 AM Cyrille <lafricain79 at gmail.com
>> <mailto:lafricain79 at gmail.com>> wrote:
>>
>>     I am sending my message again because the first one was too big.
>>
>>     The conversion to UTF-8 is 99% solved!! I used an online converter:
>>     https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>>     or:
>>     http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>
>>     See the result here
>>     <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.
>>
>>     Now the only problem is how to get the verse and chapter numbers...
>>
>>
>>     On 14/05/2019 13:53, Michael H wrote:
>>>     Cyrille, (Peter), 
>>>
>>>     Maybe further discussion on this belongs in Gitlab as issues. 
>>>     Can I get added to this project? 
>>>
>>>     Here are the first few lines of Matthew copied from the PDF: 
>>>     ------
>>>     &Sifrmaw;OD; {0Ha*vdusrf;
>>>     The Gospel According to Matthew
>>>     ed'gef;
>>>     usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>>     usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d
>>>     tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>>     olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>>     a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>>     av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf
>>>     wG U Ny;D
>>>
>>>     -----
>>>     And here are the first few lines of Matthew copied from the
>>>     Pagemaker file: 
>>>     -----
>>>     Sifrmaw;OD; {0Ha*vdusrf;
>>>     The Gospel According to Matthew
>>>     ed'gef;
>>>     usrf;�yyk*�dKvf  &Sifrmaw;OD;\b0rSwfwrf;  
>>>     usrf;�yyk*�dKvf  &Sifrmaw;OD;onf  *gavav;,e,frS *sL;vlrsKd;
>>>     tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf  tcGefcHoltjzpf
>>>     trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD 
>>>     ol\trnfrSm av0djzpf\/ olonf  wdab;&d,tkdifteD;wGif 
>>>     a,Zl;ocifESifhawGU  NyD;
>>>
>>>
>>>     You can see that some letters have changed, and some others are
>>>     in a different order.
>>>
>>>     The letters that change are likely the code points that aren't
>>>     compatible with Unicode, which PageMaker reassigned so that the
>>>     file is more widely viewable. Since a conversion is already
>>>     planned, these won't matter as much. However, the font embedded
>>>     in the PDF is different from the font attached to the PageMaker
>>>     file, so if you do start from the PDF, you'll need to extract
>>>     the font to get the code points.
>>>
>>>     The problem is that the PDF export from PageMaker sorts the
>>>     letters into the order they appear on the page. Burmese text
>>>     has Indic-style ligatures, where vowels tend to jump over or
>>>     under the previous letters, sometimes back two or three letters.
>>>     If you study the snippets above from the beginning of Matthew,
>>>     you can see that the order differs and that some glyphs are
>>>     modified.
>>>
>>>     So, from the PDF the letters are out of order, but from PageMaker
>>>     the letters are encoded into control code points. Fixing the
>>>     control code points is easy and happens with the Unicode
>>>     conversion. Fixing the letter order is not easy. You'll need a
>>>     first-language speaker and plenty of time.
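
One rough way to separate the two effects described above (substituted letters versus reordered letters) is to compare the two extractions as bags of characters. This is only a sketch: the function name `char_diff` is made up for illustration, and ignoring whitespace is an assumption, since the PDF copy inserts extra spaces.

```python
from collections import Counter


def char_diff(pdf_text, pm_text):
    """Compare two extractions as multisets of characters.

    Returns (only_in_pdf, only_in_pm). Characters that were merely
    reordered cancel out; characters that were substituted remain.
    Whitespace is ignored (an assumption, see the lead-in).
    """
    pdf_counts = Counter(ch for ch in pdf_text if not ch.isspace())
    pm_counts = Counter(ch for ch in pm_text if not ch.isspace())
    return pdf_counts - pm_counts, pm_counts - pdf_counts


if __name__ == "__main__":
    # "usr;f" (PDF) and "usrf;" (PageMaker) come from the snippets above.
    print(char_diff("usr;f", "usrf;"))
```

If both results are empty for a word, the two sources differ only in letter order; anything left over points at a glyph that was reassigned.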
>>>
>>>     The guidance I received on another group was to use either LO
>>>     Draw or InDesign to export the text from PageMaker.  I'll look
>>>     into LO Draw again, but I don't have access to an older version
>>>     of InDesign (the PageMaker import was removed in CS6). 
>>>
>>>
>>>     On Mon, May 13, 2019 at 10:40 AM Michael H <cmahte at gmail.com
>>>     <mailto:cmahte at gmail.com>> wrote:
>>>
>>>         I unzipped the pagemaker file, and when I open
>>>         NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I can
>>>         'find' all of the book names, and see the text there.  
>>>
>>>         To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip
>>>         and open it with a zip archive program.  The text is in the
>>>         Pagemaker file at the top level of the archive, but encoded
>>>         with a lot of extraneous information.  (The English text
>>>         "Matthew" appears at hex location 7A76972). 
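
The manual hex-editor search could also be scripted. The sketch below assumes, as the message above describes, that the renamed file really opens as a zip archive; `find_in_archive` is a hypothetical helper name, and the offsets it reports are within each uncompressed member, which may not match the offset a hex editor shows.

```python
import zipfile


def find_in_archive(path, needle):
    """Return (member_name, byte_offset) for each archive member
    whose uncompressed bytes contain `needle`."""
    hits = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            offset = data.find(needle)  # first occurrence only
            if offset != -1:
                hits.append((name, offset))
    return hits


if __name__ == "__main__":
    import sys
    # Example: python find_text.py NT_Proverb.zip
    for name, offset in find_in_archive(sys.argv[1], b"Matthew"):
        print(f"{name}: first match at 0x{offset:X}")
```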
>>>
>>>         When I open the fonts with FontForge, it suggests the
>>>         fonts are encoded as Unicode (but the glyphs are obviously
>>>         not in the right spots).
>>>         However, when I copy the text (I copied from LO Draw),
>>>         paste it into jEdit and save that as Unicode, reopening the
>>>         file gives a warning: 'not unicode, text may be missing'.
>>>
>>>         So, what this means is that some glyphs are encoded
>>>         into locations that Unicode treats as control or
>>>         non-printing codes. The text needs to be dealt with as a
>>>         specific encoding that matches whatever the original font
>>>         actually uses. I haven't figured out what the original text
>>>         files were encoded with. Without that knowledge, I'm not
>>>         sure my system clipboard or editor (jEdit) will properly
>>>         respect the glyphs in unusual locations until the conversion
>>>         to Unicode, and I don't trust myself to be able to detect
>>>         whether it has been properly converted.
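
A quick way to check whether copied text contains code points that Unicode treats as control or non-printing is to scan it by general category. This is a minimal sketch, not a validation of the conversion itself: `suspicious_code_points` is a hypothetical name, and treating U+FFFD as suspicious is an assumption (it usually marks a character that was already lost).

```python
import unicodedata
from collections import Counter

# General categories for control, format, private-use,
# unassigned and surrogate code points.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn", "Cs"}


def suspicious_code_points(text):
    """Count code points that are unlikely to be real letters."""
    counts = Counter()
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary line breaks and tabs are fine
        if unicodedata.category(ch) in SUSPECT_CATEGORIES or ch == "\ufffd":
            counts[f"U+{ord(ch):04X}"] += 1
    return counts


if __name__ == "__main__":
    print(suspicious_code_points("usrf;\x8fyy\ufffddKvf\n"))
```

An empty result does not prove the text is correct, only that no glyph is sitting on a control or non-printing code point.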
>>>
>>>         On Mon, May 13, 2019 at 10:11 AM Cyrille
>>>         <lafricain79 at gmail.com <mailto:lafricain79 at gmail.com>> wrote:
>>>
>>>             David,
>>>             You are probably right about TECkit
>>>             <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>,
>>>             if we get the text it will help us convert it to Unicode.
>>>             As for how to get the text, your method is beyond my
>>>             skills :)
>>>             If you succeed, please let me know.
>>>
>>>             On 13/05/2019 16:21, David Haslam wrote:
>>>>             Given the insights from Michael Hart, it may be
>>>>             feasible to temporarily rearrange the main text stream
>>>>             as follows :
>>>>
>>>>             1. Replace every EOL by a horizontal tab. 
>>>>             2. Insert an EOL after each verse end character. 
>>>>
>>>>             Observe that the above two steps are wholly reversible
>>>>             such that the original text stream can be restored later. 
>>>>
>>>>             In effect the text stream is now in verse per line
>>>>             (VPL) layout, albeit without verse tags. Some
>>>>             adjustments may be necessary if there are any section
>>>>             headings, etc. 
>>>>
>>>>             3. Add line numbers with the first number being reset
>>>>             to 1 at the start of each chapter, numbers incrementing
>>>>             by 1 for each line. 
>>>>             4. Add a left margin USFM verse tag \v_
>>>>
>>>>             Steps 3&4 can be implemented in various ways. For my
>>>>             part, I’d use a bespoke TextPipe filter. 
>>>>
>>>>             Another method to consider might be to use Excel
>>>>             formulae. I recall resorting to such a method in the
>>>>             early days of Go Bible. 
>>>>
>>>>             Now restore the original layout by reverting steps 2 &
>>>>             1, if this is really necessary. That is, if the
>>>>             original text layout appeared to be paragraphed. 
>>>>
>>>>             5. Decide how & where to insert paragraph tags. 
>>>>
>>>>             6. Add chapter tags, book ID and main title tags, etc. 
>>>>
>>>>             Hope this gives some useful suggestions that point
>>>>             towards a practical solution. 
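
Steps 1 to 4 above could be sketched roughly as follows. Several things here are assumptions rather than anything settled in the thread: the verse-end character is passed in as a parameter (the default "/" is only a guess from the sample text), the input is assumed to contain no tab characters already and to use plain "\n" line endings (otherwise the round trip is not exact), and the text is assumed to have been split into chapters beforehand so that numbering restarts at 1 per call. The function names are made up for illustration.

```python
def to_verse_per_line(text, verse_end="/"):
    """Steps 1 and 2: EOL -> tab, then EOL after each verse end."""
    flat = text.replace("\n", "\t")
    return flat.replace(verse_end, verse_end + "\n")


def restore_layout(vpl, verse_end="/"):
    """Revert step 2, then step 1, recovering the original stream."""
    return vpl.replace(verse_end + "\n", verse_end).replace("\t", "\n")


def tag_verses(vpl):
    """Steps 3 and 4: number the lines of one chapter from 1 and
    prefix each with a USFM verse marker."""
    lines = [ln.strip() for ln in vpl.split("\n") if ln.strip()]
    return "\n".join(f"\\v {n} {ln}" for n, ln in enumerate(lines, 1))


if __name__ == "__main__":
    chapter = "abc/ def\nghi/ jkl/\n"
    vpl = to_verse_per_line(chapter)
    assert restore_layout(vpl) == chapter  # steps 1 and 2 are reversible
    print(tag_verses(vpl))
```

Note that `tag_verses` strips surrounding whitespace, so the layout should be restored from the untagged verse-per-line text, not from its output.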
>>>>
>>>>             Best regards 
>>>>
>>>>             David
>>>>
>>>>
>>>>
>>>>             On Mon, May 13, 2019 at 14:57, Michael H
>>>>             <cmahte at gmail.com <mailto:cmahte at gmail.com>> wrote:
>>>>>             Cyrille
>>>>>
>>>>>             LibreOffice Draw attempts to open the PageMaker file,
>>>>>             with limited success. But it confirms that even in the
>>>>>             PageMaker source, the verse numbers are a separate
>>>>>             text stream. With this source, there is no way to copy
>>>>>             the text with verse numbers intact. Each book appears
>>>>>             to be stored in its own text stream in the PageMaker
>>>>>             file. LO Draw isn't rendering all of the pages, only
>>>>>             the first 10, so I've only explored Matthew further.
>>>>>
>>>>>             Based on Matthew only, the verses seem to all end with
>>>>>             the character "-" or ";/", which should aid in the
>>>>>             reconstruction. I've looked through the PDF and this
>>>>>             seems to be the case for all books visually as well.
>>>>>             However, this isn't perfect: I find 1107 of these
>>>>>             characters in Matthew, instead of the expected 1071
>>>>>             verses.  But since the text stream has a book
>>>>>             introduction, this is likely easily explained.
>>>>>             Hopefully this gets you well down the path to creating
>>>>>             a stream with verses. 
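
The count mentioned above (1107 markers found against 1071 expected verses, a surplus of 36) can be reproduced for any book with a few lines. This is a sketch under assumptions: the marker strings are passed in by the caller, the function names are hypothetical, and longer markers are matched first so that overlapping markers are not counted twice.

```python
import re


def count_verse_ends(text, markers):
    """Count non-overlapping occurrences of any verse-end marker."""
    if not markers:
        raise ValueError("at least one marker is required")
    # Try longer markers first so ";/" is not also counted as "/".
    ordered = sorted(markers, key=len, reverse=True)
    pattern = "|".join(re.escape(m) for m in ordered)
    return len(re.findall(pattern, text))


def surplus(text, markers, expected_verses):
    """Positive result: more markers than verses (e.g. an introduction)."""
    return count_verse_ends(text, markers) - expected_verses


if __name__ == "__main__":
    sample = "a- b;/ c-"
    print(count_verse_ends(sample, ("-", ";/")))
```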
>>>>>
>>>>>             I would NOT start from the PDF file, but from the
>>>>>             pagemaker file.  The PDF almost certainly has a lot of
>>>>>             text rearranging and extra characters like page
>>>>>             numbers and running heads.  Pagemaker has the book
>>>>>             text in a single stream, in a form that will convert
>>>>>             to unicode relatively easily. 
>>>>>
>>>>
>>>>
>>>>
>>>>             _______________________________________________
>>>>             sword-devel mailing list: sword-devel at crosswire.org <mailto:sword-devel at crosswire.org>
>>>>             http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>             Instructions to unsubscribe/change your settings at above page

-------------- next part --------------
A non-text attachment was scrubbed...
Name: TIT.zip
Type: application/zip
Size: 6139 bytes
Desc: not available
URL: <http://www.crosswire.org/pipermail/sword-devel/attachments/20190514/d683b522/attachment-0001.zip>