[sword-devel] Bible in Myanmar
Cyrille
lafricain79 at gmail.com
Tue May 14 13:56:53 MST 2019
On 14/05/2019 22:55, Cyrille wrote:
>
>
> On 14/05/2019 22:45, Michael H wrote:
>> Cyrille, did you start from the PDF or the pagemaker file?
> PMaker
>> Either way, you should send a snippet to your source and validate the
>> words are still readable. As small as 30 words should be enough.
The converted text? If yes, look at the attached file.
>>
>> On Tue, May 14, 2019 at 8:09 AM Cyrille <lafricain79 at gmail.com> wrote:
>>
>> I'm sending my message again because the first one was too big.
>>
>> The conversion to UTF-8 is 99% solved!! I used an online converter:
>> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>> or:
>> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>
>> See the result here
>> <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.
>>
>> Now the only problem is how to get the verse and chapter number...
>>
>>
>> On 14/05/2019 13:53, Michael H wrote:
>>> Cyrille, (Peter),
>>>
>>> Maybe further discussion on this belongs in Gitlab as issues.
>>> Can I get added to this project?
>>>
>>> Here are the first few lines of Matthew copied from the PDF:
>>> ------
>>> &Sifrmaw;OD; {0Ha*vdusrf;
>>> The Gospel According to Matthew
>>> ed'gef;
>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl sK;d
>>> tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS ahf
>>> wG U Ny;D
>>>
>>> -----
>>> And here are the first few lines of Matthew copied from the
>>> Pagemaker file:
>>> -----
>>> Sifrmaw;OD; {0Ha*vdusrf;
>>> The Gospel According to Matthew
>>> ed'gef;
>>> usrf;�yyk*�dKvf &Sifrmaw;OD;\b0rSwfwrf;
>>> usrf;�yyk*�dKvf &Sifrmaw;OD;onf *gavav;,e,frS *sL;vlrsKd;
>>> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf tcGefcHoltjzpf
>>> trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD
>>> ol\trnfrSm av0djzpf\/ olonf wdab;&d,tkdifteD;wGif
>>> a,Zl;ocifESifhawGU NyD;
>>>
>>>
>>> You can see that some letters have changed, and some others are
>>> in a different order.
>>>
>>> The letters that change are likely those points that aren't
>>> compatible with unicode, and pagemaker reassigned them to ensure
>>> that the file is more widely viewable. Since a conversion is
>>> already planned, these won't matter as much, but the font
>>> embedded in the PDF is different than the font attached to the
>>> pagemaker file. If you do start from the PDF, you'll need to
>>> extract the font to get the code points.
>>>
>>> The problem is that the PDF export from pagemaker sorts the
>>> letters into the order they appear on the page. Burmese text
>>> has Indian style ligatures, where vowels tend to jump over or
>>> under the previous letters, sometimes back 2 or three letters.
>>> If you study the snippets above from the beginning of
>>> Matthew, you can see that the order differs and that
>>> some glyphs are modified.
>>>
>>> So, from the PDF letters are out of order, but from Pagemaker,
>>> letters are encoded into control points. Fixing the control
>>> points is easy and happens with the unicode conversion. Fixing
>>> the letter order is not easy. You'll need a first language
>>> speaker and plenty of time.
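A quick way to check the "same letters, different order" observation is to compare two runs as multisets of characters. Here is a minimal Python sketch using the first words of the two Matthew snippets above. The function name is made up for illustration, and ignoring whitespace is only my assumption about how the PDF copy splits words:

```python
from collections import Counter


def same_letters_different_order(pdf_run: str, pm_run: str) -> bool:
    """True if both runs hold the same characters but in a different order.

    Whitespace is ignored, on the assumption that the PDF copy inserts
    spaces between glyph clusters.
    """
    pdf_chars = "".join(pdf_run.split())
    pm_chars = "".join(pm_run.split())
    return Counter(pdf_chars) == Counter(pm_chars) and pdf_chars != pm_chars


# PDF copy first, Pagemaker copy second (taken from the snippets above).
print(same_letters_different_order("usr;f", "usrf;"))
print(same_letters_different_order("&iS rf maw;O;D", "&Sifrmaw;OD;"))
```

This only confirms that a run was reordered; it does not say what the correct order is, so it is no substitute for a reader of the language.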
>>>
>>> The guidance I received on another group was to use either LO
>>> Draw or Indesign to export the text from Pagemaker. I'll look
>>> into LO Draw again, but I don't have access to an older version
>>> of Indesign (the pagemaker import was removed in CS6).
>>>
>>>
>>> On Mon, May 13, 2019 at 10:40 AM Michael H <cmahte at gmail.com> wrote:
>>>
>>> I unzipped the pagemaker file, and when I open
>>> NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I can
>>> 'find' all of the book names, and see the text there.
>>>
>>> To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip
>>> and open it with a zip archive program. The text is in the
>>> Pagemaker file at the top level of the archive, but encoded
>>> with a lot of extraneous information. (The English text
>>> "Matthew" appears at hex location 7A76972).
>>>
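If the file really does open as a zip archive, as reported above, the same inspection can be scripted instead of using a hex editor. A rough Python sketch with the standard `zipfile` module follows; the function name is my own invention, and I have not tried it on the actual file:

```python
import zipfile


def find_in_archive(archive, needle: bytes) -> dict:
    """Map each archive member name to the first offset of `needle`.

    The offset is counted inside the decompressed member; -1 means the
    bytes were not found in that member.
    """
    offsets = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            offsets[name] = zf.read(name).find(needle)
    return offsets
```

Something like `find_in_archive("NT_Proverb.pmd", b"Matthew")` should then list the members and show where the English title sits. `zipfile` does not look at the file extension, so the rename would not be needed for this. The offsets are relative to each decompressed member, so they may not match what a hex editor shows.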
>>> When I open the fonts with fontforge, Fontforge suggests the
>>> fonts are encoded as unicode (but the glyphs are obviously
>>> not in the right spot.)
>>> However, when I copy the text (from LO Draw), paste it into
>>> jedit and save it as Unicode, reopening the file gives the
>>> warning 'not unicode, text may be missing'.
>>>
>>> So, what this means is that there are some glyphs encoded
>>> into locations that unicode treats as control or
>>> non-printing codes. The text needs to be dealt with as a
>>> specific encoding that matches whatever the original font
>>> actually uses. I haven't figured out what the original text
>>> files were encoded with. Without that knowledge, I'm not
>>> sure my system clipboard or editor (jedit) will properly
>>> respect the glyphs in unusual locations until the conversion
>>> to unicode, and I don't trust myself to be able to detect if
>>> it is or is not properly converted.
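To see which code points are the troublesome ones, the copied text can be scanned for characters that Unicode classes as control, format, private-use or unassigned. A small Python sketch using the standard `unicodedata` module is below. The choice of categories and the function name are my assumptions, not something established in this thread:

```python
import unicodedata
from collections import Counter

# General categories treated as suspicious here (an assumption):
# Cc = control, Cf = format, Co = private use, Cn = unassigned.
SUSPICIOUS = {"Cc", "Cf", "Co", "Cn"}


def suspicious_code_points(text: str, allowed: str = "\n\r\t") -> dict:
    """Count characters whose Unicode category suggests a non-printing code."""
    counts = Counter()
    for ch in text:
        if ch in allowed:
            continue
        if unicodedata.category(ch) in SUSPICIOUS:
            counts[f"U+{ord(ch):04X}"] += 1
    return dict(counts)
```

A non-empty result on the pasted text would point to the glyph positions that need a mapping before the Unicode conversion can be trusted.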
>>>
>>> On Mon, May 13, 2019 at 10:11 AM Cyrille <lafricain79 at gmail.com> wrote:
>>>
>>> David,
>>> Probably you are right about TECkit
>>> <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>,
>>> once we get the text it will help us convert it to Unicode.
>>> As for how to get the text, your method is beyond my
>>> skills :)
>>> If you succeed, please let me know.
>>>
>>> On 13/05/2019 16:21, David Haslam wrote:
>>>> Given the insights from Michael Hart, it may be
>>>> feasible to temporarily rearrange the main text stream
>>>> as follows :
>>>>
>>>> 1. Replace every EOL by a horizontal tab.
>>>> 2. Insert an EOL after each verse end character.
>>>>
>>>> Observe that the above two steps are wholly reversible
>>>> such that the original text stream can be restored later.
>>>>
>>>> In effect the text stream is now in verse per line
>>>> (VPL) layout, albeit without verse tags. Some
>>>> adjustments may be necessary if there are any section
>>>> headings, etc.
>>>>
>>>> 3. Add line numbers with the first number being reset
>>>> to 1 at the start of each chapter, numbers incrementing
>>>> by 1 for each line.
>>>> 4. Add a left margin USFM verse tag \v_
>>>>
>>>> Steps 3&4 can be implemented in various ways. For my
>>>> part, I’d use a bespoke TextPipe filter.
>>>>
>>>> Another method to consider might be to use Excel
>>>> formulae. I recall resorting to such a method in the
>>>> early days of Go Bible.
>>>>
>>>> Now restore the original layout by reverting steps 2 &
>>>> 1, if this is really necessary. That is, if the
>>>> original text layout appeared to be paragraphed.
>>>>
>>>> 5. Decide how & where to insert paragraph tags.
>>>>
>>>> 6. Add chapter tags, book ID and main title tags, etc.
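Steps 1 to 4 could be sketched in Python roughly as follows. This is only an illustration under assumptions: the verse-end marker is passed in by the caller, the text is assumed to be already split into chapters (the stream itself does not say where chapters start), and the function names are invented for the example:

```python
def to_vpl(text: str, verse_end: str) -> str:
    """Steps 1-2: turn every EOL into a tab, then break after each verse end."""
    if "\t" in text:
        # A tab already in the input would make the round trip lossy.
        raise ValueError("input already contains tabs")
    return text.replace("\n", "\t").replace(verse_end, verse_end + "\n")


def from_vpl(vpl: str) -> str:
    """Revert steps 2 and 1: drop the added EOLs, turn tabs back into EOLs."""
    return vpl.replace("\n", "").replace("\t", "\n")


def tag_chapter(vpl: str) -> list:
    """Steps 3-4 for ONE chapter: number the lines from 1, prefix USFM \\v."""
    verses = [line.strip() for line in vpl.split("\n")]
    verses = [v for v in verses if v]  # skip empty leftovers
    return [f"\\v {n} {v}" for n, v in enumerate(verses, start=1)]
```

`from_vpl(to_vpl(text, marker))` gives back the original text as long as it had no tabs to begin with, which is the reversibility mentioned above. The numbering is only as good as the marker: any marker inside an introduction or a heading shifts every verse number after it.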
>>>>
>>>> Hope this gives some useful suggestions that point
>>>> towards a practical solution.
>>>>
>>>> Best regards
>>>>
>>>> David
>>>>
>>>>
>>>> Sent from ProtonMail Mobile
>>>>
>>>>
>>>> On Mon, May 13, 2019 at 14:57, Michael H <cmahte at gmail.com> wrote:
>>>>> Cyrille
>>>>>
>>>>> LibreOffice Draw attempts to open the pagemaker file,
>>>>> with limited success. But it confirms that even in the
>>>>> pagemaker source, the verse numbers are a separate
>>>>> text stream. With this source, there is no way to copy
>>>>> the text with verse numbers intact. Each book appears
>>>>> to be stored in its own text stream in the Pagemaker file.
>>>>> LO Draw isn't rendering all of the pages, only the
>>>>> first 10, so I've only explored Matthew further.
>>>>>
>>>>> Based on Matthew only, the verses seem to all end with
>>>>> the character "-" or ";/", which should aid in the
>>>>> reconstruction. I've looked through the PDF and this
>>>>> seems to be the case for all books visually as well.
>>>>> However, this isn't perfect: I find 1107 of these
>>>>> characters in Matthew, instead of the expected 1071
>>>>> verses. But since the text stream has a book
>>>>> introduction, this is likely easily explained.
>>>>> Hopefully this gets you well down the path to creating
>>>>> a stream with verses.
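The count is easy to repeat on any extracted stream. A tiny Python sketch, taking the two markers exactly as quoted above (whether they are the right markers is for a reader of the language to confirm; the function name is hypothetical):

```python
def count_verse_ends(text: str, markers=("-", ";/")) -> int:
    """Count occurrences of the candidate verse-end markers in a text stream."""
    return sum(text.count(m) for m in markers)


# Figures reported for Matthew: markers found versus verses expected.
found, expected = 1107, 1071
print(found - expected)  # 36 markers more than verses
```

Running `count_verse_ends` separately on the introduction and on the body of a book would show whether the introduction accounts for the surplus.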
>>>>>
>>>>> I would NOT start from the PDF file, but from the
>>>>> pagemaker file. The PDF almost certainly has a lot of
>>>>> text rearranging and extra characters like page
>>>>> numbers and running heads. Pagemaker has the book
>>>>> text in a single stream, in a form that will convert
>>>>> to unicode relatively easily.
>>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> sword-devel mailing list: sword-devel at crosswire.org
>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>> Instructions to unsubscribe/change your settings at above page
>>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: TIT.zip
Type: application/zip
Size: 6139 bytes
Desc: not available
URL: <http://www.crosswire.org/pipermail/sword-devel/attachments/20190514/d683b522/attachment-0001.zip>