[sword-devel] Bible in Myanmar
Cyrille
lafricain79 at gmail.com
Tue May 14 13:57:43 MST 2019
On 14/05/2019 22:48, Michael H wrote:
> You should be able to configure a regex search to find the verse
> boundaries.
>
> Once you have verse boundaries, if you configure the text into Verse
> per line it should be possible to assign each row a chapter and verse
> number from a reference. That is, the 3341st verse in the New Testament
> is usually John 20:31 (I don't have that memorized, just an example.)
I have no idea how to do this :)
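
One way the quoted idea could be sketched in Python, assuming the text is already one verse per line and that a table of verse counts per chapter is available. The `VERSE_COUNTS` table below is a tiny hypothetical stand-in, not real versification data; the real counts would have to come from the versification the module targets.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical, truncated table: book -> number of verses in each chapter.
# Replace with the counts of the real target versification.
VERSE_COUNTS: Dict[str, List[int]] = {
    "BookA": [3, 2],
    "BookB": [1],
}


def iter_references(counts: Dict[str, List[int]]) -> Iterator[Tuple[str, int, int]]:
    """Yield (book, chapter, verse) in canonical order."""
    for book, chapters in counts.items():
        for chapter, verse_total in enumerate(chapters, start=1):
            for verse in range(1, verse_total + 1):
                yield book, chapter, verse


def label_lines(lines: List[str], counts: Dict[str, List[int]]) -> List[str]:
    """Prefix the n-th line with the n-th reference of the versification."""
    refs = list(iter_references(counts))
    if len(lines) != len(refs):
        # A mismatch usually means headings or introductions are still in the text.
        raise ValueError(f"{len(lines)} lines but {len(refs)} verses expected")
    return [f"{b} {c}:{v} {text}" for (b, c, v), text in zip(refs, lines)]
```

The length check is deliberate: if the line count and the verse count disagree, silently numbering anyway would shift every later reference.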
>
> On Tue, May 14, 2019 at 3:22 PM Cyrille <lafricain79 at gmail.com> wrote:
>
> Ok thank you! I already have all the text in Unicode but without
> the verse numbers and chapters... I began manually...
>
> On 14/05/2019 22:17, David Haslam wrote:
>> Hi Cyrille
>>
>> If I can find the time tomorrow or later, I’ll have a look at
>> what might be feasible.
>>
>> Thanks for all these useful links.
>>
>> David
>>
>> Sent from ProtonMail Mobile
>>
>>
>> On Tue, May 14, 2019 at 14:08, Cyrille <lafricain79 at gmail.com> wrote:
>>> I am sending my message again because the first one was too big.
>>>
>>> The conversion to UTF-8 is 99% solved!! I used an online converter:
>>> https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html
>>> or:
>>> http://burglish.my-mm.org/latest/trunk/web/fontconv.htm
>>>
>>> See the result here
>>> <https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=>.
>>>
>>> Now the only problem is how to get the verse and chapter number...
>>>
>>>
>>> On 14/05/2019 13:53, Michael H wrote:
>>>> Cyrille, (Peter),
>>>>
>>>> Maybe further discussion on this belongs in Gitlab as issues.
>>>> Can I get added to this project?
>>>>
>>>> Here are the first few lines of Matthew copied from the PDF:
>>>> ------
>>>> &Sifrmaw;OD; {0Ha*vdusrf;
>>>> The Gospel According to Matthew
>>>> ed'gef;
>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f
>>>> usr;f ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf S*sL;vrl
>>>> sK;d tmvaf z;O;D \om;jzp\f / (rmu k2;14)
>>>> olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
>>>> a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
>>>> av0djzp\f / ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf iS
>>>> ahf wG U Ny;D
>>>>
>>>> -----
>>>> And here are the first few lines of Matthew copied from the
>>>> Pagemaker file:
>>>> -----
>>>> Sifrmaw;OD; {0Ha*vdusrf;
>>>> The Gospel According to Matthew
>>>> ed'gef;
>>>> usrf;�yyk*�dKvf &Sifrmaw;OD;\b0rSwfwrf;
>>>> usrf;�yyk*�dKvf &Sifrmaw;OD;onf *gavav;,e,frS *sL;vlrsKd;
>>>> tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf tcGefcHoltjzpf
>>>> trIxrf;chJonf/ (vk 5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD
>>>> ol\trnfrSm av0djzpf\/ olonf wdab;&d,tkdifteD;wGif
>>>> a,Zl;ocifESifhawGU NyD;
>>>>
>>>>
>>>> You can see that some letters have changed, and some others are
>>>> in a different order.
>>>>
>>>> The letters that change are likely those points that aren't
>>>> compatible with unicode, and pagemaker reassigned them to
>>>> ensure that the file is more widely viewable. Since a
>>>> conversion is already planned, these won't matter as much, but
>>>> the font embedded in the PDF is different than the font
>>>> attached to the pagemaker file. If you do start from the PDF,
>>>> you'll need to extract the font to get the code points.
>>>>
>>>> The problem is that the PDF export from pagemaker sorts the
>>>> letters into the order they appear on the page. Burmese text
>>>> has Indian style ligatures, where vowels tend to jump over or
>>>> under the previous letters, sometimes back 2 or three letters.
>>>> If you study the snippets above from the beginning of
>>>> Matthew, you can see there is a difference in order, and
>>>> some glyphs are modified.
>>>>
>>>> So, from the PDF letters are out of order, but from Pagemaker,
>>>> letters are encoded into control points. Fixing the control
>>>> points is easy and happens with the unicode conversion. Fixing
>>>> the letter order is not easy. You'll need a first language
>>>> speaker and plenty of time.
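
To illustrate the ordering problem only: a heavily simplified sketch, assuming the text has already been mapped to Unicode code points but is still in visual (left-to-right on the page) order. It handles a single case, the vowel sign E (U+1031), which is drawn to the left of its consonant but stored after it. The function name is hypothetical, and real Burmese text has more cases (stacked consonants, kinzi, and so on) that this does not cover.

```python
import re

# U+1031 MYANMAR VOWEL SIGN E is drawn before its consonant, but Unicode
# stores it after the consonant (U+1000..U+1021) and any medial signs
# (U+103B..U+103E).
_VISUAL_E = re.compile("\u1031([\u1000-\u1021][\u103B-\u103E]*)")


def visual_to_logical(text: str) -> str:
    """Move a leading vowel sign E behind the consonant cluster it belongs to.

    Apply this only to visual-order text: text that is already in logical
    order would be corrupted by the same substitution.
    """
    return _VISUAL_E.sub(lambda m: m.group(1) + "\u1031", text)
```

Even this one rule needs a guard against being applied twice, which hints at why a full reordering needs someone who can read the result.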
>>>>
>>>> The guidance I received on another group was to use either LO
>>>> Draw or Indesign to export the text from Pagemaker. I'll look
>>>> into LO Draw again, but I don't have access to an older version
>>>> of Indesign (the pagemaker import was removed in CS6).
>>>>
>>>>
>>>> On Mon, May 13, 2019 at 10:40 AM Michael H <cmahte at gmail.com> wrote:
>>>>
>>>> I unzipped the pagemaker file, and when I open
>>>> NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I can
>>>> 'find' all of the book names, and see the text there.
>>>>
>>>> To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip
>>>> and open it with a zip archive program. The text is in
>>>> the Pagemaker file at the top level of the archive, but
>>>> encoded with a lot of extraneous information. (The English
>>>> text "Matthew" appears at hex location 7A76972).
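
The hex-editor search described above could be sketched in Python as follows. This is only an illustration: `find_offsets` is a hypothetical helper, and it searches for plain ASCII bytes, which works for the English book names but says nothing about the encoding of the Burmese text.

```python
from typing import List


def find_offsets(data: bytes, needle: bytes) -> List[int]:
    """Return every byte offset at which `needle` occurs in `data`."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        # Continue one byte further so overlapping matches are also found.
        start = data.find(needle, start + 1)
    return offsets


# Usage sketch (the file name is the one mentioned above):
#   with open("NT_Proverb.pmd", "rb") as fh:
#       for offset in find_offsets(fh.read(), b"Matthew"):
#           print(f"{offset:X}")
```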
>>>>
>>>> When I open the fonts with fontforge, Fontforge suggests
>>>> the fonts are encoded as unicode (but the glyphs are
>>>> obviously not in the right spot.)
>>>> However when I copy the text (I copied from LO Draw) and
>>>> paste it into jedit and save that as unicode: Reopening the
>>>> file has a warning 'not unicode, text may be missing'.
>>>>
>>>> So, what this means is that there are some glyphs encoded
>>>> into locations that unicode treats as control or
>>>> non-printing codes. The text needs to be dealt with as a
>>>> specific encoding that matches whatever the original font
>>>> actually uses. I haven't figured out what the original text
>>>> files were encoded with. Without that knowledge, I'm not
>>>> sure my system clipboard or editor (jedit) will properly
>>>> respect the glyphs in unusual locations until the
>>>> conversion to unicode, and I don't trust myself to be able
>>>> to detect if it is or is not properly converted.
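
A quick way to check whether copied text contains characters in control-code positions is to count them. This is a minimal sketch, assuming the text has been read into a Python string; the function name is hypothetical, and it only looks at Unicode category `Cc` (C0 and C1 controls), treating tab, newline, and carriage return as legitimate.

```python
import unicodedata
from collections import Counter

# Whitespace controls that are expected in ordinary text files.
_ALLOWED = {"\t", "\n", "\r"}


def suspicious_code_points(text: str) -> Counter:
    """Count control characters (category Cc) other than tab, LF and CR."""
    return Counter(
        ch for ch in text
        if ch not in _ALLOWED and unicodedata.category(ch) == "Cc"
    )


# Usage sketch:
#   for ch, n in suspicious_code_points(sample).most_common():
#       print(f"U+{ord(ch):04X} x{n}")
```

A non-empty result does not prove the text is damaged, only that some glyphs sit where an editor or clipboard may silently drop them.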
>>>>
>>>> On Mon, May 13, 2019 at 10:11 AM Cyrille <lafricain79 at gmail.com> wrote:
>>>>
>>>> David,
>>>> Probably you are right about TECkit
>>>> <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>,
>>>> if we get the text, it will help us convert it to Unicode.
>>>> As for how to get the text, your method is beyond my
>>>> skills :)
>>>> If you succeed, please let me know.
>>>>
>>>> On 13/05/2019 16:21, David Haslam wrote:
>>>>> Given the insights from Michael Hart, it may be
>>>>> feasible to temporarily rearrange the main text stream
>>>>> as follows :
>>>>>
>>>>> 1. Replace every EOL by a horizontal tab.
>>>>> 2. Insert an EOL after each verse end character.
>>>>>
>>>>> Observe that the above two steps are wholly reversible
>>>>> such that the original text stream can be restored later.
>>>>>
>>>>> In effect the text stream is now in verse per line
>>>>> (VPL) layout, albeit without verse tags. Some
>>>>> adjustments may be necessary if there are any section
>>>>> headings, etc.
>>>>>
>>>>> 3. Add line numbers with the first number being reset
>>>>> to 1 at the start of each chapter, numbers
>>>>> incrementing by 1 for each line.
>>>>> 4. Add a left margin USFM verse tag \v_
>>>>>
>>>>> Steps 3&4 can be implemented in various ways. For my
>>>>> part, I’d use a bespoke TextPipe filter.
>>>>>
>>>>> Another method to consider might be to use Excel
>>>>> formulae. I recall resorting to such a method in the
>>>>> early days of Go Bible.
>>>>>
>>>>> Now restore the original layout by reverting steps 2 &
>>>>> 1, if this is really necessary. That is, if the
>>>>> original text layout appeared to be paragraphed.
>>>>>
>>>>> 5. Decide how & where to insert paragraph tags.
>>>>>
>>>>> 6. Add chapter tags, book ID and main title tags, etc.
>>>>>
>>>>> Hope this gives some useful suggestions that point
>>>>> towards a practical solution.
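
Steps 1 to 4 above, and the reversal of steps 2 and 1, could be sketched in Python roughly as follows. Assumptions: the verse end character is a single `/` (a placeholder, the real terminator has to be confirmed against the text), the source contains no tabs of its own, and the number of verses in each chapter is supplied by the caller. All function names are hypothetical.

```python
from typing import List

VERSE_END = "/"  # assumption: placeholder for the real verse end character


def to_verse_per_line(text: str, verse_end: str = VERSE_END) -> List[str]:
    """Steps 1 and 2: turn every EOL into a tab, then break after each verse end."""
    flat = text.replace("\n", "\t")
    parts = flat.split(verse_end)
    lines = [part + verse_end for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # trailing text without a verse end character
    return lines


def restore_layout(lines: List[str]) -> str:
    """Revert steps 2 and 1 (only exact if the source had no tabs of its own)."""
    return "".join(lines).replace("\t", "\n")


def tag_verses(lines: List[str], verses_per_chapter: List[int]) -> List[str]:
    """Steps 3 and 4: number the lines from 1 in each chapter and add a \\v tag."""
    if sum(verses_per_chapter) != len(lines):
        raise ValueError("line count does not match the expected verse count")
    remaining = iter(lines)
    tagged = []
    for verse_total in verses_per_chapter:
        for number in range(1, verse_total + 1):
            tagged.append(f"\\v {number} {next(remaining).strip()}")
    return tagged
```

Section headings and introductions would still have to be removed or marked before `tag_verses` is applied, as noted in the steps.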
>>>>>
>>>>> Best regards
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>> Sent from ProtonMail Mobile
>>>>>
>>>>>
>>>>> On Mon, May 13, 2019 at 14:57, Michael H <cmahte at gmail.com> wrote:
>>>>>> Cyrille
>>>>>>
>>>>>> LibreOffice Draw attempts to open the pagemaker file,
>>>>>> with limited success. But it confirms that even in
>>>>>> the pagemaker source, the verse numbers are a
>>>>>> separate text stream. With this source, there is no
>>>>>> way to copy the text with verse numbers intact. Each
>>>>>> book appears to be stored as its own text stream in the
>>>>>> Pagemaker file. LO Draw isn't rendering all of the
>>>>>> pages, only the first 10, so I've only explored
>>>>>> Matthew further.
>>>>>>
>>>>>> Based on Matthew only, the verses seem to all end
>>>>>> with the character "-" or ";/", which should aid in
>>>>>> the reconstruction. I've looked through the PDF and
>>>>>> this seems to be the case for all books visually as
>>>>>> well. However, this isn't perfect: I find 1107 of
>>>>>> these characters in Matthew, instead of the expected
>>>>>> 1071 verses. But since the text stream has a book
>>>>>> introduction, this is likely easily explained.
>>>>>> Hopefully this gets you well down the path to
>>>>>> creating a stream with verses.
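
The count comparison above (1107 found against 1071 expected, a surplus of 36) could be repeated for other books with a small check like this one. It is only a sketch: the function name is hypothetical, and the terminator strings passed in are whatever the text actually uses.

```python
from typing import Iterable


def terminator_surplus(text: str, terminators: Iterable[str], expected_verses: int) -> int:
    """Return how many more verse terminators were found than verses expected.

    A positive result suggests extra material such as a book introduction;
    a negative one suggests missing or differently marked verses.
    """
    found = sum(text.count(t) for t in terminators)
    return found - expected_verses
```

Running it per book rather than on the whole file keeps any mismatch local and easier to track down.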
>>>>>>
>>>>>> I would NOT start from the PDF file, but from the
>>>>>> pagemaker file. The PDF almost certainly has a lot
>>>>>> of text rearranging and extra characters like page
>>>>>> numbers and running heads. Pagemaker has the book
>>>>>> text in a single stream, in a form that will convert
>>>>>> to unicode relatively easily.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> sword-devel mailing list: sword-devel at crosswire.org
>>>>> http://www.crosswire.org/mailman/listinfo/sword-devel
>>>>> Instructions to unsubscribe/change your settings at above page