<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<br>
<br>
<div class="moz-cite-prefix">On 14/05/2019 22:26, David Haslam
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:9JK6fdu-uy4_G3T4aCrM581jrnnd4GAExQ3bJV_Ayu8AovpRa_U-GG8XldgmdXx_s40UpA3rzrBCADGwZgvOkt0NhZxkHdKeCO9QeGNsT14=@protonmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div>If Michael’s observations are anything to go by, then maybe I
can script the recovery of chapter & verse tags. </div>
<div><br>
</div>
<div>We shall see ....</div>
<div><br>
</div>
<div>Even if I’m not immediately successful - valuable lessons can
be learned in the attempt. <br>
</div>
</blockquote>
Very well, I'll wait for you ;)<br>
<blockquote type="cite"
cite="mid:9JK6fdu-uy4_G3T4aCrM581jrnnd4GAExQ3bJV_Ayu8AovpRa_U-GG8XldgmdXx_s40UpA3rzrBCADGwZgvOkt0NhZxkHdKeCO9QeGNsT14=@protonmail.com">
<div><br>
</div>
<div>David</div>
<div><br>
</div>
<div id="protonmail_mobile_signature_block">
<div>Sent from ProtonMail Mobile</div>
</div>
<div><br>
</div>
<div><br>
</div>
On Tue, May 14, 2019 at 21:21, Cyrille <<a
href="mailto:lafricain79@gmail.com" class=""
moz-do-not-send="true">lafricain79@gmail.com</a>> wrote:
<blockquote class="protonmail_quote" type="cite"> OK, thank you! I
already have all the text in Unicode, but without the verse
and chapter numbers... I have begun adding them manually...<br>
<br>
<div class="moz-cite-prefix">On 14/05/2019 22:17, David Haslam
wrote:<br>
</div>
<blockquote type="cite">
<div>Hi Cyrille </div>
<div><br>
</div>
<div>If I can find the time tomorrow or later, I’ll have a
look at what might be feasible. </div>
<div><br>
</div>
<div>Thanks for all these useful links. </div>
<div><br>
</div>
<div>David</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
On Tue, May 14, 2019 at 14:08, Cyrille <<a
href="mailto:lafricain79@gmail.com" class=""
moz-do-not-send="true">lafricain79@gmail.com</a>> wrote:
<blockquote class="protonmail_quote" type="cite"> I am sending my
message again because the first one was too big.<br>
<br>
The conversion to UTF-8 is 99% solved!! I used an online
converter:<br>
<a class="moz-txt-link-freetext"
href="https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html"
moz-do-not-send="true">https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html</a><br>
or:<br>
<a class="moz-txt-link-freetext"
href="http://burglish.my-mm.org/latest/trunk/web/fontconv.htm"
moz-do-not-send="true">http://burglish.my-mm.org/latest/trunk/web/fontconv.htm</a><br>
<br>
See the result <a
href="https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA="
moz-do-not-send="true">here</a>.<br>
<br>
Now the only problem is how to get the verse and chapter
numbers... <br>
<br>
<br>
<div class="moz-cite-prefix">On 14/05/2019 13:53, Michael H
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div class="gmail_default"><font size="4"
face="garamond,
 serif">Cyrille, (Peter), <br>
<br>
Maybe further discussion on this belongs in
Gitlab as issues. Can I get added to this
project? <br>
<br>
Here are the first few lines of Matthew copied
from the PDF: </font><br>
------<br>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">&Sifrmaw;OD;
{0Ha*vdusrf;</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">The
Gospel According to Matthew</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">ed'gef;</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">usr;f
ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">usr;f
ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf
S*sL;vrl sK;d tmvaf z;O;D \om;jzp\f / (rmu
k2;14)</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">olonf
tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">av0djzp\f
/ ool n f wad b;&,d tidk tf e;DwGi f
a,Z;lociEf iS ahf wG U Ny;D<br>
<br>
</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">-----</div>
<div class="gmail_default"><font size="4"
face="garamond,
 serif">And here are the
first few lines of Matthew copied from the
Pagemaker file: </font></div>
<div class="gmail_default"><font size="4"
face="garamond,
 serif">-----<br>
</font>
<div class="gmail_default"><font size="4"
face="garamond, serif">Sifrmaw;OD;
{0Ha*vdusrf;</font></div>
<div class="gmail_default"><font size="4"
face="garamond, serif">The Gospel According
to Matthew</font></div>
<div class="gmail_default"><span
style="font-family:garamond,serif;font-size:large">ed'gef;</span><br>
</div>
<div class="gmail_default"><span
style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf
&Sifrmaw;OD;\b0rSwfwrf; </span><br>
</div>
<div class="gmail_default"><span
style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf
&Sifrmaw;OD;onf *gavav;,e,frS
*sL;vlrsKd; tmvfaz;OD;\om;jzpf\/ (rmuk 2;14)
olonf tcGefcHoltjzpf trIxrf;chJonf/ (vk
5;27) a,Zl;ocif\aemufvdkufwynfhrjzpfrD
ol\trnfrSm av0djzpf\/ olonf
wdab;&d,tkdifteD;wGif
a,Zl;ocifESifhawGU NyD;<br>
<br>
<br>
You can see that some letters have changed,
and some others are in a different order. <br>
<br>
</span><span
style="font-family:garamond,serif;font-size:large">The
letters that change are likely those at code points
that aren't compatible with Unicode, which
PageMaker reassigned to ensure that the
file is more widely viewable. Since a
conversion is already planned, these won't
matter as much. However, the font embedded in the
PDF is different from the font attached to
the PageMaker file, so if you do start from
the PDF, you'll need to extract the font to
get the code points. </span><br
style="font-family:garamond,serif;font-size:large">
<span
style="font-family:garamond,serif;font-size:large"><br>
The problem is that the PDF export from
PageMaker sorts the letters into the order
in which they appear on the page. Burmese text has
Indic-style ligatures, where vowels tend to
jump over or under the previous letters,
sometimes back two or three letters. If you
study the snippets above from the
beginning of Matthew, you can see that there is a
difference in order, and that some glyphs
are modified. <br>
<br>
So, from the PDF the letters are out of order,
but from PageMaker the letters are encoded at
control-code positions. Fixing the control-code positions is
easy and happens with the Unicode
conversion. Fixing the letter order is not
easy. You'll need a first-language speaker
and plenty of time. </span></div>
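<div class="gmail_default">A minimal Python sketch of one way to see
which characters of a pasted text sit at positions that Unicode treats
as control, format, private-use or unassigned code points. It relies
only on the standard <code>unicodedata</code> module. The function name
and the sample string are made up for illustration; the sample is not
taken from the actual file.</div>
<pre>
```python
import unicodedata
from collections import Counter

# General categories that usually signal "not a real letter here":
# Cc = control, Cf = format, Co = private use, Cn = unassigned.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}


def suspicious_code_points(text):
    """Count characters whose Unicode category is in SUSPECT_CATEGORIES.

    Ordinary line breaks and tabs are skipped, since they are expected
    in any text stream.
    """
    counts = Counter()
    for ch in text:
        if ch in "\n\r\t":
            continue
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            counts[f"U+{ord(ch):04X}"] += 1
    return counts


# Hypothetical sample: two C1 control codes mixed into legacy-font text.
sample = "usrf;\x8fyyk*\x9ddKvf"
for code_point, n in sorted(suspicious_code_points(sample).items()):
    print(code_point, n)
```
</pre>
<div class="gmail_default">An empty result would suggest that the text
survived the clipboard intact; any hits mark the glyphs that need the
font-specific mapping before the Unicode conversion.</div>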
<div class="gmail_default"><span
style="font-family:garamond,serif;font-size:large"><br>
The guidance I received on another group was
to use either LO Draw or InDesign to export
the text from PageMaker. I'll look into LO
Draw again, but I don't have access to an
older version of InDesign (the PageMaker
import was removed in CS6). </span><span
style="font-family:garamond,serif;font-size:large"><br>
</span></div>
</div>
</div>
</div>
</div>
</div>
<div dir="ltr">
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large"><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, May 13, 2019
at 10:40 AM Michael H <<a
href="mailto:cmahte@gmail.com"
moz-do-not-send="true">cmahte@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px
 0px
 0.8ex;border-left:1px solid

rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">I
unzipped the PageMaker file, and when I open
NT_Proverb/Pagemaker (10.1 MB) with a hex editor,
I can find all of the book names and see the
text there. <br>
<br>
To see the raw text, rename NT_Proverb.pmd to
NT_Proverb.zip and open it with a zip archive
program. The text is in the Pagemaker file at
the top level of the archive, but encoded with a
lot of extraneous information. (The English text
"Matthew" appears at hex location 7A76972). <br>
<br>
When I open the fonts with FontForge, it
suggests the fonts are encoded as Unicode (but the
glyphs are obviously not in the right spots). <br>
However, when I copy the text (I copied from LO
Draw), paste it into jEdit, save it as
Unicode and reopen the file, I get the warning 'not
Unicode, text may be missing'. <br>
<br>
So, what this means is that there are some glyphs
encoded at locations that Unicode treats as
control or non-printing codes. The text needs to
be dealt with as a specific encoding that matches
whatever the original font actually uses. I
haven't figured out what the original text files
were encoded with. Without that knowledge, I'm not
sure my system clipboard or editor (jEdit) will
properly respect the glyphs in unusual locations
until the conversion to Unicode, and I don't trust
myself to be able to detect whether the text has been
properly converted. <br>
</div>
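<div class="gmail_default">A minimal Python sketch of the hex-editor
search described above: it reports every byte offset at which an ASCII
string such as "Matthew" occurs in a binary blob. The function name and
the sample bytes are invented for illustration; nothing here parses the
PageMaker format itself.</div>
<pre>
```python
def find_offsets(data, needle):
    """Return every byte offset at which needle occurs in data."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        # Continue one byte further so overlapping hits are not missed.
        start = data.find(needle, start + 1)
    return offsets


# Hypothetical stand-in for the extracted binary stream.
blob = b"\x00\x01junk" + b"Matthew" + b"\x00" * 4 + b"Matthew"

for offset in find_offsets(blob, b"Matthew"):
    # Print in the same hexadecimal form a hex editor would show.
    print(f"0x{offset:X}")
```
</pre>
<div class="gmail_default">For a real file, the bytes could be loaded
with <code>pathlib.Path(...).read_bytes()</code> and passed in the same
way; the path is left out here because it depends on where the archive
was unpacked.</div>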
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, May 13,
2019 at 10:11 AM Cyrille <<a
href="mailto:lafricain79@gmail.com"
moz-do-not-send="true">lafricain79@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px
 0px
 0.8ex;border-left:1px
solid

 rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> David,<br>
You are probably right about <a
href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit"
moz-do-not-send="true">TECkit</a>: once we have
the text, it will help us convert it to Unicode.<br>
As for how to get the text, your method is beyond
my skills :)<br>
If you succeed, please let me know.<br>
<br>
<div
class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-cite-prefix">On
13/05/2019 16:21, David Haslam wrote:<br>
</div>
<blockquote type="cite">
<div>Given the insights from Michael Hart, it
may be feasible to temporarily rearrange the
main text stream as follows :</div>
<div><br>
</div>
<div>1. Replace every EOL by a horizontal
tab. </div>
<div>2. Insert an EOL after each verse end
character. </div>
<div><br>
</div>
<div>Observe that the above two steps are
wholly reversible such that the original
text stream can be restored later. </div>
<div><br>
</div>
<div>In effect the text stream is now in verse
per line (VPL) layout, albeit without verse
tags. Some adjustments may be necessary if
there are any section headings, etc. </div>
<div><br>
</div>
<div>3. Add line numbers with the first number
being reset to 1 at the start of each
chapter, numbers incrementing by 1 for each
line. </div>
<div>4. Add a left margin USFM verse tag \v_<br>
</div>
<div><br>
</div>
<div
id="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_mobile_signature_block">
<div>Steps 3 and 4 can be implemented in
various ways. For my part, I’d use a
bespoke TextPipe filter. </div>
<div><br>
</div>
<div>Another method to consider might be to
use Excel formulae. I recall resorting to
such a method in the early days of Go
Bible. </div>
<div><br>
</div>
<div>Now restore the original layout by
reverting steps 2 & 1, if this is
really necessary. That is, if the original
text layout appeared to be paragraphed. </div>
<div><br>
</div>
<div>5. Decide how & where to insert
paragraph tags. </div>
<div><br>
</div>
<div>6. Add chapter tags, book ID and main
title tags, etc. </div>
<div><br>
</div>
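<div>The numbered steps above could also be scripted. What follows is
a minimal Python sketch under stated assumptions, not the TextPipe or
Excel approach mentioned in the message. The verse-end markers default
to the ";/" and "-" reported elsewhere in this thread for Matthew, and
<code>verses_per_chapter</code> is a hypothetical input that would have
to come from a versification table. The <code>\c</code> lines are added
here only to show where each chapter starts. The round trip is
reversible only if the original text contains no tab characters.</div>
<pre>
```python
import re


def to_verse_per_line(text, verse_end=(";/", "-")):
    """Steps 1 and 2: flatten the text, then break after each verse end."""
    # Step 1: every EOL becomes a tab, so the old line breaks stay recoverable.
    flat = text.replace("\r\n", "\n").replace("\n", "\t")
    # Longest marker first, so a long marker wins over a shorter one inside it.
    markers = sorted(verse_end, key=len, reverse=True)
    pattern = "|".join(re.escape(m) for m in markers)
    # Step 2: insert an EOL after each verse-end marker.
    return re.sub(f"({pattern})", r"\1\n", flat)


def restore_layout(vpl):
    """Revert step 2 (drop inserted EOLs), then step 1 (tabs back to EOLs)."""
    return vpl.replace("\n", "").replace("\t", "\n")


def tag_verses(vpl, verses_per_chapter):
    """Steps 3 and 4: number the lines per chapter and prefix USFM tags."""
    lines = [ln.strip() for ln in vpl.split("\n") if ln.strip()]
    if len(lines) != sum(verses_per_chapter):
        raise ValueError(
            f"{len(lines)} lines found, {sum(verses_per_chapter)} verses expected"
        )
    tagged, pos = [], 0
    for chapter, count in enumerate(verses_per_chapter, start=1):
        tagged.append(f"\\c {chapter}")
        for verse in range(1, count + 1):  # numbering restarts at 1 per chapter
            tagged.append(f"\\v {verse} {lines[pos]}")
            pos += 1
    return tagged


# Hypothetical three-verse sample split over two chapters.
sample = "aaa;/ bbb\nccc;/ ddd;/\n"
vpl = to_verse_per_line(sample)
print("\n".join(tag_verses(vpl, [2, 1])))
```
</pre>
<div>The length check in <code>tag_verses</code> is deliberate: section
headings or introductions would add extra lines, and it seems safer to
stop than to shift every following verse number.</div>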
<div>Hope this gives some useful suggestions
that point towards a practical solution. </div>
<div><br>
</div>
<div>Best regards </div>
<div><br>
</div>
<div>David</div>
<div><br>
</div>
<div><br>
</div>
</div>
<div><br>
</div>
<div><br>
</div>
On Mon, May 13, 2019 at 14:57, Michael H <<a
href="mailto:cmahte@gmail.com"
moz-do-not-send="true">cmahte@gmail.com</a>>
wrote:
<blockquote
class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_quote"
type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">Cyrille<br>
<br>
LibreOffice Draw attempts to open
the PageMaker file, with limited
success. But it confirms that even
in the PageMaker source, the verse
numbers are a separate text
stream. With this source, there is
no way to copy the text with verse
numbers intact. Each book appears to be
stored in its own
text stream in the PageMaker
file. LO Draw isn't
rendering all of the pages, only
the first 10, so I've only
explored Matthew further. <br>
<br>
Based on Matthew only, the verses
seem to all end with the character
"-" or ";/", which should aid in
the reconstruction. I've looked
through the PDF and this seems to
be the case for all books visually
as well. However, this isn't
perfect: I find 1107 of these
characters in Matthew, instead of
the expected 1071 verses. But
since the text stream has a book
introduction, this is likely
easily explained. Hopefully this
gets you well down the path to
creating a stream with verses. <br>
<br>
I would NOT start from the PDF
file, but from the pagemaker
file. The PDF almost certainly
has a lot of text rearranging and
extra characters like page numbers
and running heads. Pagemaker has
the book text in a single stream,
in a form that will convert to
unicode relatively easily. </div>
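<div class="gmail_default">A minimal Python sketch of the count
described above: it tallies the candidate verse-end markers in a text
stream and compares the total with the expected number of verses. The
function name and the sample string are invented for illustration; the
markers "-" and ";/" and the figures 1107 and 1071 are the ones quoted
in the message.</div>
<pre>
```python
def marker_report(text, expected_verses, markers=("-", ";/")):
    """Count each marker and return (counts, total, surplus over expected)."""
    counts = {m: text.count(m) for m in markers}
    total = sum(counts.values())
    return counts, total, total - expected_verses


# Hypothetical sample with three markers where two verses are expected.
counts, total, surplus = marker_report("ab;/cd-ef;/", expected_verses=2)
print(counts, total, surplus)

# With the figures from the message: 1107 markers against 1071 verses.
print(1107 - 1071)
```
</pre>
<div class="gmail_default">A positive surplus is the number of markers
that have to be explained by something other than verse ends, such as
the book introduction.</div>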
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large"><br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<br>
<fieldset
class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636mimeAttachmentHeader"></fieldset>
<pre class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-quote-pre">_______________________________________________
sword-devel mailing list: <a class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org" moz-do-not-send="true">sword-devel@crosswire.org</a>
<a class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel" moz-do-not-send="true">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page</pre>
</blockquote>
<br>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<br>
</blockquote>
<br>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<br>
</blockquote>
<br>
</body>
</html>