<div>If Michael’s observations are anything to go by, then maybe I can script the recovery of chapter & verse tags. </div><div><br></div><div>We shall see ....</div><div><br></div><div>Even if I’m not immediately successful - valuable lessons can be learned in the attempt. </div><div><br></div><div>David</div><div><br></div><div id="protonmail_mobile_signature_block"><div>Sent from ProtonMail Mobile</div></div> <div><br></div><div><br></div>On Tue, May 14, 2019 at 21:21, Cyrille <<a href="mailto:lafricain79@gmail.com" class="">lafricain79@gmail.com</a>> wrote:<blockquote class="protonmail_quote" type="cite">
OK, thank you! I already have all the text in Unicode, but without
the verse numbers and chapters... I have begun adding them manually...<br>
<br>
<div class="moz-cite-prefix">On 14/05/2019 22:17, David Haslam
wrote:<br>
</div>
<blockquote type="cite">
<div>Hi Cyrille </div>
<div><br>
</div>
<div>If I can find the time tomorrow or later, I’ll have a look at
what might be feasible. </div>
<div><br>
</div>
<div>Thanks for all these useful links. </div>
<div><br>
</div>
<div>David</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
On Tue, May 14, 2019 at 14:08, Cyrille <<a href="mailto:lafricain79@gmail.com" class="">lafricain79@gmail.com</a>> wrote:
<blockquote class="protonmail_quote" type="cite"> I send my
message again because it was too big.<br>
<br>
The conversion to UTF-8 is 99% solved!! I used an online
converter:<br>
<a class="moz-txt-link-freetext" href="https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html">https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html</a><br>
or:<br>
<a class="moz-txt-link-freetext" href="http://burglish.my-mm.org/latest/trunk/web/fontconv.htm">http://burglish.my-mm.org/latest/trunk/web/fontconv.htm</a><br>
<br>
See the result <a href="https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=">here</a>.<br>
<br>
Now the only problem is how to get the verse and chapter
numbers... <br>
<br>
<br>
<div class="moz-cite-prefix">On 14/05/2019 13:53, Michael H
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div class="gmail_default"><font size="4" face="garamond,
serif">Cyrille, (Peter), <br>
<br>
Maybe further discussion on this belongs in Gitlab
as issues. Can I get added to this project? <br>
<br>
Here are the first few lines of Matthew copied from
the PDF: </font><br>
------<br>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">&Sifrmaw;OD;
{0Ha*vdusrf;</div>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">The
Gospel According to Matthew</div>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">ed'gef;</div>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">usr;f
ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf r;f</div>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">usr;f
ûyy*k Kd¾v f &iS rf maw;O;Don f *gavav;,e,rf
S*sL;vrl sK;d tmvaf z;O;D \om;jzp\f / (rmu k2;14)</div>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">olonf
tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm</div>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">av0djzp\f
/ ool n f wad b;&,d tidk tf e;DwGi f a,Z;lociEf
iS ahf wG U Ny;D<br>
<br>
</div>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">-----</div>
<div class="gmail_default"><font size="4" face="garamond,
serif">And here are the first
few lines of Matthew copied from the Pagemaker
file: </font></div>
<div class="gmail_default"><font size="4" face="garamond,
serif">-----<br>
</font>
<div class="gmail_default"><font size="4" face="garamond, serif">Sifrmaw;OD; {0Ha*vdusrf;</font></div>
<div class="gmail_default"><font size="4" face="garamond, serif">The Gospel According to
Matthew</font></div>
<div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">ed'gef;</span><br>
</div>
<div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf
&Sifrmaw;OD;\b0rSwfwrf; </span><br>
</div>
<div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf
&Sifrmaw;OD;onf *gavav;,e,frS *sL;vlrsKd;
tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf
tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm
av0djzpf\/ olonf wdab;&d,tkdifteD;wGif
a,Zl;ocifESifhawGU NyD;<br>
<br>
<br>
You can see that some letters have changed, and
some others are in a different order. <br>
<br>
</span><span style="font-family:garamond,serif;font-size:large">The
letters that change are likely those points that
aren't compatible with unicode, and pagemaker
reassigned them to ensure that the file is more
widely viewable. Since a conversion is already
planned, these won't matter as much, but the
font embedded in the PDF is different from the
font attached to the PageMaker file. If you do
start from the PDF, you'll need to extract the
font to get the code points. </span><br style="font-family:garamond,serif;font-size:large">
<span style="font-family:garamond,serif;font-size:large"><br>
The problem is that the PDF export from
PageMaker sorts the letters into the order in which they
appear on the page. Burmese text has Indic-style
ligatures, where vowels tend to jump over
or under the previous letters, sometimes back two
or three letters. If you study the
snippets above from the beginning of Matthew, you can
see that the order differs and that
some glyphs are modified. <br>
<br>
So, from the PDF, the letters are out of order, but
from PageMaker, some letters are encoded at control
code points. Fixing the control code points is easy and
happens with the Unicode conversion. Fixing the
letter order is not easy. You'll need a first
language speaker and plenty of time. </span></div>
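As a rough illustration of the letter-order problem: in Unicode, the Myanmar vowel sign E (U+1031) is stored after its consonant even though it is drawn to the left of it, so text captured in visual order has that sign one position too early. The sketch below assumes the text has already been converted to Unicode and handles only this one vowel sign; the function name is hypothetical, and real text would need many more rules plus a fluent reader to check the result.

```python
import re

# Visual order: vowel sign E (U+1031) comes before the consonant it belongs to.
# Logical (Unicode) order: consonant, optional medials (U+103B..U+103E), then U+1031.
_VISUAL_E = re.compile("\u1031([\u1000-\u1021][\u103b-\u103e]*)")


def reorder_vowel_e(text: str) -> str:
    """Move a leading vowel sign E behind its consonant and any medials."""
    return _VISUAL_E.sub("\\1\u1031", text)
```

This covers a single, well-known case only; it is not a substitute for a full conversion table.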
<div class="gmail_default"><span style="font-family:garamond,serif;font-size:large"><br>
The guidance I received on another group was to
use either LO Draw or InDesign to export the
text from PageMaker. I'll look into LO Draw
again, but I don't have access to an older
version of InDesign (the PageMaker import was
removed in CS6). </span><span style="font-family:garamond,serif;font-size:large"><br>
</span></div>
</div>
</div>
</div>
</div>
</div>
<div dir="ltr">
<div class="gmail_default" style="font-family:garamond,serif;font-size:large"><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, May 13, 2019 at
10:40 AM Michael H <<a href="mailto:cmahte@gmail.com">cmahte@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">I
unzipped the PageMaker file, and when I open
NT_Proverb/Pagemaker (10.1 MB) with a hex editor, I
can find all of the book names and see the text
there. <br>
<br>
To see the raw text: rename NT_Proverb.pmd >
NT_Proverb.zip and open it with a zip archive
program. The text is in the Pagemaker file at the
top level of the archive, but encoded with a lot of
extraneous information. (The English text "Matthew"
appears at hex location 7A76972). <br>
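The same check can be scripted instead of using a hex editor. This is a minimal sketch, assuming the archive member is named "Pagemaker" as described above and that the English book names are stored as plain single-byte text; the function name is hypothetical. Python's zipfile module reads the archive by content, so the rename to .zip should not be needed for this.

```python
import zipfile


def find_in_archive(archive_path, member, needle):
    """Return the byte offset of needle inside one archive member, or -1."""
    with zipfile.ZipFile(archive_path) as archive:
        data = archive.read(member)
    return data.find(needle)


# Example call (paths are assumptions taken from the description above):
# offset = find_in_archive("NT_Proverb.pmd", "Pagemaker", b"Matthew")
# print(hex(offset))
```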
<br>
When I open the fonts with FontForge, it
suggests the fonts are encoded as Unicode (but the
glyphs are obviously not in the right places). <br>
However, when I copy the text (I copied from LO Draw),
paste it into jEdit, and save that as Unicode,
reopening the file gives the warning 'not unicode, text
may be missing'. <br>
<br>
So, what this means is that there are some glyphs
encoded into locations that unicode treats as control
or non-printing codes. The text needs to be dealt with
as a specific encoding that matches whatever the
original font actually uses. I haven't figured out
what the original text files were encoded with.
Without that knowledge, I'm not sure my system
clipboard or editor (jedit) will properly respect the
glyphs in unusual locations until the conversion to
unicode, and I don't trust myself to be able to detect
if it is or is not properly converted. <br>
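One way to check a copied text for glyphs sitting at control locations, without relying on the eye, is to list every character that Unicode classifies as a control code. The sketch below is only an illustration: the function name is hypothetical, and treating tab, line feed and carriage return as harmless is an assumption.

```python
import unicodedata
from collections import Counter


def control_code_points(text, allowed="\t\n\r"):
    """Count characters in Unicode category Cc, ignoring ordinary whitespace controls."""
    return Counter(
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Cc" and ch not in allowed
    )
```

An empty result after the conversion would suggest that no glyphs are left at control locations, although it says nothing about whether the remaining letters were mapped correctly.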
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, May 13, 2019
at 10:11 AM Cyrille <<a href="mailto:lafricain79@gmail.com">lafricain79@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> David,<br>
You are probably right about <a href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit">TECkit</a>; if we get the
text, it will help us convert it to Unicode.<br>
As for how to get the text, your method is beyond my
skills :)<br>
If you succeed, please let me know.<br>
<br>
<div class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-cite-prefix">On
13/05/2019 16:21, David Haslam wrote:<br>
</div>
<blockquote type="cite">
<div>Given the insights from Michael Hart, it may
be feasible to temporarily rearrange the main
text stream as follows :</div>
<div><br>
</div>
<div>1. Replace every EOL by a horizontal tab. </div>
<div>2. Insert an EOL after each verse end
character. </div>
<div><br>
</div>
<div>Observe that the above two steps are
wholly reversible such that the original text
stream can be restored later. </div>
<div><br>
</div>
<div>In effect the text stream is now in verse per
line (VPL) layout, albeit without verse tags.
Some adjustments may be necessary if there are any
section headings, etc. </div>
<div><br>
</div>
<div>3. Add line numbers with the first number
being reset to 1 at the start of each chapter,
numbers incrementing by 1 for each line. </div>
<div>4. Add a left margin USFM verse tag \v_<br>
</div>
<div><br>
</div>
<div id="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_mobile_signature_block">
<div>Steps 3&4 can be implemented in various
ways. For my part, I’d use a bespoke TextPipe
filter. </div>
<div><br>
</div>
<div>Another method to consider might be to use
Excel formulae. I recall resorting to such a
method in the early days of Go Bible. </div>
<div><br>
</div>
<div>Now restore the original layout by
reverting steps 2 & 1, if this is really
necessary. That is, if the original text
layout appeared to be paragraphed. </div>
<div><br>
</div>
<div>5. Decide how & where to insert
paragraph tags. </div>
<div><br>
</div>
<div>6. Add chapter tags, book ID and main title
tags, etc. </div>
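Steps 1 to 4 and the later reversal can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the TextPipe filter mentioned above: the function names are hypothetical, the verse-end marker is passed in as a parameter (the default "/" is a guess), the input is assumed to be one chapter with no tab characters of its own, and the "_" in \v_ is read as a space.

```python
def to_vpl(stream, verse_end="/"):
    """Steps 1-2: replace every EOL by a tab, then add an EOL after each verse end."""
    if "\t" in stream:
        raise ValueError("stream already contains tabs; steps 1-2 would not be reversible")
    return stream.replace("\n", "\t").replace(verse_end, verse_end + "\n")


def from_vpl(vpl):
    """Revert step 2 (drop the added EOLs), then step 1 (tabs back to EOLs)."""
    return vpl.replace("\n", "").replace("\t", "\n")


def tag_verses(chapter_lines):
    """Steps 3-4: number the lines of one chapter from 1 and add a USFM verse tag."""
    tagged, number = [], 0
    for line in chapter_lines:
        text = line.lstrip("\t ")
        if not text:  # only layout whitespace: leave it untouched and unnumbered
            tagged.append(line)
            continue
        number += 1
        lead = line[: len(line) - len(text)]  # keep layout characters before the tag
        tagged.append(f"{lead}\\v {number} {text}")
    return tagged


# Example: tag one chapter, then restore its original line layout.
# vpl = to_vpl(chapter_text)
# restored = from_vpl("\n".join(tag_verses(vpl.split("\n"))))
```

Placing the tag after any leading tab keeps it at the start of the verse text once the original line breaks are restored.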
<div><br>
</div>
<div>Hope this gives some useful suggestions
that point towards a practical solution. </div>
<div><br>
</div>
<div>Best regards </div>
<div><br>
</div>
<div>David</div>
<div><br>
</div>
<div><br>
</div>
</div>
<div><br>
</div>
<div><br>
</div>
On Mon, May 13, 2019 at 14:57, Michael H <<a href="mailto:cmahte@gmail.com">cmahte@gmail.com</a>>
wrote:
<blockquote class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_quote" type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div class="gmail_default" style="font-family:garamond,serif;font-size:large">Cyrille<br>
<br>
LibreOffice Draw attempts to open the
PageMaker file, with limited success.
But it confirms that even in the
PageMaker source, the verse numbers
are a separate text stream. With this
source, there is no way to copy the
text with verse numbers intact. Each
book appears to be stored in its own
text stream in the PageMaker
file. LO Draw isn't rendering all of
the pages, only the first 10, so I've
only explored Matthew further. <br>
<br>
Based on Matthew only, the verses seem
to all end with the character "-" or
";/", which should aid in the
reconstruction. I've looked through
the PDF and this seems to be the case
for all books visually as well.
However, this isn't perfect: I find
1107 of these characters in Matthew,
instead of the expected 1071 verses.
But since the text stream has a book
introduction, this is likely easily
explained. Hopefully this gets you
well down the path to creating a
stream with verses. <br>
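The count described above is easy to reproduce on an extracted text stream. The following is a minimal sketch with a hypothetical function name; the two markers are taken from the observation above and should be adjusted if other verse endings turn up.

```python
import re
from collections import Counter


def count_verse_ends(text, markers=("-", ";/")):
    """Count occurrences of each verse-end marker, trying longer markers first."""
    ordered = sorted(markers, key=len, reverse=True)
    pattern = "|".join(re.escape(marker) for marker in ordered)
    return Counter(re.findall(pattern, text))
```

Comparing sum(count_verse_ends(text).values()) with the expected verse count, ideally chapter by chapter, should show where the introduction or other extra material inflates the total.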
<br>
I would NOT start from the PDF file,
but from the pagemaker file. The PDF
almost certainly has a lot of text
rearranging and extra characters like
page numbers and running heads.
Pagemaker has the book text in a
single stream, in a form that will
convert to unicode relatively easily. </div>
<div class="gmail_default" style="font-family:garamond,serif;font-size:large"><br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<br>
<fieldset class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636mimeAttachmentHeader"></fieldset>
<pre class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-quote-pre">_______________________________________________
sword-devel mailing list: <a class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a>
<a class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page</pre>
</blockquote>
<br>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<br>
</blockquote>
<br>
</blockquote><div><br></div><div><br></div>