<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<br>
<br>
<div class="moz-cite-prefix">On 14/05/2019 22:48, Michael H
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAJ9hia-speB1UPxm+CofuJg6L7VoT6mfx8bsQsNkYshEO-_Prw@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">You should
be able to configure a regex search to find the verse
boundaries. <br>
<br>
Once you have the verse boundaries and have laid the text out
verse per line, it should be possible to assign each row a
chapter and verse number from a reference. For example, the 3341st
verse in the New Testament is usually John 20:31 (I don't have
that memorized; it's just an example). <br>
</div>
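<div>A minimal Python sketch of this idea, under assumptions that are not
stated in the thread: the verse end marker is supplied by the caller, and the
reference is a plain list of verse counts per chapter for one book. The names
split_verses and label_verses are invented for illustration.</div>
<pre>
```python
import re


def split_verses(text, end_marker):
    """Split a text stream into verses, keeping the end marker with each verse.

    end_marker is an assumption: whatever string closes a verse in the source.
    Any trailing text that is not followed by a marker is ignored.
    """
    pattern = ".*?" + re.escape(end_marker)
    return [m.group(0).strip() for m in re.finditer(pattern, text, flags=re.DOTALL)]


def label_verses(verses, verses_per_chapter):
    """Pair each verse with (chapter, verse) using a reference list of chapter lengths."""
    labelled = []
    position = 0
    for chapter, count in enumerate(verses_per_chapter, start=1):
        for verse in range(1, count + 1):
            if position == len(verses):
                # Ran out of text before the reference was exhausted.
                return labelled
            labelled.append((chapter, verse, verses[position]))
            position += 1
    return labelled
```
</pre>
<div>If the number of split verses does not match the reference total, the
labels after the first mismatch cannot be trusted and would need checking by
hand.</div>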
</div>
</blockquote>
<br>
I have no idea how to do this :)<br>
<blockquote type="cite"
cite="mid:CAJ9hia-speB1UPxm+CofuJg6L7VoT6mfx8bsQsNkYshEO-_Prw@mail.gmail.com"><br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, May 14, 2019 at 3:22
PM Cyrille <<a href="mailto:lafricain79@gmail.com"
moz-do-not-send="true">lafricain79@gmail.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> OK, thank you! I already have all the
text in Unicode, but without the verse and chapter
numbers... I have begun adding them manually...<br>
<br>
<div class="gmail-m_-4094282784364978796moz-cite-prefix">On
14/05/2019 22:17, David Haslam wrote:<br>
</div>
<blockquote type="cite">
<div>Hi Cyrille </div>
<div><br>
</div>
<div>If I can find the time tomorrow or later, I’ll have a
look at what might be feasible. </div>
<div><br>
</div>
<div>Thanks for all these useful links. </div>
<div><br>
</div>
<div>David</div>
<div><br>
</div>
<div
id="gmail-m_-4094282784364978796protonmail_mobile_signature_block">
<div>Sent from ProtonMail Mobile</div>
</div>
<div><br>
</div>
<div><br>
</div>
On Tue, May 14, 2019 at 14:08, Cyrille <<a
href="mailto:lafricain79@gmail.com" target="_blank"
moz-do-not-send="true">lafricain79@gmail.com</a>>
wrote:
<blockquote
class="gmail-m_-4094282784364978796protonmail_quote"
type="cite"> I am sending my message again because the first
one was too big.<br>
<br>
The conversion to UTF-8 is 99% solved!! I used an online
converter:<br>
<a
class="gmail-m_-4094282784364978796moz-txt-link-freetext"
href="https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html"
target="_blank" moz-do-not-send="true">https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html</a><br>
or:<br>
<a
class="gmail-m_-4094282784364978796moz-txt-link-freetext"
href="http://burglish.my-mm.org/latest/trunk/web/fontconv.htm"
target="_blank" moz-do-not-send="true">http://burglish.my-mm.org/latest/trunk/web/fontconv.htm</a><br>
<br>
See the result <a
href="https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA="
target="_blank" moz-do-not-send="true">here</a>.<br>
<br>
Now the only problem is how to get the verse and chapter
numbers... <br>
<br>
<br>
<div class="gmail-m_-4094282784364978796moz-cite-prefix">On
14/05/2019 13:53, Michael H wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div class="gmail_default"><font size="4"
face="garamond,
 serif">Cyrille,
(Peter), <br>
<br>
Maybe further discussion on this belongs in
Gitlab as issues. Can I get added to this
project? <br>
<br>
Here are the first few lines of Matthew
copied from the PDF: </font><br>
------<br>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">&Sifrmaw;OD;
{0Ha*vdusrf;</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">The
Gospel According to Matthew</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">ed'gef;</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">usr;f
ûyy*k Kd¾v f &iS rf maw;O;D \b0rwS wf
r;f</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">usr;f
ûyy*k Kd¾v f &iS rf maw;O;Don f
*gavav;,e,rf S*sL;vrl sK;d tmvaf z;O;D
\om;jzp\f / (rmu k2;14)</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">olonf
tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">av0djzp\f
/ ool n f wad b;&,d tidk tf e;DwGi f
a,Z;lociEf iS ahf wG U Ny;D<br>
<br>
</div>
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">-----</div>
<div class="gmail_default"><font size="4"
face="garamond,
 serif">And here are
the first few lines of Matthew copied from
the Pagemaker file: </font></div>
<div class="gmail_default"><font size="4"
face="garamond,
 serif">-----<br>
</font>
<div class="gmail_default"><font size="4"
face="garamond, serif">Sifrmaw;OD;
{0Ha*vdusrf;</font></div>
<div class="gmail_default"><font size="4"
face="garamond, serif">The Gospel
According to Matthew</font></div>
<div class="gmail_default"><span
style="font-family:garamond,serif;font-size:large">ed'gef;</span><br>
</div>
<div class="gmail_default"><span
style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf
&Sifrmaw;OD;\b0rSwfwrf; </span><br>
</div>
<div class="gmail_default"><span
style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf
&Sifrmaw;OD;onf *gavav;,e,frS
*sL;vlrsKd; tmvfaz;OD;\om;jzpf\/ (rmuk
2;14) olonf tcGefcHoltjzpf
trIxrf;chJonf/ (vk 5;27)
a,Zl;ocif\aemufvdkufwynfhrjzpfrD
ol\trnfrSm av0djzpf\/ olonf
wdab;&d,tkdifteD;wGif
a,Zl;ocifESifhawGU NyD;<br>
<br>
<br>
You can see that some letters have
changed, and some others are in a
different order. <br>
<br>
</span><span
style="font-family:garamond,serif;font-size:large">The
letters that change are likely those at code
points that aren't compatible with
Unicode, which Pagemaker reassigned
to ensure that the file is more widely
viewable. Since a conversion is already
planned, these won't matter as much. However,
the font embedded in the PDF is
different from the font attached to the
Pagemaker file, so if you do start from
the PDF, you'll need to extract the font
to get the code points. </span><br
style="font-family:garamond,serif;font-size:large">
<span
style="font-family:garamond,serif;font-size:large"><br>
The problem is that the PDF export from
Pagemaker sorts the letters into the
order they appear on the page. Burmese
text has Indian-style ligatures, where
vowels tend to jump over or under the
previous letters, sometimes back two or
three letters. If you study the
snippets above from the beginning of
Matthew, you can see that there is a
difference in order, and that some
glyphs are modified. <br>
<br>
So, from the PDF letters are out of
order, but from Pagemaker, letters are
encoded into control points. Fixing the
control points is easy and happens with
the unicode conversion. Fixing the
letter order is not easy. You'll need a
first language speaker and plenty of
time. </span></div>
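<div>One rough way to tell the two kinds of difference apart is to check
whether two fragments contain exactly the same characters in a different
order (a reordering) or not (a changed glyph). The Python sketch below is only
an illustration; the function name is invented, and the two fragments are the
opening characters of the fourth line in each excerpt above.</div>
<pre>
```python
from collections import Counter


def same_letters_different_order(a, b):
    """True when a and b use exactly the same characters but in a different order."""
    return a != b and Counter(a) == Counter(b)


# Opening characters of the same line in the two excerpts above.
pdf_fragment = "usr;f"
pagemaker_fragment = "usrf;"

# True here means the two fragments differ only in character order.
print(same_letters_different_order(pdf_fragment, pagemaker_fragment))
```
</pre>
<div>A check like this only flags where reordering has happened; it says
nothing about which order is correct for the language.</div>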
<div class="gmail_default"><span
style="font-family:garamond,serif;font-size:large"><br>
The guidance I received on another group
was to use either LO Draw or Indesign to
export the text from Pagemaker. I'll
look into LO Draw again, but I don't
have access to an older version of
Indesign (the pagemaker import was
removed in CS6). </span><span
style="font-family:garamond,serif;font-size:large"><br>
</span></div>
</div>
</div>
</div>
</div>
</div>
<div dir="ltr">
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large"><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, May 13,
2019 at 10:40 AM Michael H <<a
href="mailto:cmahte@gmail.com" target="_blank"
moz-do-not-send="true">cmahte@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">I
unzipped the pagemaker file, and when I open
NT_Proverb/Pagemaker (10.1mb), with a Hex
editor, I can 'find' all of the book names,
and see the text there. <br>
<br>
To see the raw text, rename NT_Proverb.pmd
to NT_Proverb.zip and open it with a zip
archive program. The text is in the
Pagemaker file at the top level of the
archive, but encoded with a lot of extraneous
information. (The English text "Matthew"
appears at hex location 7A76972). <br>
<br>
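<div>For anyone without a hex editor, the same kind of search can be
sketched in Python. This assumes the book names are stored as plain ASCII
bytes, as the "Matthew" example suggests; the function names, the path and
the search word are placeholders.</div>
<pre>
```python
def find_offsets(data, needle):
    """Return every offset at which needle occurs in data (both are bytes)."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


def report(path, word):
    """Return the hex offsets of an ASCII word inside a binary file."""
    # path and word are placeholders supplied by the caller.
    with open(path, "rb") as handle:
        data = handle.read()
    return [format(offset, "X") for offset in find_offsets(data, word.encode("ascii"))]
```
</pre>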
When I open the fonts with FontForge,
it suggests the fonts are encoded as
Unicode (but the glyphs are obviously not in
the right spots). <br>
However, when I copy the text (I copied from LO
Draw), paste it into jEdit and save that as
Unicode, reopening the file gives a warning: 'not
unicode, text may be missing'. <br>
<br>
So, what this means is that there are some
glyphs encoded into locations that unicode
treats as control or non-printing codes. The
text needs to be dealt with as a specific
encoding that matches whatever the original
font actually uses. I haven't figured out what
the original text files were encoded with.
Without that knowledge, I'm not sure my system
clipboard or editor (jedit) will properly
respect the glyphs in unusual locations until
the conversion to unicode, and I don't trust
myself to be able to detect if it is or is not
properly converted. <br>
</div>
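<div>A small Python sketch of how one might check a copied text for
characters in control, private-use or unassigned positions, rather than
relying on the editor's warning. It uses the standard unicodedata module; the
set of categories treated as suspect and the function name are assumptions
made for this example.</div>
<pre>
```python
import unicodedata

# General categories treated as suspect here (an assumption for this sketch):
# Cc = control, Co = private use, Cn = unassigned.
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn"}

# Ordinary line breaks and tabs are control characters too, but expected.
ALLOWED = {"\n", "\r", "\t"}


def suspicious_characters(text):
    """Count characters in suspect categories, keyed by hex code point."""
    found = {}
    for ch in text:
        if ch in ALLOWED:
            continue
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            key = format(ord(ch), "04X")
            found[key] = found.get(key, 0) + 1
    return found
```
</pre>
<div>An empty result would not prove that the text is correctly converted,
only that nothing sits in these particular ranges.</div>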
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, May
13, 2019 at 10:11 AM Cyrille <<a
href="mailto:lafricain79@gmail.com"
target="_blank" moz-do-not-send="true">lafricain79@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"> David,<br>
You are probably right about <a
href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit"
target="_blank" moz-do-not-send="true">TECkit</a>:
if we get the text, it will help us to
convert it to Unicode.<br>
As for how to get the text, your method is
beyond my skills :)<br>
If you succeed, please let me know.<br>
<br>
<div
class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-cite-prefix">On
13/05/2019 16:21, David Haslam wrote:<br>
</div>
<blockquote type="cite">
<div>Given the insights from Michael Hart,
it may be feasible to temporarily
rearrange the main text stream as
follows:</div>
<div><br>
</div>
<div>1. Replace every EOL by a horizontal
tab. </div>
<div>2. Insert an EOL after each verse end
character. </div>
<div><br>
</div>
<div>Observe that the above two steps are
wholly reversible such that the original
text stream can be restored later. </div>
<div><br>
</div>
<div>In effect the text stream is now in
verse per line (VPL) layout, albeit
without verse tags. Some adjustments may
be necessary if there are any section
headings, etc. </div>
<div><br>
</div>
<div>3. Add line numbers with the first
number being reset to 1 at the start of
each chapter, numbers incrementing by 1
for each line. </div>
<div>4. Add a left margin USFM verse tag
\v_<br>
</div>
<div><br>
</div>
<div
id="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_mobile_signature_block">
<div>Steps 3&4 can be implemented in
various ways. For my part, I’d use a
bespoke TextPipe filter. </div>
<div><br>
</div>
<div>Another method to consider might be
to use Excel formulae. I recall
resorting to such a method in the
early days of Go Bible. </div>
<div><br>
</div>
<div>Now restore the original layout by
reverting steps 2 & 1, if this is
really necessary. That is, if the
original text layout appeared to be
paragraphed. </div>
<div><br>
</div>
<div>5. Decide how & where to insert
paragraph tags. </div>
<div><br>
</div>
<div>6. Add chapter tags, book ID and
main title tags, etc. </div>
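<div>Steps 1 to 4 above could be sketched in Python roughly as follows.
This is only an illustration under stated assumptions: the original text
contains no tab characters, the verse end string is supplied by the caller,
and the lines passed to tag_chapter belong to a single chapter. The function
names are invented; the tag is written as a backslash, "v", a space and the
verse number.</div>
<pre>
```python
def to_verse_per_line(text, verse_end):
    """Steps 1 and 2: turn a text stream into verse per line (VPL) layout."""
    # Step 1: replace every EOL by a horizontal tab.
    flat = text.replace("\n", "\t")
    # Step 2: insert an EOL after each verse end string.
    return flat.replace(verse_end, verse_end + "\n")


def restore_layout(vpl, verse_end):
    """Revert step 2, then step 1, to recover the original text stream."""
    return vpl.replace(verse_end + "\n", verse_end).replace("\t", "\n")


def tag_chapter(vpl_lines):
    """Steps 3 and 4: number one chapter's lines from 1 and add a USFM verse tag."""
    return ["\\v " + str(n) + " " + line for n, line in enumerate(vpl_lines, start=1)]
```
</pre>
<div>The round trip through to_verse_per_line and restore_layout only
returns the original text if the no-tab assumption holds, so it is worth
checking that before starting.</div>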
<div><br>
</div>
<div>Hope this gives some useful
suggestions that point towards a
practical solution. </div>
<div><br>
</div>
<div>Best regards </div>
<div><br>
</div>
<div>David</div>
<div><br>
</div>
<div><br>
</div>
<div>Sent from ProtonMail Mobile</div>
</div>
<div><br>
</div>
<div><br>
</div>
On Mon, May 13, 2019 at 14:57, Michael H
<<a href="mailto:cmahte@gmail.com"
target="_blank" moz-do-not-send="true">cmahte@gmail.com</a>>
wrote:
<blockquote
class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_quote"
type="cite">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div dir="ltr">
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large">Cyrille<br>
<br>
LibreOffice Draw attempts to
open the Pagemaker file, with
limited success. But it
confirms that even in the
Pagemaker source, the verse
numbers are a separate text
stream. With this source,
there is no way to copy the
text with verse numbers
intact. Each book appears to be
stored in its own text stream
in the Pagemaker file. LO Draw isn't
rendering all of the pages,
only the first 10, so I've
only explored Matthew
further. <br>
<br>
Based on Matthew only, the
verses seem to all end with
the character "-" or ";/",
which should aid in the
reconstruction. I've looked
through the PDF and this seems
to be the case for all books
visually as well. However,
this isn't perfect: I find
1107 of these characters in
Matthew, instead of the
expected 1071 verses. But
since the text stream has a
book introduction, this is
likely easily explained.
Hopefully this gets you well
down the path to creating a
stream with verses. <br>
<br>
I would NOT start from the PDF
file, but from the pagemaker
file. The PDF almost
certainly has a lot of text
rearranging and extra
characters like page numbers
and running heads. Pagemaker
has the book text in a single
stream, in a form that will
convert to unicode relatively
easily. </div>
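<div>The count check described above (1107 markers found against 1071
verses expected, a surplus of 36) could be repeated for other books with a
sketch like this one in Python. The marker strings are the ones quoted above;
the function names are hypothetical, and the expected verse count has to come
from a reference.</div>
<pre>
```python
def count_markers(text, markers):
    """Count non-overlapping occurrences of each marker string in text."""
    return {marker: text.count(marker) for marker in markers}


def surplus(text, markers, expected_verses):
    """Number of markers found beyond the expected verse count (negative if short)."""
    return sum(count_markers(text, markers).values()) - expected_verses
```
</pre>
<div>A positive surplus may come from a book introduction or headings, as
suggested above, so it points to where manual review is needed rather than to
an error in itself.</div>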
<div class="gmail_default"
style="font-family:garamond,serif;font-size:large"><br>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<br>
<fieldset
class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636mimeAttachmentHeader"></fieldset>
<pre class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-quote-pre">_______________________________________________
sword-devel mailing list: <a class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org" target="_blank" moz-do-not-send="true">sword-devel@crosswire.org</a>
<a class="gmail-m_-4094282784364978796gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel" target="_blank" moz-do-not-send="true">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page</pre>
</blockquote>
<br>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
</blockquote>
<div><br>
</div>
<div><br>
</div>
<br>
</blockquote>
<br>
</div>
</blockquote>
</div>
<br>
</blockquote>
<br>
</body>
</html>