<div>If Michael’s observations are anything to go by, then maybe I can script the recovery of chapter &amp; verse tags.&nbsp;</div><div><br></div><div>We shall see ....</div><div><br></div><div>Even if I’m not immediately successful - valuable lessons can be learned in the attempt.&nbsp;</div><div><br></div><div>David</div><div><br></div><div id="protonmail_mobile_signature_block"><div>Sent from ProtonMail Mobile</div></div> <div><br></div><div><br></div>On Tue, May 14, 2019 at 21:21, Cyrille &lt;<a href="mailto:lafricain79@gmail.com" class="">lafricain79@gmail.com</a>&gt; wrote:<blockquote class="protonmail_quote" type="cite">




    OK, thank you!&nbsp; I already have all the text in Unicode, but without
    the verse and chapter numbers... I have begun adding them manually...<br>
    <br>
    <div class="moz-cite-prefix">On 14/05/2019 22:17, David Haslam
      wrote:<br>
    </div>
    <blockquote type="cite">

      <div>Hi&nbsp;Cyrille&nbsp;</div>
      <div><br>
      </div>
      <div>If I can find the time tomorrow or later, I’ll have a look at
        what might be feasible.&nbsp;</div>
      <div><br>
      </div>
      <div>Thanks for all these useful links.&nbsp;</div>
      <div><br>
      </div>
      <div>David</div>
      <div><br>
      </div>
      <div><br>
      </div>
      <div><br>
      </div>
      On Tue, May 14, 2019 at 14:08, Cyrille &lt;<a href="mailto:lafricain79@gmail.com" class="">lafricain79@gmail.com</a>&gt; wrote:
      <blockquote class="protonmail_quote" type="cite"> I am sending my
        message again because the first one was too big.<br>
        <br>
        The conversion to UTF-8 is 99% solved!! I used an online
        converter:<br>
        <a class="moz-txt-link-freetext" href="https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html">https://thanlwinsoft.github.io/www.thanlwinsoft.org/ThanLwinSoft/MyanmarUnicode/Conversion/myanmarConverter.html</a><br>
        or:<br>
        <a class="moz-txt-link-freetext" href="http://burglish.my-mm.org/latest/trunk/web/fontconv.htm">http://burglish.my-mm.org/latest/trunk/web/fontconv.htm</a><br>
        <br>
        See the result <a href="https://framadrop.org/r/jKnYnvuQIH#mE+FWcvzD1N/Omnfr7uWMZmI/HZUUVPdvnVVkBFyFrA=">here</a>.<br>
        <br>
        Now the only problem is how to get the verse and chapter
        numbers... <br>
        <br>
        <br>
        <div class="moz-cite-prefix">On 14/05/2019 13:53, Michael H
          wrote:<br>
        </div>
        <blockquote type="cite">
          <div dir="ltr">
            <div dir="ltr">
              <div dir="ltr">
                <div class="gmail_default"><font size="4" face="garamond,
 serif">Cyrille, (Peter),&nbsp;<br>
                    <br>
                    Maybe further discussion on this belongs in Gitlab
                    as issues.&nbsp; Can I get added to this project?&nbsp;<br>
                    <br>
                    Here are the first few lines of Matthew copied from
                    the PDF:&nbsp;</font><br>
                  ------<br>
                  <div class="gmail_default" style="font-family:garamond,serif;font-size:large">&amp;Sifrmaw;OD;
                    {0Ha*vdusrf;</div>
                  <div class="gmail_default" style="font-family:garamond,serif;font-size:large">The
                    Gospel According to Matthew</div>
                  <div class="gmail_default" style="font-family:garamond,serif;font-size:large">ed'gef;</div>
                  <div class="gmail_default" style="font-family:garamond,serif;font-size:large">usr;f
                    ûyy*k Kd¾v f &amp;iS rf maw;O;D \b0rwS wf r;f</div>
                  <div class="gmail_default" style="font-family:garamond,serif;font-size:large">usr;f
                    ûyy*k Kd¾v f &amp;iS rf maw;O;Don f *gavav;,e,rf
                    S*sL;vrl sK;d tmvaf z;O;D \om;jzp\f / (rmu k2;14)</div>
                  <div class="gmail_default" style="font-family:garamond,serif;font-size:large">olonf
                    tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
                    a,Zl;ocif\aemufvdkufwynfhrjzpfrD ol\trnfrSm</div>
                  <div class="gmail_default" style="font-family:garamond,serif;font-size:large">av0djzp\f
                    / ool n f wad b;&amp;,d tidk tf e;DwGi f a,Z;lociEf
                    iS ahf wG U Ny;D<br>
                    <br>
                  </div>
                  <div class="gmail_default" style="font-family:garamond,serif;font-size:large">-----</div>
                  <div class="gmail_default"><font size="4" face="garamond,
 serif">And here are the first
                      few lines of Matthew copied from the Pagemaker
                      file:&nbsp;</font></div>
                  <div class="gmail_default"><font size="4" face="garamond,
 serif">-----<br>
                    </font>
                    <div class="gmail_default"><font size="4" face="garamond, serif">Sifrmaw;OD; {0Ha*vdusrf;</font></div>
                    <div class="gmail_default"><font size="4" face="garamond, serif">The Gospel According to
                        Matthew</font></div>
                    <div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">ed'gef;</span><br>
                    </div>
                    <div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf&nbsp;
                        &amp;Sifrmaw;OD;\b0rSwfwrf;&nbsp;&nbsp;</span><br>
                    </div>
                    <div class="gmail_default"><span style="font-family:garamond,serif;font-size:large">usrf;�yyk*�dKvf&nbsp;
                        &amp;Sifrmaw;OD;onf&nbsp; *gavav;,e,frS *sL;vlrsKd;
                        tmvfaz;OD;\om;jzpf\/ (rmuk 2;14) olonf&nbsp;
                        tcGefcHoltjzpf trIxrf;chJonf/ (vk 5;27)
                        a,Zl;ocif\aemufvdkufwynfhrjzpfrD&nbsp; ol\trnfrSm
                        av0djzpf\/ olonf&nbsp; wdab;&amp;d,tkdifteD;wGif&nbsp;
                        a,Zl;ocifESifhawGU&nbsp; NyD;<br>
                        <br>
                        <br>
                        You can see that some letters have changed, and
                        some others are in a different order.&nbsp;<br>
                        <br>
                      </span><span style="font-family:garamond,serif;font-size:large">The
                        letters that change are likely those points that
                        aren't compatible with unicode, and pagemaker
                        reassigned them to ensure that the file is more
                        widely viewable. Since a conversion is already
                        planned, these won't matter as much, but the
                        font embedded in the PDF is different from the
                        font attached to the pagemaker file.&nbsp; If you do
                        start from the PDF, you'll need to extract the
                        font to get the code points.&nbsp;</span><br style="font-family:garamond,serif;font-size:large">
                      <span style="font-family:garamond,serif;font-size:large"><br>
                        The problem is that the PDF export from
                        pagemaker sorts the letters into the order they
                        appear on the page.&nbsp; Burmese text has Indian
                        style ligatures, where vowels tend to jump over
                        or under the previous letters, sometimes back two
                        or three letters. If you study the snippets
                        above from the beginning of Matthew, you can
                        see that the letter order differs and that
                        some glyphs are modified.&nbsp;<br>
                        <br>
                        So, from the PDF letters are out of order, but
                        from Pagemaker, letters are encoded into control
                        points. Fixing the control points is easy and
                        happens with the unicode conversion.&nbsp; Fixing the
                        letter order is not easy. You'll need a first
                        language speaker and plenty of time.&nbsp;</span></div>
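                    <div class="gmail_default">To make the ordering problem concrete, here is a minimal Python sketch of one reordering rule. It assumes the text has already been mapped to Unicode code points but is still in visual order, and it handles only the simplest case: the vowel sign E (U+1031), which is drawn to the left of its consonant but stored after it in Unicode. The names VOWEL_E, CONSONANT and reorder_vowel_e are illustrative only; medials, stacked consonants and the other cases described above are not handled, which is why a fluent reader would still be needed.</div>

```python
import re

# Assumption: input is Unicode Myanmar text that is still in visual
# (left-to-right as drawn) order, not yet in Unicode logical order.
VOWEL_E = "\u1031"              # MYANMAR VOWEL SIGN E
CONSONANT = "[\u1000-\u1021]"   # MYANMAR LETTER KA .. MYANMAR LETTER A


def reorder_vowel_e(text):
    """Move a vowel sign E that precedes a consonant to just after it.

    Only the bare "E + consonant" case is covered; a medial sign between
    the two, or any other out-of-order vowel, is left untouched.
    """
    return re.sub(f"{VOWEL_E}({CONSONANT})", rf"\1{VOWEL_E}", text)
```

                    <div class="gmail_default">This should only be run on visual-order text: applied to text that is already in logical order, it would move a correctly placed vowel onto the following consonant.</div>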
                    <div class="gmail_default"><span style="font-family:garamond,serif;font-size:large"><br>
                        The guidance I received on another group was to
                        use either LO Draw or Indesign to export the
                        text from Pagemaker.&nbsp; I'll look into LO Draw
                        again, but I don't have access to an older
                        version of Indesign (the pagemaker import was
                        removed in CS6).&nbsp;</span><span style="font-family:garamond,serif;font-size:large"><br>
                      </span></div>
                  </div>
                </div>
              </div>
            </div>
          </div>
          <div dir="ltr">
            <div class="gmail_default" style="font-family:garamond,serif;font-size:large"><br>
            </div>
          </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Mon, May 13, 2019 at
              10:40 AM Michael H &lt;<a href="mailto:cmahte@gmail.com">cmahte@gmail.com</a>&gt; wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px
              0px
 0.8ex;border-left:1px solid
              rgb(204,204,204);padding-left:1ex">
              <div dir="ltr">
                <div class="gmail_default" style="font-family:garamond,serif;font-size:large">I
                  unzipped the pagemaker file, and when I open
                  NT_Proverb/Pagemaker (10.1mb), with a Hex editor, I
                  can 'find' all of the book names, and see the text
                  there.&nbsp;&nbsp;<br>
                  <br>
                  To see the raw text: rename NT_Proverb.pmd &gt;
                  NT_Proverb.zip and open it with a zip archive
                  program.&nbsp; The text is in the Pagemaker file at the
                  top level of the archive, but encoded with a lot of
                  extraneous information.&nbsp; (The English text "Matthew"
                  appears at hex location 7A76972).&nbsp;<br>
                  <br>
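                  The same lookup could be scripted rather than done in a hex editor. This is only a sketch: it assumes, as described above, that the renamed file opens as a zip archive and that the member is called "Pagemaker"; find_offsets and offsets_in_member are hypothetical helper names.<br>
                  <br>

```python
import zipfile


def find_offsets(data, needle):
    """Return every byte offset at which `needle` occurs in `data`."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


def offsets_in_member(archive, member, needle):
    """Search one member of a zip archive for a byte string.

    `archive` may be a path or a file-like object; `member` is the name
    of the file inside the archive (assumed here, not verified).
    """
    with zipfile.ZipFile(archive) as zf:
        return find_offsets(zf.read(member), needle)
```

                  For example, offsets_in_member("NT_Proverb.zip", "Pagemaker", b"Matthew") would list the positions of the English book name, and hex() turns each position into the form a hex editor shows.<br>
                  <br>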
                  When I open the fonts with fontforge, Fontforge
                  suggests the fonts are encoded as unicode (but the
                  glyphs are obviously not in the right spot.)&nbsp;<br>
                  However when I copy the text (I copied from LO Draw)
                  and paste it into jedit and save that as unicode:
                  Reopening the file has a warning 'not unicode, text
                  may be missing'.&nbsp;<br>
                  <br>
                  So, what this means is that there are some glyphs
                  encoded into locations that unicode treats as control
                  or non-printing codes. The text needs to be dealt with
                  as a specific encoding that matches whatever the
                  original font actually uses. I haven't figured out
                  what the original text files were encoded with.
                  Without that knowledge, I'm not sure my system
                  clipboard or editor (jedit) will properly respect the
                  glyphs in unusual locations until the conversion to
                  unicode, and I don't trust myself to be able to detect
                  if it is or is not properly converted.&nbsp;<br>
                </div>
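                <div class="gmail_default" style="font-family:garamond,serif;font-size:large">One way to check a pasted or saved copy without relying on the eye is to count the characters that fall in Unicode's control, format, private-use or unassigned categories. The sketch below uses only Python's standard unicodedata module; the function name and the choice of categories are my own assumptions about what counts as suspicious here.</div>

```python
import unicodedata
from collections import Counter

# General categories treated as suspicious in ordinary running text:
# Cc = control, Cf = format, Co = private use, Cn = unassigned.
SUSPICIOUS = {"Cc", "Cf", "Co", "Cn"}
ORDINARY_WHITESPACE = {"\n", "\r", "\t"}


def suspicious_code_points(text):
    """Count characters whose code points are control or non-printing."""
    hits = Counter()
    for ch in text:
        if ch in ORDINARY_WHITESPACE:
            continue
        if unicodedata.category(ch) in SUSPICIOUS:
            hits[f"U+{ord(ch):04X}"] += 1
    return hits
```

                <div class="gmail_default" style="font-family:garamond,serif;font-size:large">If the count changes between the copy taken from LO Draw and the file reopened in the editor, some glyphs were probably dropped or altered along the way.</div>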
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Mon, May 13, 2019
                  at 10:11 AM Cyrille &lt;<a href="mailto:lafricain79@gmail.com">lafricain79@gmail.com</a>&gt;
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px
                  0px
 0.8ex;border-left:1px solid

                  rgb(204,204,204);padding-left:1ex">
                  <div bgcolor="#FFFFFF"> David,<br>
                    Probably you are right about <a href="http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&amp;cat_id=TECkit">TECkit</a>; once we get the
                    text, it will help us convert it to Unicode.<br>
                    As for how to get the text, your method is beyond my
                    skills :)<br>
                    If you succeed, please let me know.<br>
                    <br>
                    <div class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-cite-prefix">On
                      13/05/2019 16:21, David Haslam wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div>Given the insights from Michael Hart, it may
                        be feasible to temporarily rearrange the main
                        text stream as follows :</div>
                      <div><br>
                      </div>
                      <div>1. Replace every EOL by a horizontal tab.&nbsp;</div>
                      <div>2. Insert an EOL after each verse end
                        character.&nbsp;</div>
                      <div><br>
                      </div>
                      <div>Observe that the above two steps are
                        wholly&nbsp;reversible such that the original text
                        stream can be restored later.&nbsp;</div>
                      <div><br>
                      </div>
                      <div>In effect the text stream is now in verse per
                        line (VPL) layout, albeit without verse tags.
                        Some adjustments may be necessary if there are any
                        section headings, etc.&nbsp;</div>
                      <div><br>
                      </div>
                      <div>3. Add line numbers with the first number
                        being reset to 1 at the start of each chapter,
                        numbers incrementing by 1 for each line.&nbsp;</div>
                      <div>4. Add a left margin USFM verse tag \v_<br>
                      </div>
                      <div><br>
                      </div>
                      <div id="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_mobile_signature_block">
                        <div>Steps 3&amp;4 can be implemented in various
                          ways. For my part, I’d use a bespoke TextPipe
                          filter.&nbsp;</div>
                        <div><br>
                        </div>
                        <div>Another method to consider might be to use
                          Excel formulae. I recall resorting to such a
                          method in the early days of Go Bible.&nbsp;</div>
                        <div><br>
                        </div>
                        <div>Now restore the original layout by
                          reverting steps 2 &amp; 1, if this is really
                          necessary. That is, if the original text
                          layout appeared to be paragraphed.&nbsp;</div>
                        <div><br>
                        </div>
                        <div>5. Decide how &amp; where to insert
                          paragraph tags.&nbsp;</div>
                        <div><br>
                        </div>
                        <div>6. Add chapter tags, book ID and main title
                          tags, etc.&nbsp;</div>
                        <div><br>
                        </div>
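                        <div>Steps 1 to 4 above could also be sketched in Python rather than TextPipe or Excel. This is only an illustration under stated assumptions: the verse-end character is passed in as a parameter (the default "/" is a guess), the stream contains no tab characters of its own, and the number of verses in each chapter is supplied from a reference versification. All function and parameter names are hypothetical.</div>

```python
def to_verse_per_line(text, verse_end="/"):
    """Steps 1 and 2: EOL becomes a tab, then an EOL follows each verse end."""
    flat = text.replace("\n", "\t")
    flat = flat.replace(verse_end, verse_end + "\n")
    return flat.split("\n")


def restore_layout(lines):
    """Revert steps 2 and 1 (valid only if the input had no tabs)."""
    return "".join(lines).replace("\t", "\n")


def tag_verses(lines, verses_per_chapter):
    """Steps 3 and 4: prefix each verse line with a USFM \\v tag.

    Numbering restarts at 1 for each chapter. Blank lines, and any lines
    beyond the expected total, are passed through untagged.
    """
    numbers = (v for count in verses_per_chapter for v in range(1, count + 1))
    tagged = []
    for line in lines:
        body = line.lstrip()
        number = next(numbers, None) if body else None
        if number is None:
            tagged.append(line)
            continue
        lead = line[:len(line) - len(body)]  # keep leading tab or space
        tagged.append(f"{lead}\\v {number} {body}")
    return tagged
```

                        <div>Because the tag is placed after any leading tab, calling restore_layout on the tagged lines gives back the original paragraph layout with the verse tags in place.</div>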
                        <div>Hope this gives some useful suggestions
                          that point towards a practical solution.&nbsp;</div>
                        <div><br>
                        </div>
                        <div>Best regards&nbsp;</div>
                        <div><br>
                        </div>
                        <div>David</div>
                        <div><br>
                        </div>
                        <div><br>
                        </div>
                      </div>
                      <div><br>
                      </div>
                      <div><br>
                      </div>
                      On Mon, May 13, 2019 at 14:57, Michael H &lt;<a href="mailto:cmahte@gmail.com">cmahte@gmail.com</a>&gt;
                      wrote:
                      <blockquote class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636protonmail_quote" type="cite">
                        <div dir="ltr">
                          <div dir="ltr">
                            <div dir="ltr">
                              <div dir="ltr">
                                <div class="gmail_default" style="font-family:garamond,serif;font-size:large">Cyrille<br>
                                  <br>
                                  LibreOffice Draw attempts to open the
                                  pagemaker file, with limited success.
                                  But it confirms that even in the
                                  pagemaker source, the verse numbers
                                  are a separate text stream. With this
                                  source, there is no way to copy the
                                  text with verse numbers intact. Each
                                  book appears to be stored in
                                  its own text stream in the pagemaker
                                  file. LO Draw isn't rendering all of
                                  the pages, only the first 10, so I've
                                  only explored Matthew further.&nbsp;<br>
                                  <br>
                                  Based on Matthew only, the verses seem
                                  to all end with the character "-" or
                                  ";/", which should aid in the
                                  reconstruction. I've looked through
                                  the PDF and this seems to be the case
                                  for all books visually as well.
                                  However, this isn't perfect: I find
                                  1107 of these characters in Matthew,
                                  instead of the expected 1071 verses.&nbsp;
                                  But since the text stream has a book
                                  introduction, this is likely easily
                                  explained. Hopefully this gets you
                                  well down the path to creating a
                                  stream with verses.&nbsp;<br>
                                  <br>
                                  I would NOT start from the PDF file,
                                  but from the pagemaker file.&nbsp; The PDF
                                  almost certainly has a lot of text
                                  rearranging and extra characters like
                                  page numbers and running heads.&nbsp;
                                  Pagemaker has the book text in a
                                  single stream, in a form that will
                                  convert to unicode relatively easily.&nbsp;</div>
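                                <div class="gmail_default" style="font-family:garamond,serif;font-size:large">The marker count mentioned above (1107 found against 1071 expected, a surplus of 36) is easy to repeat for each book once its text stream is extracted. A minimal Python sketch, assuming the two verse-end markers quoted above and using hypothetical function names:</div>

```python
def count_verse_ends(text, markers=("-", ";/")):
    """Count each candidate verse-end marker in a book's text stream."""
    return {marker: text.count(marker) for marker in markers}


def marker_surplus(text, expected_verses, markers=("-", ";/")):
    """Markers found minus verses expected.

    A positive result suggests extra material such as a book
    introduction; a negative one suggests verses without a marker.
    """
    found = sum(count_verse_ends(text, markers).values())
    return found - expected_verses
```

                                <div class="gmail_default" style="font-family:garamond,serif;font-size:large">The expected verse count per book has to come from a reference versification; it is not something the text stream itself can supply.</div>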
                                <div class="gmail_default" style="font-family:garamond,serif;font-size:large"><br>
                                </div>
                              </div>
                            </div>
                          </div>
                        </div>
                      </blockquote>
                      <div><br>
                      </div>
                      <div><br>
                      </div>
                      <br>
                      <fieldset class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636mimeAttachmentHeader"></fieldset>
                      <pre class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-quote-pre">_______________________________________________
sword-devel mailing list: <a class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-abbreviated" href="mailto:sword-devel@crosswire.org">sword-devel@crosswire.org</a>
<a class="gmail-m_3757925966681618317gmail-m_-6550991463107192144gmail-m_-2496802141858019636moz-txt-link-freetext" href="http://www.crosswire.org/mailman/listinfo/sword-devel">http://www.crosswire.org/mailman/listinfo/sword-devel</a>
Instructions to unsubscribe/change your settings at above page</pre>
                    </blockquote>
                    <br>
                  </div>
                  </blockquote>
              </div>
            </blockquote>
          </div>
          <br>
        </blockquote>
        <br>
      </blockquote>
      <div><br>
      </div>
      <div><br>
      </div>
      <br>
    </blockquote>
    <br>


</blockquote><div><br></div><div><br></div>