[sword-devel] Bible in Myanmar

Cyrille lafricain79 at gmail.com
Tue May 14 00:12:52 MST 2019


Yesterday I thought that if a PDF tool makes it possible to cut the PDF
in the middle, then a raw conversion to txt becomes possible, and then we
only need to convert it to UTF-8.
Any ideas?

On 13/05/2019 17:40, Michael H wrote:
> I unzipped the pagemaker file, and when I open NT_Proverb/Pagemaker
> (10.1mb), with a Hex editor, I can 'find' all of the book names, and
> see the text there.  
>
> To see the raw text: rename NT_Proverb.pmd > NT_Proverb.zip and open
> it with a zip archive program.  The text is in the Pagemaker file at
> the top level of the archive, but encoded with a lot of extraneous
> information.  (The English text "Matthew" appears at hex location
> 7A76972). 
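The search described above can also be scripted instead of done by hand in a hex editor. This is only a minimal sketch: it assumes the renamed file really opens as a zip archive (as reported above) and that the English book names are stored as plain ASCII bytes. The function name `find_bytes_in_archive` and the stand-in archive in the demo are made up for illustration.

```python
import io
import zipfile


def find_bytes_in_archive(archive, needle):
    """Return (member name, byte offset) for every hit of `needle` in every member."""
    hits = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            data = zf.read(info)  # decompressed bytes of this member
            start = data.find(needle)
            while start != -1:
                hits.append((info.filename, start))
                start = data.find(needle, start + 1)
    return hits


if __name__ == "__main__":
    # Stand-in archive built in memory; replace `buf` with the path to the
    # renamed NT_Proverb.zip to try it on the real file.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("Pagemaker", b"\x00\x01junkMatthew\x00more")
    for name, offset in find_bytes_in_archive(buf, b"Matthew"):
        print(f"{name}: offset 0x{offset:X}")
```

The offsets reported are relative to the start of each decompressed member, so they need not match offsets seen when opening the whole archive in a hex editor.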
>
> When I open the fonts with fontforge, Fontforge suggests the fonts are
> encoded as unicode (but the glyphs are obviously not in the right spot.) 
> However when I copy the text (I copied from LO Draw) and paste it into
> jedit and save that as unicode: Reopening the file has a warning 'not
> unicode, text may be missing'. 
>
> So, what this means is that there are some glyphs encoded into
> locations that unicode treats as control or non-printing codes. The
> text needs to be dealt with as a specific encoding that matches
> whatever the original font actually uses. I haven't figured out what
> the original text files were encoded with. Without that knowledge, I'm
> not sure my system clipboard or editor (jedit) will properly respect
> the glyphs in unusual locations until the conversion to unicode, and I
> don't trust myself to be able to detect if it is or is not properly
> converted. 
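One way to check whether pasted text contains glyphs sitting on control or non-printing locations is to tally the code points by Unicode general category. The sketch below assumes the text has already been read into a Python string; the helper name `suspicious_codepoints` is hypothetical. It flags every character whose category starts with "C" (control, format, surrogate, private use, unassigned), apart from ordinary line breaks and tabs.

```python
import unicodedata
from collections import Counter

# Whitespace controls that are expected in any text file.
ALLOWED = {"\n", "\r", "\t"}


def suspicious_codepoints(text):
    """Count characters whose Unicode general category starts with 'C'."""
    counts = Counter()
    for ch in text:
        if ch in ALLOWED:
            continue
        if unicodedata.category(ch).startswith("C"):
            counts[f"U+{ord(ch):04X}"] += 1
    return counts
```

An empty result does not prove the text survived the clipboard intact; it only shows that no character landed on a control or unassigned code point.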
>
> On Mon, May 13, 2019 at 10:11 AM Cyrille <lafricain79 at gmail.com>
> wrote:
>
>     David,
>     Probably you are right about TECkit
>     <http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit>,
>     if we get the text it will help us to convert it to Unicode.
>     As for how to get the text, your method is beyond my skills :)
>     If you succeed, please let me know.
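To illustrate what a font-specific conversion involves at its simplest, here is a table-driven sketch. It is not TECkit and not the real mapping for this font: the two entries in `LEGACY_TO_UNICODE` are placeholders, and the actual table would have to come from inspecting the font. Legacy Myanmar fonts typically also need reordering rules, which a plain lookup like this does not attempt.

```python
# HYPOTHETICAL mapping: placeholder pairs only, not the real layout of the font.
LEGACY_TO_UNICODE = {
    "u": "\u1000",  # MYANMAR LETTER KA
    "c": "\u1001",  # MYANMAR LETTER KHA
}


def convert_legacy(text, table=LEGACY_TO_UNICODE):
    """Map each character through `table`; return (converted text, unmapped characters)."""
    converted = []
    unmapped = set()
    for ch in text:
        if ch in table:
            converted.append(table[ch])
        else:
            converted.append(ch)  # keep it visible so gaps in the table can be found
            if not ch.isspace():
                unmapped.add(ch)
    return "".join(converted), unmapped
```

Returning the unmapped characters makes it easy to see which glyph positions still need an entry in the table.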
>
>     On 13/05/2019 16:21, David Haslam wrote:
>>     Given the insights from Michael Hart, it may be feasible to
>>     temporarily rearrange the main text stream as follows :
>>
>>     1. Replace every EOL by a horizontal tab. 
>>     2. Insert an EOL after each verse end character. 
>>
>>     Observe that the above two steps are wholly reversible such that
>>     the original text stream can be restored later. 
>>
>>     In effect the text stream is now in verse per line (VPL) layout,
>>     albeit without verse tags. Some adjustments may be necessary if
>>     there are any section headings, etc. 
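Steps 1 and 2 and their reversal could be sketched as follows. The assumptions: line ends are "\n", the original stream contains no tab characters (otherwise the round trip is not exact), and the verse-end characters are the "-" and ";/" reported for Matthew in this thread. The function names are made up for illustration.

```python
import re


def to_verse_per_line(text, markers=("-", ";/")):
    """Step 1: EOL -> tab. Step 2: insert an EOL after each verse-end marker."""
    if "\t" in text:
        raise ValueError("text already contains tabs; the steps would not be reversible")
    flat = text.replace("\n", "\t")
    # Try longer markers first so a multi-character marker is matched whole.
    pattern = "|".join(re.escape(m) for m in sorted(markers, key=len, reverse=True))
    return re.sub(f"({pattern})", r"\1\n", flat)


def from_verse_per_line(vpl):
    """Undo step 2 (drop the inserted EOLs), then step 1 (tab -> EOL)."""
    return vpl.replace("\n", "").replace("\t", "\n")
```

The reversal only holds while no newlines or tabs are added or removed in between; tags added at the start of a line would need to be stripped first.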
>>
>>     3. Add line numbers with the first number being reset to 1 at the
>>     start of each chapter, numbers incrementing by 1 for each line. 
>>     4. Add a left margin USFM verse tag \v_
>>
>>     Steps 3&4 can be implemented in various ways. For my part, I’d
>>     use a bespoke TextPipe filter. 
>>
>>     Another method to consider might be to use Excel formulae. I
>>     recall resorting to such a method in the early days of Go Bible. 
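Steps 3 and 4 could also be done with a few lines of script. This sketch assumes the text is already one verse per line with no headings or introduction left in, and that the number of verses in each chapter is known from elsewhere; `verses_per_chapter` is a hypothetical input, since the text stream itself carries no chapter boundaries. The "_" in "\v_" is taken to mean a space.

```python
def add_verse_tags(lines, verses_per_chapter):
    """Prefix each line with a USFM verse tag, restarting at 1 for each chapter."""
    expected = sum(verses_per_chapter)
    if len(lines) != expected:
        raise ValueError(f"expected {expected} verse lines, got {len(lines)}")
    tagged = []
    remaining = iter(lines)
    for count in verses_per_chapter:
        for number in range(1, count + 1):
            tagged.append(f"\\v {number} {next(remaining)}")
    return tagged
```

The length check is deliberate: a mismatch between the line count and the expected verse count is the first sign that a heading or a stray verse-end character is still in the stream.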
>>
>>     Now restore the original layout by reverting steps 2 & 1, if this
>>     is really necessary. That is, if the original text layout
>>     appeared to be paragraphed. 
>>
>>     5. Decide how & where to insert paragraph tags. 
>>
>>     6. Add chapter tags, book ID and main title tags, etc. 
>>
>>     Hope this gives some useful suggestions that point towards a
>>     practical solution. 
>>
>>     Best regards 
>>
>>     David
>>
>>     On Mon, May 13, 2019 at 14:57, Michael H <cmahte at gmail.com>
>>     wrote:
>>>     Cyrille
>>>
>>>     LibreOffice Draw attempts to open the pagemaker file, with
>>>     limited success. But it confirms that even in the pagemaker
>>>     source, the verse numbers are a separate text stream. With this
>>>     source, there is no way to copy the text with verse numbers
>>>     intact. Each book appears to be stored in its own text stream in
>>>     the PageMaker file. LO Draw isn't rendering all of the pages,
>>>     only the first 10, so I've only explored Matthew further. 
>>>
>>>     Based on Matthew only, the verses seem to all end with the
>>>     character "-" or ";/", which should aid in the reconstruction.
>>>     I've looked through the PDF and this seems to be the case for
>>>     all books visually as well. However, this isn't perfect: I find
>>>     1107 of these characters in Matthew, instead of the expected
>>>     1071 verses.  But since the text stream has a book introduction,
>>>     this is likely easily explained. Hopefully this gets you well
>>>     down the path to creating a stream with verses. 
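The count comparison above is easy to repeat for other books once their text streams are extracted. A small sketch, assuming the stream is available as a string and that "-" and ";/" are the only verse-end markers; `marker_surplus` is a hypothetical helper name.

```python
def count_verse_ends(text, markers=("-", ";/")):
    """Return how many times each verse-end marker occurs in the text."""
    return {marker: text.count(marker) for marker in markers}


def marker_surplus(text, expected_verses, markers=("-", ";/")):
    """Markers found minus verses expected; positive means extra markers."""
    return sum(count_verse_ends(text, markers).values()) - expected_verses
```

A positive surplus, like the 1107 against 1071 found for Matthew, points at markers that occur outside verse ends, for example in the book introduction.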
>>>
>>>     I would NOT start from the PDF file, but from the pagemaker
>>>     file.  The PDF almost certainly has a lot of text rearranging
>>>     and extra characters like page numbers and running heads. 
>>>     Pagemaker has the book text in a single stream, in a form that
>>>     will convert to unicode relatively easily. 
>>>
>>
>>
>>
>>     _______________________________________________
>>     sword-devel mailing list: sword-devel at crosswire.org <mailto:sword-devel at crosswire.org>
>>     http://www.crosswire.org/mailman/listinfo/sword-devel
>>     Instructions to unsubscribe/change your settings at above page
>
