[sword-devel] EMTV text source URL is now unrelated

Wed Oct 12 12:41:42 MST 2011

On Wed, Oct 12, 2011 at 2:18 PM, David Haslam <dfhmch at googlemail.com> wrote:
> Hi Troy,
>
> Yes - you're probably right about lack of a readily available tool for
> direct conversion.
>
> Had I been tackling the task, I might have considered these steps:
>
> 1. Open each HTML file using MS Word, save each file as RTF.
> 2. Open each RTF file using WordPad, save again as RTF (smaller and simpler
> file structure).
> 3. Create & run a script to process the RTF tags for italics attribute and
> for red font colour.
> 4. Open the processed RTF files using WordPad, save as Unicode text
> (encoded as UTF-16 LE).
> 5. Use a suitable editor to open the Unicode text files and change encoding
> to UTF-8 (without BOM).

This seems far more complicated than it needs to be, and filtering
HTML through MS Word is probably a terrible idea.  We frequently talk
about format-shifting and the information loss that results from it.
Every programming language a person is likely to know has a library
for parsing HTML directly in some fashion. If you have any knowledge
of scripting and coding, it is probably a much better idea to leverage
one of those and go directly from HTML to OSIS.  I have done this
at least twice now, and with only a small amount of work you can adapt
such a script to process any source text in a given source format.
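
As a rough illustration of the direct route, here is a minimal sketch
using Python's standard html.parser module. It assumes, hypothetically,
that the source marks italics with <i> or <em> and red letters with
<font color="red">; the real EMTV pages may use different constructs, so
the tag tests would need adjusting. The OSIS targets chosen here
(<hi type="italic"> and <q who="Jesus">) are my own choice for the
example, not something taken from an existing conversion script.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class InlineToOsis(HTMLParser):
    """Map italic and red-font HTML markup to OSIS inline elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # pieces of OSIS output, joined at the end
        self.stack = []  # (html tag, OSIS closing tag) for tracked elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("i", "em"):
            self.out.append('<hi type="italic">')
            self.stack.append((tag, "</hi>"))
        elif tag == "font":
            # Assumption: red letters appear as color="red" or "#ff0000".
            is_red = (attrs.get("color") or "").lower() in ("red", "#ff0000")
            if is_red:
                self.out.append('<q who="Jesus">')
            self.stack.append((tag, "</q>" if is_red else ""))

    def handle_endtag(self, tag):
        # Only close elements we opened; every other tag is dropped.
        if self.stack and self.stack[-1][0] == tag:
            self.out.append(self.stack.pop()[1])

    def handle_data(self, data):
        self.out.append(escape(data))  # keep the output well-formed XML


def html_to_osis_inline(fragment):
    parser = InlineToOsis()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


print(html_to_osis_inline('<font color="red">I <i>am</i></font> he'))
```

Everything the script does not recognise is dropped while its text is
kept, so the italics and red letters survive without a detour through
RTF.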

With Wycliffe we have two source formats, both proprietary SGML
formats akin to HTML. We wrote parsing scripts using well-established
SGML and XML formatting tools and are able to leverage them for
automated processing of around 800 different source texts. Moreover,
most scripting languages have a simple mechanism that will do the
encoding conversion as well.  In Python, a single line in the script is
sufficient to convert from any given source encoding into UTF-8.
Assume that the variable 'text' contains the source as a byte string
in encoding 'enc'. Just execute
text = text.decode(enc).encode('utf-8')
and you're done. The SWORD library has similar functionality in SWBuf,
and I'm fairly sure Perl has similar abilities.
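
To make that concrete, here is a small sketch of the same one-liner
wrapped in a function. The name to_utf8 is made up for the example, and
'data' is assumed to be a byte string read from the source file. Python's
'utf-16' codec consumes the byte order mark when decoding, and the
'utf-8' codec does not write one, so this also covers the UTF-16 to
UTF-8 (without BOM) case.

```python
def to_utf8(data, enc):
    """Re-encode the bytes in `data` from encoding `enc` to UTF-8 (no BOM)."""
    return data.decode(enc).encode("utf-8")


# UTF-16 input with a BOM, as WordPad's "Unicode text" would produce.
utf16_bytes = "Ἰησοῦς".encode("utf-16")
utf8_bytes = to_utf8(utf16_bytes, "utf-16")
print(utf8_bytes.decode("utf-8"))
```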

All in all, you're much better off creating a script that goes straight
from the source markup (HTML in this case) into OSIS. Yes, you'd
need to create a new script for each source, as each one will use
different HTML constructs, but a single script could be used, for
instance, to lift all the translations on Biblegateway into a person's
local repository. A single script could run through his website,
scrape it, and dump it into an OSIS text with little effort. The markup
format is simple and readily handled by many HTML loading/parsing
libraries.
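
For the "dump it into an OSIS text" half, a sketch along these lines
may help. It assumes the scraping step has already produced
(book, chapter, verse, text) tuples in canonical order with plain verse
text; the function name and the work name "EMTV" are placeholders. It
builds only the nested book/chapter/verse elements and leaves out the
OSIS header, which a real module would need.

```python
import xml.etree.ElementTree as ET


def build_osis_body(work, verses):
    """Group (book, chapter, verse, text) tuples into nested OSIS elements."""
    osis_text = ET.Element("osisText", {"osisIDWork": work})
    book_el = chapter_el = None
    current = (None, None)  # (book, chapter) of the open chapter element
    for book, chapter, verse, text in verses:
        if book != current[0]:
            # New book: open a <div type="book"> container.
            book_el = ET.SubElement(
                osis_text, "div", {"type": "book", "osisID": book})
        if (book, chapter) != current:
            chapter_el = ET.SubElement(
                book_el, "chapter", {"osisID": f"{book}.{chapter}"})
            current = (book, chapter)
        verse_el = ET.SubElement(
            chapter_el, "verse", {"osisID": f"{book}.{chapter}.{verse}"})
        verse_el.text = text  # ElementTree escapes &, < and > on output
    return ET.tostring(osis_text, encoding="unicode")


sample = [("Matt", 1, 1, "first verse"), ("Matt", 1, 2, "second verse"),
          ("Matt", 2, 1, "third verse")]
print(build_osis_body("EMTV", sample))
```

Only the code that produces the tuples has to change from one source
site to the next; this part stays the same.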

--Greg

>
> After step 5 you'd have something similar to where you began converting
> plain text to OSIS, but with some ingenuity at step 3, you'd also have some
> elementary markup for italics and red letters that survives the complete
> loss of formatting attributes at step 4.
>
> During my Go Bible activities, I've used this approach more times than I can
> recall.
>
> /The steepest part of the learning curve is getting used to the format of
> RTF files when viewed by an ordinary text editor/.
>
> After step 5, it's often simpler to do the next conversion to USFM, and then
> use usfm2osis.pl
>
> Best regards,
> David