[sword-devel] Tool for convertion html to osis

Michael H cmahte at gmail.com
Sat Feb 2 05:17:27 MST 2019


For a one-off conversion project that won't be repeated, such as a non-XML,
stylesheet-driven HTML file:

I'd suggest investing in a good editor that can search and replace across
multiple files and that has a macro language.

Working with multi-file search:
0. Strip the non-text sections (scripts, sidebars, menus).
1. Inventory your HTML markup.
2. Map the stylesheet markup tags to USFM tags.
3. Convert.
4. Understand what's left (HTML entities, open commentary that needs
tagging, useless HTML, etc.).

If you have a macro language, it's best to accomplish all of this via macro
or script.
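
A minimal sketch of steps 0 and 1 in Python, in case a script is easier to
reach for than an editor macro. It uses only the standard library. The names
`SKIPPED`, `MarkupInventory` and `inventory` are my own, and the set of tags
to skip is an assumption you would adjust to your own files.

```python
from collections import Counter
from html.parser import HTMLParser

# Step 0: elements whose content is not text to be converted.
# Assumption: these tag names cover the non-text sections of your files.
SKIPPED = {"script", "style", "nav"}


class MarkupInventory(HTMLParser):
    """Count each (tag, class, style) combination outside skipped elements."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return  # inside a menu, script, etc.: ignore
        attrs = dict(attrs)
        self.counts[(tag, attrs.get("class"), attrs.get("style"))] += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1


def inventory(html_text):
    """Step 1: return a Counter of the markup combinations in one document."""
    parser = MarkupInventory()
    parser.feed(html_text)
    parser.close()
    return parser.counts
```

To cover a whole directory, add the counters from each file together and
print `most_common()`; the resulting list is the markup you then have to map
in step 2.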


On Sat, Feb 2, 2019 at 4:46 AM David Haslam <dfhdfh at protonmail.com> wrote:

> As a further thought - thinking of Greg’s examples of HTML superscripts.
>
> It’s usually not essential to consider that verse tags are superscripted
> or given a fancy colour, etc.
>
> That they are verse numbers is usually evident by their position. This
> being the case, squashing the character level style for them simply makes
> it simpler for them to be tagged using “\v “ as part of the conversion to
> USFM.
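
A rough sketch of that idea in Python, assuming every `<sup>` or `<span>`
that wraps nothing but digits is a verse number. That assumption fails if
footnote callers are marked up the same way, so check your markup first. The
function name is my own.

```python
import re

# Assumption: an element containing only digits is a verse number.
# Any attributes (colour, size, class) are discarded along with the tag.
VERSE_NUMBER = re.compile(
    r"\s*<(sup|span)\b[^>]*>\s*(\d+)\s*</\1>\s*", re.IGNORECASE
)


def verse_numbers_to_usfm(html_text):
    """Replace styled verse numbers with USFM verse markers, one per line."""
    converted = VERSE_NUMBER.sub(
        lambda m: "\n\\v " + m.group(2) + " ", html_text
    )
    return converted.strip()
```

Each verse ends up on its own line starting with `\v `, which is the shape
the later USFM steps expect.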
>
> Regards,
>
> David.
>
>
> On Sat, Feb 2, 2019 at 08:30, refdoc at gmx.net <refdoc at gmx.net> wrote:
>
> Greg has nailed it.
>
> Practically, I first try to work out whether the file follows any kind of
> pattern or is just a pile of junk. Too often the latter is the case, and
> life is too short to bother.
>
> If there is a pattern, it may be expressed in CSS, in HTML tags, or in
> combinations of both. Some patterns may exist only in the actual text.
>
> My approach has always been to recognise as many patterns as I can find and
> then nuke the rest. Then I use any technology I know of (regex, XSL,
> whatever) to transform each bit into something useful in OSIS.
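
One way to sketch "recognise, then nuke the rest" with regular expressions
in Python. The three rules are purely illustrative assumptions about what a
source file might contain, and regexes like these only cope with simple,
non-nested markup. Recognised markup is parked behind sentinel characters so
that the final tag-stripping pass cannot delete it.

```python
import re

# Private-use sentinels stand in for < and > until leftovers are removed.
OPEN, CLOSE = "\ue000", "\ue001"


def osis(tag):
    """Return an OSIS tag written with sentinels instead of angle brackets."""
    return OPEN + tag + CLOSE


# Hypothetical rules: (pattern, replacement) for markup already recognised.
RULES = [
    (re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.S | re.I),
     osis("title") + r"\1" + osis("/title")),
    (re.compile(r"<i\b[^>]*>(.*?)</i>", re.S | re.I),
     osis('hi type="italic"') + r"\1" + osis("/hi")),
    (re.compile(r"<p\b[^>]*>(.*?)</p>", re.S | re.I),
     osis("p") + r"\1" + osis("/p")),
]

LEFTOVER_TAG = re.compile(r"<[^>]+>")


def transform(fragment):
    """Apply the recognised patterns, then strip every remaining HTML tag."""
    for pattern, replacement in RULES:
        fragment = pattern.sub(replacement, fragment)
    fragment = LEFTOVER_TAG.sub("", fragment)  # nuke the rest
    return fragment.replace(OPEN, "<").replace(CLOSE, ">")
```

Because the process is iterative, new rules get added to `RULES` as further
patterns emerge; anything still unrecognised is simply dropped.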
>
> Usually this is an iterative process, with some patterns only emerging as I
> go along and others turning out less clear than I originally thought.
>
> Peter
>
>
> -------- Original Message --------
> Subject: Re: [sword-devel] Tool for convertion html to osis
> From: Greg Hellings
> To: SWORD Developers' Collaboration Forum
> CC:
>
>
> On its surface, this is a very straightforward process.
>
> 1. Convert the HTML (which is a specific set of defined tags using the
> SGML grammar) into XML (not specifically targeting XHTML, as that's a
> slightly different grammar, but all HTML in places where it violates XML
> rules can be rendered into XML-compatible forms as long as it is
> well-formed, since XML is just a strict subset of SGML that requires
> certain things that SGML leaves as optional).
>
> There might be other tools to do this specifically, but you can get by
> with the command line tool `osx` from the Open Jade[0] framework. If you
> use Fedora this is available from the "opensp" package. I presume other
> Linux distributions have it similarly packaged.
>
> 2. Convert the XML version of the HTML into OSIS using an XSLT.
>
> Although the technical outline of this is relatively straightforward, that
> doesn't mean the actual implementation is. Step 1 is pretty simple as long
> as you start from a well-formed HTML document. Step 2 sounds deceptively
> simple. If the HTML embeds CSS or, worse yet, references an external CSS
> document, then you might need to consider that. If there is active
> JavaScript in the document, then you'll need to figure out if that does
> anything important to the text that needs to be preserved.
>
> Additionally, HTML is a presentation format, despite some people's efforts
> to push it away from that. They've pretty much failed at that endeavor. So
> you'll have to figure out what the presentation markup means and convert
> that into OSIS. As an example, a superscript number might always be a verse
> number. But it might not. Encountering "<sup>1</sup>" might be easily
> translated to a meaning in your OSIS document, but it also might not,
> because it might be used by both the verses and the footnotes. Of course,
> those might be delimited by `<sup class="verse">1</sup>` and `<sup
> class="footnote">1</sup>` but it's equally possible that the difference is
> `<sup style="color: green">1</sup>` and `<sup style="color: blue">1</sup>`
> and now what does THAT mean? Of course, they might have gone with `<span
> style="vertical-align: super; font-size: 50%; color: blue; cursor: pointer"
> onclick="show_box();">1</span>` and now you've got to parse the value of
> `show_box` defined in JavaScript somewhere to figure out what's been done
> and what type of character this is.
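
To make the ambiguity concrete, here is a hedged Python sketch that decides
what a marker element is from its attributes. The two mapping tables are
hypothetical: nothing in this thread says which class or colour means what,
so they would have to come from inspecting the real files. Anything with an
`onclick` handler is deliberately reported as unknown, since its meaning
lives in the JavaScript.

```python
import re

# Hypothetical mappings; fill these in from your own inspection of the source.
CLASS_ROLES = {"verse": "verse", "footnote": "note"}
COLOUR_ROLES = {"green": "verse", "blue": "note"}

# Matches the `color` property only, not `background-color`.
COLOUR = re.compile(r"(?:^|;)\s*color\s*:\s*([^;]+)", re.I)


def classify_marker(attrs):
    """Return 'verse', 'note' or 'unknown' for a dict of element attributes."""
    for name in (attrs.get("class") or "").split():
        if name in CLASS_ROLES:
            return CLASS_ROLES[name]
    if "onclick" in attrs:
        return "unknown"  # behaviour is defined in script: needs a human
    match = COLOUR.search(attrs.get("style") or "")
    if match:
        return COLOUR_ROLES.get(match.group(1).strip().lower(), "unknown")
    return "unknown"
```

The useful output of a first run is the list of elements that come back as
"unknown", because those are the cases that need a decision before any OSIS
can be generated.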
>
> So the simplicity of #2 really boils down to the nature of the HTML you're
> dealing with, and if it is exceedingly complex in its own right, how much
> of its own information you need to preserve in the OSIS that you're getting
> out the other end. And without any visibility into the file, none of the
> rest of us can begin to guess at the complexity of that process. But it CAN
> be automated. Like John, I've invested a lot of time back in the day on
> converting Logos XML to OSIS, and I'm happy to say these things are
> possible (just not always easy).
>
> There are a number of people on this list who are, or could be, qualified
> to assist you if there were a lot more information to fill in all the
> details of what I've just described above. However, whether you can engage
> us will depend on the nature of the text you have, the way you've been
> given it, and any distribution requirements and rights that it's held under.
>
> --Greg
>
> [0] http://openjade.sourceforge.net/
>
> On Fri, Feb 1, 2019 at 10:27 AM Cyrille <lafricain79 at gmail.com> wrote:
>
>> Hello,
>> It's all in the title: does someone have a Linux tool to convert HTML
>> files to OSIS?
>> In this case it is for the KD module. I downloaded the HTML source files,
>> but I don't want to spend a lot of work on them. I will work on Bible
>> issues first, not commentaries. But if someone has a tool to do the job
>> quickly...
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://www.crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
>>
>
>