[sword-devel] Tool for converting HTML to OSIS

Greg Hellings greg.hellings at gmail.com
Fri Feb 1 16:14:48 MST 2019


On its surface, this is a very straightforward process.

1. Convert the HTML into XML. HTML is a specific set of tags defined using
the SGML grammar, and XML is a strict subset of SGML that requires certain
things SGML leaves optional (closing tags and quoted attribute values, for
example). This step does not specifically target XHTML, which is a slightly
different grammar. The point is that wherever well-formed HTML violates XML
rules, it can be rendered into an XML-compatible form.

There might be other tools to do this specifically, but you can get by with
the command line tool `osx` from the Open Jade[0] framework. If you use
Fedora this is available from the "opensp" package. I presume other Linux
distributions have it similarly packaged.

2. Convert the XML version of the HTML into OSIS using an XSLT.
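
As a rough illustration of step 1, here is a minimal sketch that uses only the Python standard library rather than `osx`. It is an assumption-laden stand-in, not a replacement for a real SGML parser: the `root` wrapper element, the class name, and the function name are all hypothetical, and it only copes with reasonably well-formed HTML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlTreeBuilder(HTMLParser):
    """Build an ElementTree from reasonably well-formed HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) become empty strings.
        clean = {k: (v if v is not None else "") for k, v in attrs}
        elem = ET.SubElement(self.stack[-1], tag, clean)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close back to the matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text following a child belongs in that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html: str) -> str:
    """Return the HTML re-serialized as well-formed XML (hypothetical helper)."""
    builder = HtmlTreeBuilder()
    builder.feed(html)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")
```

Unclosed elements such as a trailing `<p>` are simply closed at the end of input, and unquoted attribute values come out quoted. Comments and the doctype are dropped in this sketch.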

Although the technical outline of this is relatively straightforward, that
doesn't mean the actual implementation is. Step 1 is pretty simple as long
as you start from a well-formed HTML document. Step 2 sounds deceptively
simple. If the HTML embeds CSS or, worse yet, references an external CSS
document, then you might need to consider that. If there is active
JavaScript in the document, then you'll need to figure out if that does
anything important to the text that needs to be preserved.
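
Before writing any XSLT, it can help to find out how much CSS and JavaScript the source files actually carry. The following is a minimal sketch of such an audit, assuming nothing beyond the Python standard library. The class name, function name, and the categories it counts are my own invention for illustration.

```python
from html.parser import HTMLParser


class StyleScriptAudit(HTMLParser):
    """Count the places where CSS or JavaScript may affect the text."""

    def __init__(self):
        super().__init__()
        self.findings = {
            "embedded_css": 0,    # <style> blocks
            "external_css": 0,    # <link rel="stylesheet">
            "inline_style": 0,    # style="..." attributes
            "scripts": 0,         # <script> elements
            "event_handlers": 0,  # onclick="...", onload="...", etc.
        }

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "style":
            self.findings["embedded_css"] += 1
        elif tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.findings["external_css"] += 1
        elif tag == "script":
            self.findings["scripts"] += 1
        if "style" in attrs:
            self.findings["inline_style"] += 1
        self.findings["event_handlers"] += sum(
            1 for name in attrs if name.startswith("on")
        )


def audit(html: str) -> dict:
    """Return a count of each kind of styling or scripting hook (hypothetical helper)."""
    parser = StyleScriptAudit()
    parser.feed(html)
    parser.close()
    return parser.findings
```

If every count comes back zero, step 2 is probably close to the simple case. Anything non-zero is a place to look by hand before trusting an automated conversion.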

Additionally, HTML is a presentation format, despite some people's efforts
to push it away from that. They've pretty much failed at that endeavor. So
you'll have to figure out what the presentation markup means and convert
that into OSIS. As an example, a superscript number might always be a verse
number. But it might not. Encountering "<sup>1</sup>" might be easily
translated to a meaning in your OSIS document, but it also might not,
because it might be used by both the verses and the footnotes. Of course,
those might be delimited by `<sup class="verse">1</sup>` and `<sup
class="footnote">1</sup>` but it's equally possible that the difference is
`<sup style="color: green">1</sup>` and `<sup style="color: blue">1</sup>`
and now what does THAT mean? Of course, they might have gone with `<span
style="vertical-align: super; font-size: 50%; color: blue; cursor: pointer"
onclick="show_box();">1</span>` and now you've got to parse the value of
`show_box` defined in JavaScript somewhere to figure out what's been done
and what type of character this is.
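
To make the superscript problem concrete, here is a minimal sketch of the kind of classification the conversion has to perform, written against the XML produced by step 1. The mappings in `CLASS_ROLES` and `COLOR_ROLES` are purely hypothetical (green for verses and blue for footnotes is a guess for illustration). The real meaning can only be established by inspecting the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings; they must be worked out from the real source files.
CLASS_ROLES = {"verse": "verse", "footnote": "footnote"}
COLOR_ROLES = {"green": "verse", "blue": "footnote"}


def parse_style(style: str) -> dict:
    """Split an inline style attribute into a {property: value} dict."""
    props = {}
    for decl in style.split(";"):
        name, sep, value = decl.partition(":")
        if sep:
            props[name.strip().lower()] = value.strip().lower()
    return props


def classify(elem: ET.Element) -> str:
    """Return 'verse', 'footnote', or 'unknown' for a superscript-like element."""
    # Script-driven markup cannot be classified without reading the script.
    if any(name.startswith("on") for name in elem.attrib):
        return "unknown"
    # An explicit class is the easy case.
    for cls in elem.get("class", "").split():
        if cls in CLASS_ROLES:
            return CLASS_ROLES[cls]
    # Otherwise fall back to presentation hints in the inline style.
    props = parse_style(elem.get("style", ""))
    is_super = elem.tag == "sup" or props.get("vertical-align") == "super"
    if is_super and props.get("color") in COLOR_ROLES:
        return COLOR_ROLES[props["color"]]
    return "unknown"
```

Everything that comes back as `unknown`, including the bare `<sup>1</sup>` and the `onclick` span, still needs a human decision. In an XSLT the same logic would become a set of templates matching on `@class` and `@style`.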

So the simplicity of #2 really boils down to the nature of the HTML you're
dealing with, and if it is exceedingly complex in its own right, how much
of its own information you need to preserve in the OSIS that you're getting
out the other end. And without any visibility into the file, none of the
rest of us can begin to guess at the complexity of that process. But it CAN
be automated. Like John, I've invested a lot of time back in the day on
converting Logos XML to OSIS, and I'm happy to say these things are
possible (just not always easy).

There are a number of people on this list who are qualified to assist you,
given a lot more information to fill in the details of what I've described
above. However, whether you can engage us will depend on the nature of the
text you have, the way you've been given it, and any distribution
requirements and rights that it's held under.

--Greg

[0] http://openjade.sourceforge.net/

On Fri, Feb 1, 2019 at 10:27 AM Cyrille <lafricain79 at gmail.com> wrote:

> Hello,
> It's all in the title: does someone have a Linux tool to convert HTML
> files to OSIS?
> In this case it is for the KD module. I downloaded the HTML source files,
> but I don't want to do a lot of work on them. I will work on Bible issues
> first, not commentary. But if someone has a tool to do the job quickly...