[sword-devel] Tool for convertion html to osis

David Haslam dfhdfh at protonmail.com
Sat Feb 2 03:44:19 MST 2019

As a further thought - thinking of Greg’s examples of HTML superscripts.

It’s usually not essential to take into account that verse numbers are superscripted or given a fancy colour, etc.

That they are verse numbers is usually evident from their position. This being the case, squashing their character-level style makes it simpler to tag them with “\v ” as part of the conversion to USFM.
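
A minimal sketch of that idea in Python, assuming (hypothetically) that every digit-only `<sup>` element in the source is a verse number. The pattern and the function name are illustrative only; a real source may need a stricter test.

```python
import re

# Assumption: a verse number is any <sup> element, with any attributes,
# whose content is nothing but digits. The styling is simply thrown away.
VERSE_SUP = re.compile(r"<sup\b[^>]*>\s*(\d+)\s*</sup>\s*", re.IGNORECASE)

def sup_to_usfm_verses(html_fragment: str) -> str:
    """Replace each digit-only <sup> element with a USFM verse marker."""
    return VERSE_SUP.sub(lambda m: f"\\v {m.group(1)} ", html_fragment)

sample = '<sup style="color: green">1</sup>In the beginning <sup>2</sup> And the earth'
print(sup_to_usfm_verses(sample))
```

A superscript that holds anything other than digits (a footnote letter, say) is left untouched, so it can be dealt with in a later pass.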




On Sat, Feb 2, 2019 at 08:30, refdoc at gmx.net <refdoc at gmx.net> wrote:

> Greg has nailed it.
> In practice, I first try to work out whether the file follows any kind of pattern or is just a pile of junk. Too often the latter is the case, and life has become too short to bother.
> If there is a pattern, it may be expressed in CSS, in HTML tags, or in a combination of the two. And some patterns may exist only in the actual text.
> My approach has always been to recognise as many patterns as I can find and then nuke the rest. Then I use any technology I know of (regex, XSL, whatever) to transform each bit into something useful in OSIS.
> Usually this is an iterative process, with some patterns only emerging as I go along and others turning out less clear than originally thought.
> Peter
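
The "recognise what you can, then nuke the rest" approach could be sketched roughly as follows. This is an illustration under assumptions, not Peter's actual tooling: the rule list, the sample markup and the function name are all hypothetical, and regexes on HTML are only safe when the source is as regular as the rules expect.

```python
import re

# Hypothetical rule set: (pattern, replacement) pairs, applied in order.
# Each rule maps one recognised HTML pattern to an OSIS element.
RULES = [
    (re.compile(r"<h[1-6]\b[^>]*>(.*?)</h[1-6]>", re.I | re.S), r"<title>\1</title>"),
    (re.compile(r"<(?:i|em)\b[^>]*>(.*?)</(?:i|em)>", re.I | re.S), r'<hi type="italic">\1</hi>'),
    (re.compile(r"<p\b[^>]*>", re.I), "<p>"),  # keep paragraphs, drop their attributes
]

# Any remaining tag that is not one of the OSIS elements produced above is removed.
LEFTOVER_TAG = re.compile(r"</?(?!(?:title|hi|p)\b)[A-Za-z][^>]*>")

def recognise_then_nuke(html: str) -> str:
    """Apply every known rule, then strip whatever markup is still unrecognised."""
    for pattern, replacement in RULES:
        html = pattern.sub(replacement, html)
    return LEFTOVER_TAG.sub("", html)

sample = ('<div id="x"><h2 class="t">Psalm 1</h2>'
          '<p class="q"><font size="2">Blessed is <em>the</em> man</font></p></div>')
print(recognise_then_nuke(sample))
```

New rules get added to the list as further patterns emerge, which matches the iterative nature of the work.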
> -------- Original Message --------
> Subject: Re: [sword-devel] Tool for convertion html to osis
> From: Greg Hellings
> To: SWORD Developers' Collaboration Forum
> CC:
>> On its surface, this is a very straightforward process.
>> 1. Convert the HTML (which is a specific set of defined tags using the SGML grammar) into XML (not specifically targeting XHTML, as that's a slightly different grammar, but all HTML in places where it violates XML rules can be rendered into XML-compatible forms as long as it is well-formed, since XML is just a strict subset of SGML that requires certain things that SGML leaves as optional).
>> There might be other tools to do this specifically, but you can get by with the command line tool `osx` from the Open Jade[0] framework. If you use Fedora this is available from the "opensp" package. I presume other Linux distributions have it similarly packaged.
>> 2. Convert the XML version of the HTML into OSIS using an XSLT.
>> Although the technical outline of this is relatively straightforward, that doesn't mean the actual implementation is. Step 1 is pretty simple as long as you start from a well-formed HTML document. Step 2 sounds deceptively simple. If the HTML embeds CSS or, worse yet, references an external CSS document, then you might need to consider that. If there is active JavaScript in the document, then you'll need to figure out if that does anything important to the text that needs to be preserved.
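
For readers without `osx` to hand, step 1 can be approximated with Python's standard library. Note that this swaps in `html.parser` for the tool named above and is only a sketch: the class name, the function name and the list of void elements are assumptions, comments and doctypes are dropped, and the input is assumed to have a single root element.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Assumption: a short, incomplete list of HTML elements that never take an end tag.
VOID = {"br", "hr", "img", "meta", "link", "input"}

class XmlWriter(HTMLParser):
    """Re-emit parsed HTML as XML: quoted attributes, closed tags, escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the XML output
        self.open_tags = []  # stack of elements still waiting for an end tag

    def handle_starttag(self, tag, attrs):
        # A valueless attribute such as "hidden" becomes hidden="hidden".
        rendered = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.open_tags:
            return  # ignore stray end tags and end tags of void elements
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")  # also closes anything left open inside
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

def html_to_xml(html: str) -> str:
    writer = XmlWriter()
    writer.feed(html)
    writer.close()
    while writer.open_tags:  # close whatever the source never closed
        writer.out.append(f"</{writer.open_tags.pop()}>")
    return "".join(writer.out)

xml_text = html_to_xml("<p class=a>One<br>two <b>bold</p>")
print(xml_text)
print(ET.fromstring(xml_text).tag)  # parses as XML
```

The result is only as meaningful as the source markup; the hard part remains step 2.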
>> Additionally, HTML is a presentation format, despite some people's efforts to push it away from that. They've pretty much failed at that endeavor. So you'll have to figure out what the presentation markup means and convert that into OSIS. As an example, a superscript number might always be a verse number. But it might not. Encountering `<sup>1</sup>` might be easily translated to a meaning in your OSIS document, but it also might not, because it might be used by both the verses and the footnotes. Of course, those might be delimited by `<sup class="verse">1</sup>` and `<sup class="footnote">1</sup>` but it's equally possible that the difference is `<sup style="color: green">1</sup>` and `<sup style="color: blue">1</sup>` and now what does THAT mean? Of course, they might have gone with `<span style="vertical-align: super; font-size: 50%; color: blue; cursor: pointer" onclick="show_box();">1</span>` and now you've got to parse the value of `show_box` defined in JavaScript somewhere to figure out what's been done and what type of character this is.
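
To make the superscript problem concrete, here is a small Python sketch that sorts `<sup>` elements into roles. Both lookup tables are hypothetical: someone has to inspect the particular source and decide by hand that, say, green means a verse number and blue means a footnote marker. The class and function names are likewise illustrative.

```python
from html.parser import HTMLParser

# Hypothetical mappings, filled in by hand after looking at one specific source.
CLASS_ROLES = {"verse": "verse", "footnote": "note"}
COLOUR_ROLES = {"green": "verse", "blue": "note"}

def parse_style(style):
    """Turn 'color: green; font-size: 50%' into {'color': 'green', 'font-size': '50%'}."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {key.strip().lower(): value.strip().lower() for key, value in pairs}

def classify(attrs):
    """Decide the role of one <sup>: class first, then inline colour, else unknown."""
    attrs = dict(attrs)
    role = CLASS_ROLES.get(attrs.get("class") or "")
    if role:
        return role
    style = parse_style(attrs.get("style") or "")
    return COLOUR_ROLES.get(style.get("color", ""), "unknown")

class SupCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.role = None  # role of the <sup> currently open, if any
        self.found = []   # (role, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag == "sup":
            self.role = classify(attrs)

    def handle_endtag(self, tag):
        if tag == "sup":
            self.role = None

    def handle_data(self, data):
        if self.role is not None:
            self.found.append((self.role, data))

def collect_sups(html):
    collector = SupCollector()
    collector.feed(html)
    collector.close()
    return collector.found

sample = '<sup class="verse">1</sup> text <sup style="color: blue">2</sup><sup>3</sup>'
print(collect_sups(sample))
```

Anything reported as "unknown" is a prompt to go back and look at the source again. The `<span ... onclick="show_box();">` case is not covered at all here and would need its own rule.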
>> So the simplicity of #2 really boils down to the nature of the HTML you're dealing with, and if it is exceedingly complex in its own right, how much of its own information you need to preserve in the OSIS that you're getting out the other end. And without any visibility into the file, none of the rest of us can begin to guess at the complexity of that process. But it CAN be automated. Like John, I've invested a lot of time back in the day on converting Logos XML to OSIS, and I'm happy to say these things are possible (just not always easy).
>> There are a number of people on this list who are, or could be, qualified to assist you if there were a lot more information to fill in all the details of what I've just described above. However, whether you can engage us will depend on the nature of the text you have, the way you've been given it, and any distribution requirements and rights that it's held under.
>> --Greg
>> [0] http://openjade.sourceforge.net/
>> On Fri, Feb 1, 2019 at 10:27 AM Cyrille <lafricain79 at gmail.com> wrote:
>>> Hello,
>>> It's all in the title: does anyone have a Linux tool to convert HTML files to
>>> OSIS?
>>> In this case it is for the KD module. I downloaded the HTML source files,
>>> but I don't want to spend a lot of work on them. I will work on Bible issues
>>> first, not commentary. But if someone has a tool that does the job quickly...