[sword-devel] Tool for converting HTML to OSIS

David Haslam dfhdfh at protonmail.com
Fri Feb 1 15:37:50 MST 2019


HTML files can be opened in Microsoft Word, so my first step is to open the file there and save it as RTF.

I then use WordPad to open and resave the RTF file. This reduces size and clutter.

At this stage, one needs to determine whether any of the text styles are semantically significant. For example, are italics used for added words? And has anything important already been lost?

The key point is that RTF files can be processed by scripts or filters. You soon learn which tags are useful.
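
One way to learn which RTF tags a file actually uses is to tally its control words. The following is only a minimal sketch in Python; the function name `count_control_words` is made up for illustration, and the pattern ignores corner cases such as an escaped backslash followed by letters.

```python
import re
from collections import Counter

# An RTF control word is a backslash, a run of letters,
# and an optional (possibly negative) numeric parameter.
CONTROL_WORD = re.compile(r"\\([a-zA-Z]+)(-?\d+)?")

def count_control_words(rtf: str) -> Counter:
    """Tally control words by name, ignoring their numeric parameters."""
    return Counter(m.group(1) for m in CONTROL_WORD.finditer(rtf))

if __name__ == "__main__":
    sample = r"{\rtf1\ansi \b bold\b0  \i added\i0  plain}"
    for name, n in count_control_words(sample).most_common():
        print(name, n)
```

Running this over a real file (read in as text) gives a quick list of the formatting tags worth looking at, such as `b` for bold and `i` for italics.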

Assuming such words have been marked with some non-RTF tags, so that the next step no longer loses the markup, that step is to open the file with WordPad and save it as Unicode text (which gives UCS-2, aka UTF-16 LE).
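
As an illustration of the marking that has to happen before the RTF is saved as text, here is a rough Python sketch that turns the RTF italic toggles `\i` and `\i0` into literal markers. The `<i>` and `</i>` markers and the name `mark_italics` are assumptions chosen for the example, not something the workflow prescribes. It also assumes italics are switched on and off with explicit `\i` ... `\i0` pairs; italics scoped by a group such as `{\i text}` would get an opening marker with no closing one and need separate handling.

```python
import re

# \i turns italics on; \i0 turns them off. The lookahead stops the
# patterns from matching longer control words such as \ilvl or \itap.
# A single space after a control word is its delimiter, so it is consumed.
ITALIC_OFF = re.compile(r"\\i0(?![a-zA-Z0-9-]) ?")
ITALIC_ON = re.compile(r"\\i(?![a-zA-Z0-9-]) ?")

def mark_italics(rtf: str, open_tag: str = "<i>", close_tag: str = "</i>") -> str:
    """Replace RTF italic toggles with plain-text markers (hypothetical tags)."""
    rtf = ITALIC_OFF.sub(close_tag, rtf)
    rtf = ITALIC_ON.sub(open_tag, rtf)
    return rtf

if __name__ == "__main__":
    print(mark_italics(r"In the beginning \i was\i0  the Word"))
```

Because the markers are ordinary characters as far as RTF is concerned, they are still there after the file is saved as plain text.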

Open the text file with (e.g.) Notepad++ and change the encoding to UTF-8, and resave.
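
The same re-encoding can be scripted instead of done in an editor. This is a minimal Python sketch, assuming the text file starts with a byte-order mark so that the `utf-16` codec can detect the byte order; the function names are illustrative only.

```python
from pathlib import Path

def utf16_bytes_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 (byte order taken from the BOM) and re-encode as UTF-8.

    The BOM is dropped on decoding, so the output has no BOM.
    Line endings are left exactly as they were.
    """
    return data.decode("utf-16").encode("utf-8")

def convert_file(src: str, dst: str) -> None:
    """Read a UTF-16 text file and write a UTF-8 copy."""
    Path(dst).write_bytes(utf16_bytes_to_utf8(Path(src).read_bytes()))
```

For example, `convert_file("commentary-utf16.txt", "commentary-utf8.txt")` (both file names hypothetical) writes a UTF-8 copy next to the original.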

Now the rest of the scripting can be done on the plain text.
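
As one small example of such scripting, suppose the italics really do mark added words and were tagged earlier with literal `<i>` and `</i>` markers (both of these are assumptions for the sake of the example). A Python sketch could then wrap them in the OSIS `transChange` element and XML-escape everything else:

```python
import re
from xml.sax.saxutils import escape

# Hypothetical plain-text markers placed around italic runs earlier on.
MARKED = re.compile(r"<i>(.*?)</i>", re.DOTALL)

def italics_to_trans_change(text: str) -> str:
    """Wrap marked runs in <transChange type="added"> and escape the rest."""
    parts = []
    pos = 0
    for m in MARKED.finditer(text):
        parts.append(escape(text[pos:m.start()]))   # text before the marker
        parts.append('<transChange type="added">'
                     + escape(m.group(1))
                     + "</transChange>")
        pos = m.end()
    parts.append(escape(text[pos:]))                 # trailing text
    return "".join(parts)

if __name__ == "__main__":
    print(italics_to_trans_change("God <i>is</i> love & light"))
```

This handles only one inline feature. Book, chapter, and verse structure would need their own passes, and in a commentary italics may well mean something other than added words.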

I’ve had success with this mixed, general-purpose approach on several projects.

[The first step can be done using LibreOffice, if that’s what you prefer.]

Best regards,

David

On Fri, Feb 1, 2019 at 22:07, Dudeck, John <John.Dudeck at sim.org> wrote:

> I might just say, from my recent experience, that creating OSIS from other sources is not a trivial matter.
>
> Depending on whether you are creating a Bible, a Commentary, or a GenBook, the process is not the same.
>
> It took me two years to develop Perl scripts that convert from Logos XML to OSIS for Bibles, Commentaries, GenBooks, and Dictionaries.
>
> For example, even though Logos XML is well-structured, my converter for Bibles is customized to the three Bible texts that it converted, and using it for other Bibles will require further customization for each. Commentaries and GenBooks are handled in a more generic way, without need for further customization.
>
> OSIS is mainly a semantic markup scheme, highly adapted to Scripture, but little else. It has very limited formatting capabilities, and rendering is mostly left up to the client user interface. Since HTML is a totally flexible structure, you need a way to map the structural elements in your source to structural elements in OSIS. You also need a way to deal with CSS.
>
> I wish I had an HTML to OSIS converter to offer you, but maybe somebody else has come up with a method that is straightforward.
>
> John
>
>> Hello,
>> It's all in the title: does someone have a Linux tool to convert HTML files to
>> OSIS?
>> In this case it is for the KD module. I downloaded the HTML source files
>> but I don't want to do a lot of work on them. I will work on Bible issues first,
>> not commentaries. But if someone has a tool to do the job quickly...
>
> John Dudeck
> Programmer at Editions Cle                             Lyon, France
> john.dudeck at sim.org                            john at editionscle.com
> --
> "All programmers are optimists." -- Frederick Brooks