[sword-devel] Introducing the Bible Scraper

Arnaud Vié unas.zole+avie at gmail.com
Sat Jun 1 19:09:14 EDT 2024


Hello all,

Cyrille has already teased it in some of his previous mails on this list:
for several months I've been working on a tool to scrape bibles from any
web page into a standard format (OSIS and USFM outputs are supported), the
Bible Scraper.
It mostly serves two purposes:

   - *Help convert "loosely formatted" bibles, such as bibles
   transcribed from facsimiles on Wikisource, to a standard semantic format.*
   These bibles usually have some light formatting that aims at replicating
   the visual appearance of the original document, but without a strong
   semantic markup. With proper configuration, the scraper can convert those
   to a fully formed OSIS or USFM document, as long as the formatting is
   consistent throughout the bible.
   This is the use case Cyrille has been experimenting with a lot recently,
   and one where we have been achieving promising results.

   - *Allow individual users to convert bibles, which are freely available
   on the web but which we don't have the rights to redistribute, into SWORD
   modules for their personal usage*.
   This relies on the right to make copies for personal use, which is quite
   strongly upheld in French law (and probably in most other European
   countries, as there are texts on the topic from the CJEU as well): as long
   as a user has legitimate access to the contents he wishes to copy, he is
   allowed to download and process them for personal use. Since the scraper is
   just software that any user can run on his own machine, there is no
   intermediary that could be accused of illegitimate "redistribution" in any
   form.
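
To make the first purpose above more concrete, here is a minimal sketch of
the kind of conversion involved. It is not the Bible Scraper's actual code or
configuration: the function name `chapter_to_osis` and the input convention
(each verse number wrapped in `<sup>...</sup>`) are assumptions made purely
for illustration, and the output is only a fragment, not a complete OSIS
document.

```python
import re
from xml.sax.saxutils import escape

# Assumed (hypothetical) page convention: each verse number is rendered
# as superscript, e.g. "<sup>1</sup>In the beginning ...".
VERSE_MARKER = re.compile(r"<sup>(\d+)</sup>")


def chapter_to_osis(book, chapter, html):
    """Turn one loosely formatted chapter into OSIS-style verse elements."""
    # Splitting on a pattern with one capture group yields:
    # [text before first marker, number1, text1, number2, text2, ...]
    parts = VERSE_MARKER.split(html)
    lines = [f'<chapter osisID="{book}.{chapter}">']
    for number, text in zip(parts[1::2], parts[2::2]):
        osis_id = f"{book}.{chapter}.{number}"
        lines.append(f'  <verse osisID="{osis_id}">{escape(text.strip())}</verse>')
    lines.append("</chapter>")
    return "\n".join(lines)


sample = (
    "<sup>1</sup>In the beginning God created the heaven and the earth. "
    "<sup>2</sup>And the earth was without form, and void."
)
print(chapter_to_osis("Gen", 1, sample))
```

This only works because the visual convention is applied consistently: the
visual cue (a superscript number) is mapped to a semantic element (a verse
with an identifier), which is why consistent formatting throughout the bible
matters.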

In its current state, the tool is still mostly targeted at developers (I
don't yet publish a downloadable artifact, so interested users have to
clone the Git repo and run a Maven build), but it's becoming mature enough
to be shared with those who want to have a look:
https://github.com/UnasZole/bible-scraper

I'm open to any kind of feedback or suggestions, of course!
In particular:

   - If you have any specific website in mind that you would like to be
   able to build SWORD modules from, let me know and we can try to add it.
   (Currently I have only included a few French websites, but I'm interested
   in adding some other languages.)
   - And if you are knowledgeable about the intellectual property laws in
   other countries, I'd like to hear from you: currently, I've added a
   section to the README explaining why the usage of the scraper on any
   public website is allowed in France, with references to the related
   texts, but it would probably be useful to have similar information for
   users from other countries.

Thanks all and best regards,

Arnaud