[sword-devel] Introducing the Bible Scraper

Arnaud Vié unas.zole+avie at gmail.com
Mon Jun 3 16:24:53 EDT 2024


Sorry Cyrille, I'll keep the repository in my personal GitHub account for
the time being.

The main reason is that the scraper still operates in a legal grey area:
it allows people to save and convert copyrighted content, and I intend to
provide parser configuration YAML files for as many websites as I can, to
eventually make more and more bibles usable in AndBible and the Sword
ecosystem.
I've done enough research to be confident I'm safe as per French law, but
as I integrate parsers for more bibles from websites in other countries,
there might be complaints. If that happens, it's much better if such
complaints target me alone as it's my personal project, and do not affect
CrossWire as a whole - especially since CrossWire does not really operate
under French jurisdiction and thus might not be as protected as I am.
As Donna said, it's perfectly fine if you want to keep a fork elsewhere,
but I'd suggest making it private, not publicly affiliated with CrossWire.

In addition to that, we had a discussion on the GitHub vs GitLab topic a
few months ago (cf.
http://crosswire.org/pipermail/sword-devel/2024-February/049943.html ), and
I still believe that having some lively OSIS and Sword related projects on
GitHub will improve the visibility of the Sword ecosystem and attract new
developers in the long run, more so than GitLab.

(On that topic, my proposal to take over and rejuvenate the GitHub
crosswire project, specifically the jsword repo, and to add a new repo for
the OSIS specification, still stands.)

Cheers,

Arnaud


On Sun, 2 June 2024 at 16:33, Fr Cyrille <fr.cyrille at tiberiade.be> wrote:

> Hi Arnaud,
> What do you think about moving bible-scraper from the GitHub repo to our
> GitLab repo? I have already done this, but without the latest commits. I
> have made you a developer on it.
> https://gitlab.com/crosswire-bible-society/bible-scraper
>
> On 02/06/2024 at 11:46, Arnaud Vié wrote:
>
> Thank you both for your interest !
>
> > What about commentary?
> > https://www.awmi.net/reading/online-bible-commentary/
>
> Not yet, I'm really focusing on bibles for the time being - that's a lot
> of work already !
> But nothing prevents adapting the solution to commentaries in the future,
> I'll keep that idea in mind :-)
>
> > If you want to use CzeBKR as your test case, I am ready to help
> > you with any testing or Czech issues or whatever
>
> Thanks a lot !
> I've just pushed a scraper configuration for this bible :
> https://github.com/UnasZole/bible-scraper/blob/master/src/main/resources/scrapers/GenericHtml/KralickaWikisource.yaml
> Main books were easy to parse - deuterocanonical books extracted from a
> different manuscript were a bit messier.
> I made a few assumptions (I interpret italics in verse as translation
> additions, and side notes in deuterocanonical books as section titles, etc.)
> Feel free to test it : after checking out and building the repository, you
> should just need to run for example:
>
> > ./run.sh scrape -s GenericHtml -i KralickaWikisource -b Ps -c 1 -w USFM
>
> Cheers,
>
> Arnaud
>
> On Sun, 2 June 2024 at 08:50, Matěj Cepl <mcepl at cepl.eu> wrote:
>
>> On Sun Jun 2, 2024 at 1:09 AM CEST, Arnaud Vié wrote:
>> > I'm open to any kind of feedback or suggestions of course !
>> > In particular :
>> >
>> >    - if you have any specific website in mind that you would like to be
>> >    able to build sword modules from, let me know, we can try to add it.
>> >    (Currently I only included a few French websites, but I'm interested
>> >    in adding some other languages).
>>
>> Sword module CzeBKR is sourced from the Czech WikiSource [1]
>> and there seems to be an official way [2] to get the source
>> in some hopefully more useful formats (plain text, RTF, HTML,
>> EPubs). I was using my own home-grown Python script [3], but
>> as seems to happen with all web-scraping scripts, it has rotted
>> away (that script is under some kind of very free open source
>> license, let’s say MIT/X11 … I am going to add the proper LICENSE
>> file momentarily). It started at [4] (look at the source view), but it
>> doesn’t seem to be that useful anymore.
>>
>> >    - And if you are knowledgeable about the intellectual property laws in
>> >    other countries, I'm interested : currently, I've added a section to the
>> >    README explaining why the usage of the scraper on any public website is
>> >    allowed in France with references to the related texts, but it would
>> >    probably be useful to have similar information for users from other
>> >    countries.
>>
>> I am absolutely certain there are no problems with CzeBKR:
>>
>>     1. It is WikiSource, so we have somebody else to blame ;)
>>     2. The original Bible of Kralice [5] is from the sixteenth
>>        century and it is absolutely in the public domain.
>>     3. Source for the WikiSource was a scan [6] of the book
>>        from 1918, without any authors shown. The works of the only
>>        possible editor of that Bible I know about [7] (and he is
>>        not shown on the title page, but he was working in the
>>        early 20th century with the International Bible Society on
>>        the revision of the Bible) are under the Berne Convention
>>        (death in 1929 + 75 years) in the public domain as well.
>>     4. We are in the EU as well.
>>
>> If you want to use CzeBKR as your test case, I am ready to help
>> you with any testing or Czech issues or whatever.
>>
>> Blessed Sunday!
>>
>> Matěj
>>
>> [1] https://cs.wikisource.org/wiki/Bible_kralick%C3%A1_(1918)
>> [2]
>> https://ws-export.wmcloud.org/?lang=cs&title=Bible_kralick%C3%A1_%281918%29
>> [3]
>> https://gitlab.com/crosswire-bible-society/CzeBKR/-/blob/master/kralicka.py
>> [4]
>> https://cs.wikisource.org/wiki/Speci%C3%A1ln%C3%AD:Exportovat_str%C3%A1nky/Bible_kralick%C3%A1_(1918)
>> [5] https://en.wikipedia.org/wiki/Bible_of_Kralice
>> [6] http://archive.org/details/biblsvatanebvec00socigoog
>> [7] https://cs.wikipedia.org/wiki/Jan_Karafi%C3%A1t
>> --
>> http://matej.ceplovi.cz/blog/, @mcepl at floss.social
>> GPG Finger: 3C76 A027 CA45 AD70 98B5  BC1D 7920 5802 880B C9D8
>>
>> The ratio of literacy to illiteracy is a constant, but nowadays
>> the illiterates can read.
>>     -- Alberto Moravia
>>
>> _______________________________________________
>> sword-devel mailing list: sword-devel at crosswire.org
>> http://crosswire.org/mailman/listinfo/sword-devel
>> Instructions to unsubscribe/change your settings at above page
>>
>
>
> --
> Do you love the Bible? Are you a theology student? Use the free
> application Xiphos <https://xiphos.org/> or Andbible
> <https://andbible.github.io/> and get access to source texts,
> commentaries, dictionaries and many other features... Contact me
> for translations into French.
>