[sword-devel] Introducing the Bible Scraper

Fr Cyrille fr.cyrille at tiberiade.be
Sun Jun 2 10:33:12 EDT 2024


Hi Arnaud,
What do you think about moving bible-scraper from the GitHub repo to our 
GitLab repo? I have already imported it, but without the latest commits. 
I'm adding you as a developer on it. 
https://gitlab.com/crosswire-bible-society/bible-scraper

On 02/06/2024 at 11:46, Arnaud Vié wrote:
> Thank you both for your interest!
>
> > What about commentary?
> > https://www.awmi.net/reading/online-bible-commentary/
>
> Not yet, I'm really focusing on bibles for the time being - that's a 
> lot of work already!
> But nothing prevents adapting the solution to commentaries in the 
> future, I'll keep that idea in mind :-)
>
> > If you want to use CzeBKR as your test case, I am ready to help
> > you with any testing or Czech issues or whatever
>
> Thanks a lot!
> I've just pushed a scraper configuration for this bible: 
> https://github.com/UnasZole/bible-scraper/blob/master/src/main/resources/scrapers/GenericHtml/KralickaWikisource.yaml
> Main books were easy to parse - deuterocanonical books extracted from 
> a different manuscript were a bit messier.
> I made a few assumptions (I interpret italics in verse as translation 
> additions, and side notes in deuterocanonical books as section titles, 
> etc.)
> Feel free to test it: after checking out and building the repository, 
> you should just need to run for example:
>
> > ./run.sh scrape -s GenericHtml -i KralickaWikisource -b Ps -c 1 -w USFM
>
> Cheers,
>
> Arnaud
>
> On Sun, 2 June 2024 at 08:50, Matěj Cepl <mcepl at cepl.eu> wrote:
>
>     On Sun Jun 2, 2024 at 1:09 AM CEST, Arnaud Vié wrote:
>     > I'm open to any kind of feedback or suggestions of course !
>     > In particular :
>     >
>     >    - if you have any specific website in mind that you would like to be
>     >    able to build sword modules from, let me know, we can try to add it.
>     >    (Currently I only included a few French websites, but I'm interested to add
>     >    some other languages).
>
>     Sword module CzeBKR is sourced from the Czech WikiSource [1]
>     and there seems to be an official way [2] to get the source
>     in some hopefully more useful formats (plain text, RTF, HTML,
>     EPubs). I was using my own home-grown Python script [3], but it
>     seems that, like all web-scraping scripts, it has rotted away (that
>     script is under some kind of very free open source license,
>     let’s say MIT/X11 … I am going to add the proper LICENSE file
>     momentarily). It started at [4] (look at the source view), but it
>     doesn’t seem to be that useful anymore.
>
>     >    - And if you are knowledgeable about the intellectual property laws in
>     >    other countries, I'm interested : currently, I've added a section to the
>     >    README explaining why the usage of the scraper on any public website is
>     >    allowed in France with references to the related texts, but it would
>     >    probably be useful to have similar information for users from other
>     >    countries.
>
>     I am absolutely certain there are no problems with CzeBKR:
>
>         1. It is WikiSource, so we have somebody else to blame ;)
>         2. The original Bible of Kralice [5] is from the sixteenth
>            century and it is absolutely in the public domain.
>         3. Source for the WikiSource was a scan [6] of the book
>            from 1918, without any authors shown. The works of the only
>            possible editor of that Bible I know about [7] (and he is
>            not shown on the title page, but he was working in the
>            early 20th century with the International Bible Society on
>            the revision of the Bible) are under the Berne Convention
>            (death in 1929 + 75 years) in the public domain as well.
>         4. We are in the EU as well.
>
>     If you want to use CzeBKR as your test case, I am ready to help
>     you with any testing or Czech issues or whatever.
>
>     Blessed Sunday!
>
>     Matěj
>
>     [1] https://cs.wikisource.org/wiki/Bible_kralick%C3%A1_(1918)
>     [2]
>     https://ws-export.wmcloud.org/?lang=cs&title=Bible_kralick%C3%A1_%281918%29
>     [3]
>     https://gitlab.com/crosswire-bible-society/CzeBKR/-/blob/master/kralicka.py
>     [4]
>     https://cs.wikisource.org/wiki/Speci%C3%A1ln%C3%AD:Exportovat_str%C3%A1nky/Bible_kralick%C3%A1_(1918)
>     [5] https://en.wikipedia.org/wiki/Bible_of_Kralice
>     [6] http://archive.org/details/biblsvatanebvec00socigoog
>     [7] https://cs.wikipedia.org/wiki/Jan_Karafi%C3%A1t
>     -- 
>     http://matej.ceplovi.cz/blog/, @mcepl at floss.social
>     GPG Finger: 3C76 A027 CA45 AD70 98B5  BC1D 7920 5802 880B C9D8
>
>     The ratio of literacy to illiteracy is a constant, but nowadays
>     the illiterates can read.
>         -- Alberto Moravia
>
>     _______________________________________________
>     sword-devel mailing list: sword-devel at crosswire.org
>     http://crosswire.org/mailman/listinfo/sword-devel
>     Instructions to unsubscribe/change your settings at above page
>
>

-- 
Do you love the Bible? Are you a theology student? Use the free 
application Xiphos <https://xiphos.org/> or Andbible 
<https://andbible.github.io/> and access source texts, commentaries, 
dictionaries and many other features... 
Contact me for French translations.