[sword-devel] Introducing the Bible Scraper
Arnaud Vié
unas.zole+avie at gmail.com
Sun Jun 2 05:46:21 EDT 2024
Thank you both for your interest!
> What about commentary?
> https://www.awmi.net/reading/online-bible-commentary/
Not yet. I'm really focusing on Bibles for the time being, and that's a lot
of work already!
But nothing prevents adapting the solution to commentaries in the future.
I'll keep that idea in mind :-)
> If you want to use CzeBKR as your test case, I am ready to help
> you with any testing or Czech issues or whatever
Thanks a lot!
I've just pushed a scraper configuration for this Bible:
https://github.com/UnasZole/bible-scraper/blob/master/src/main/resources/scrapers/GenericHtml/KralickaWikisource.yaml
The main books were easy to parse; the deuterocanonical books, extracted
from a different manuscript, were a bit messier.
I made a few assumptions (I interpret italics in a verse as translation
additions, side notes in deuterocanonical books as section titles, etc.).
Feel free to test it: after checking out and building the repository, you
should just need to run, for example:
> ./run.sh scrape -s GenericHtml -i KralickaWikisource -b Ps -c 1 -w USFM
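To try several chapters in one go, a small wrapper along these lines might help. It is only a sketch: the helper names `build_scrape_command` and `scrape_chapters` are made up for illustration and are not part of bible-scraper, and it assumes the flags behave exactly as in the command above and that it is run from the built repository checkout.

```python
import subprocess

# Flags copied from the example command above; everything else here is
# a hypothetical helper, not part of bible-scraper itself.
BASE_COMMAND = ["./run.sh", "scrape", "-s", "GenericHtml", "-i", "KralickaWikisource"]


def build_scrape_command(book, chapter, writer="USFM"):
    """Return the argv list for scraping one chapter of one book."""
    return BASE_COMMAND + ["-b", book, "-c", str(chapter), "-w", writer]


def scrape_chapters(book, chapters, runner=subprocess.run):
    """Run the scraper once per chapter and return the commands that were run.

    Assumes the current directory is the built repository checkout.
    """
    commands = []
    for chapter in chapters:
        command = build_scrape_command(book, chapter)
        runner(command, check=True)  # raises CalledProcessError on a non-zero exit
        commands.append(command)
    return commands
```

For instance, `scrape_chapters("Ps", range(1, 4))` would invoke the same command for Psalms 1 to 3, stopping at the first chapter that fails.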
Cheers,
Arnaud
On Sun, 2 Jun 2024 at 08:50, Matěj Cepl <mcepl at cepl.eu> wrote:
> On Sun Jun 2, 2024 at 1:09 AM CEST, Arnaud Vié wrote:
> > I'm open to any kind of feedback or suggestions of course !
> > In particular :
> >
> > - If you have any specific website in mind that you would like to be
> > able to build SWORD modules from, let me know and we can try to add it.
> > (Currently I have only included a few French websites, but I'm interested
> > in adding some other languages.)
>
> The SWORD module CzeBKR is sourced from the Czech WikiSource [1],
> and there seems to be an official way [2] to get the source
> in some hopefully more useful formats (plain text, RTF, HTML,
> EPUB). I was using my own home-grown Python script [3], but it
> seems that, as with all web-scraping scripts, it has rotted away (that
> script is under some kind of very free open source license,
> let’s say MIT/X11 … I am going to add the proper LICENSE file
> momentarily). It started at [4] (look at the source view), but that
> doesn’t seem to be that useful anymore.
>
> > - And if you are knowledgeable about the intellectual property laws in
> > other countries, I'm interested: currently, I've added a section to the
> > README explaining why the usage of the scraper on any public website is
> > allowed in France, with references to the related texts, but it would
> > probably be useful to have similar information for users from other
> > countries.
>
> I am absolutely certain there are no problems with CzeBKR:
>
> 1. It is WikiSource, so we have somebody else to blame ;)
> 2. The original Bible of Kralice [5] is from the sixteenth
> century and it is absolutely in the public domain.
> 3. The source for the WikiSource text was a scan [6] of the book
> from 1918, without any authors shown. The works of the only
> possible editor of that Bible I know about [7] (he is
> not shown on the title page, but he was working in the
> early 20th century with the International Bible Society on
> the revision of the Bible) are under the Berne Convention
> (death in 1929 + 75 years) in the public domain as well.
> 4. We are in the EU as well.
>
> If you want to use CzeBKR as your test case, I am ready to help
> you with any testing or Czech issues or whatever.
>
> Blessed Sunday!
>
> Matěj
>
> [1] https://cs.wikisource.org/wiki/Bible_kralick%C3%A1_(1918)
> [2]
> https://ws-export.wmcloud.org/?lang=cs&title=Bible_kralick%C3%A1_%281918%29
> [3]
> https://gitlab.com/crosswire-bible-society/CzeBKR/-/blob/master/kralicka.py
> [4]
> https://cs.wikisource.org/wiki/Speci%C3%A1ln%C3%AD:Exportovat_str%C3%A1nky/Bible_kralick%C3%A1_(1918)
> [5] https://en.wikipedia.org/wiki/Bible_of_Kralice
> [6] http://archive.org/details/biblsvatanebvec00socigoog
> [7] https://cs.wikipedia.org/wiki/Jan_Karafi%C3%A1t
> --
> http://matej.ceplovi.cz/blog/, @mcepl at floss.social
> GPG Finger: 3C76 A027 CA45 AD70 98B5 BC1D 7920 5802 880B C9D8
>
> The ratio of literacy to illiteracy is a constant, but nowadays
> the illiterates can read.
> -- Alberto Moravia
>
> _______________________________________________
> sword-devel mailing list: sword-devel at crosswire.org
> http://crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page
>