
Simplify Gutenberg scraping (no more rsync, no more fallback URLs / filenames) #97

Open
kelson42 opened this issue Nov 14, 2019 · 32 comments

@kelson42
Contributor

kelson42 commented Nov 14, 2019

Gutenberg scraping is pretty touchy and takes time to run. We should see if it can be simplified.

A start of this discussion happened here: #93 (comment)

@kelson42
Contributor Author

@dattaz @rgaudin Would you be able to summarise why this is so slow/complicated, so we can assess alternatives in a second step?

@kelson42
Contributor Author

Today, we download http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2, which contains metadata for all books (is that the same kind of data as https://www.gutenberg.org/feeds/catalog.rdf.bz2?). This process (download + parsing) takes a few hours. It looks like the RDF is not 100% reliable, because we also run an rsync to list all the files on the GB server so we have a way to check that the URLs given by the RDF really exist there. Only then do we have the correct EPUB URLs and can rely on them to download the data.

@eshellman So the question is: OPDS is a standard and looks easier for us to deal with... but if it suffers from the same data-quality problems as the RDF we use today, then we will end up with exactly the same kind of "bad solution".

@kelson42 kelson42 changed the title Simplify Gutenberg scraping Move to OPDS catalog (Simplify Gutenberg scraping) Jul 13, 2020
@kelson42 kelson42 pinned this issue Jul 13, 2020
@stale

stale bot commented Sep 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Sep 11, 2020
@eshellman
Collaborator

It seems I never commented on this. The PG OPDS is archaic. It does things like embedding covers as blobs in the XML.

I probably meant to take a look at the RDF parsing code to see what the inefficiency is. I'm pretty sure my own parsing code doesn't take nearly as long: https://github.com/gitenberg-dev/gitberg/blob/fc3da308f3ccdfe034d2e873efff9adf6a66730f/gitenberg/metadata/pg_rdf.py#L267

@kelson42
Contributor Author

@eshellman From our perspective having a proper OPDS would be preferable, because this is a standard.

@stale stale bot removed the stale label Jan 13, 2021
@eshellman
Collaborator

We'll probably go straight to OPDS 2.0 (JSON-based) sometime this year.

@stale

stale bot commented Mar 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@rgaudin
Member

rgaudin commented Dec 24, 2022

Has this OPDS 2 effort gone through? https://m.gutenberg.org/ebooks.opds/ is still 1.1 and XML and returns

<!--
DON'T USE THIS PAGE FOR SCRAPING.
Seriously. You'll only get your IP blocked.
Download https://www.gutenberg.org/feeds/catalog.rdf.bz2 instead,
which contains *all* Project Gutenberg metadata in one RDF/XML file.
-->

@kelson42
Contributor Author

The best person to answer is probably @eshellman.

@eshellman
Collaborator

I would not rely on that feed, as it was coded before the OPDS standard had stabilized, and does not get much usage. Implementation of OPDS 2 will happen eventually; most of 2022 was spent on the EPUB3 and HTML5 output files.
In addition to the RDF, there is now a CSV file that gets a fair amount of use.

@benoit74
Collaborator

benoit74 commented Jan 7, 2023

I just had a look into a few things around this issue, and this is what I understood (do not hesitate to correct me if I'm wrong).

When the scraper starts, we now (2022) retrieve two sources:

  • rdf-files.tar.bz2
  • rsync of rsync://dante.pglaf.org/gutenberg/

The RDF tar is opened up and every individual RDF file is used to get:

  • a list of all licenses: slug (PD, None, Copyright) + name (full text representing the license)
  • a list of formats: id (autogenerated) + mime + images (boolean indicating if the format contains images or not) + pattern (how to generate the filename based on book id)
  • a list of authors: gut_id (id of the author in gutenberg, agent in RDF) + last_name + first_names + birth_date + death_date
  • a list of books: id (Gutenberg) + title + subtitle + author_id (link to authors above) + license_id (link to licenses above) + language + downloads + bookshelf + coverpage (unused ???) + a list of format_id (link to formats above)

The rsync populates a table of relative URLs found on the server.

When downloading, the following is done for every book:

  • based on a provided list of formats (pdf, epub, html by default), we look for the format found in the RDF that is most "likely" to represent the file we want to download (e.g. there are typically multiple epub formats available for a single book, but we have to choose one)
  • if we have already stored in the database the download URL for this book in this format, we use it
  • otherwise, we infer a list of potential filenames based on known patterns (e.g. for epub, we look for .../pg{book_id}.epub, .../pg{book_id}-images.epub, .../pg{book_id}-noimages.epub, where .../ is based on the server base path + a folder structure derived from the book id); we then look for the URLs (from rsync) that match; we try to download the file from the matching URLs, and the first successful one is considered the one we want and is stored in the database as the appropriate URL for this book in this format (see the sketch below)
  • the cover is then downloaded from a known URL (if available)

These files are then optimized / cached / stored in the final ZIM.
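
As an illustration of the fallback-filename guessing described in the list above, here is a minimal sketch; the pattern list and the rsync_urls set are placeholders, not the scraper's actual data structures or exact patterns:

```python
# Sketch only: the pattern list and the rsync_urls set are illustrative
# placeholders, not the scraper's actual data structures or exact patterns.
EPUB_PATTERNS = [
    "cache/epub/{id}/pg{id}.epub",
    "cache/epub/{id}/pg{id}-images.epub",
    "cache/epub/{id}/pg{id}-noimages.epub",
]


def guess_epub_url(book_id, rsync_urls):
    """Return the first candidate relative path found in the rsync listing, or None."""
    for pattern in EPUB_PATTERNS:
        candidate = pattern.format(id=book_id)
        if candidate in rsync_urls:
            return candidate
    return None


# e.g. guess_epub_url(1088, rsync_urls) -> "cache/epub/1088/pg1088.epub"
# if that path was listed by rsync
```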

Some remarks:

The links found in the various RDFs seem to always point to the www.gutenberg.org server; is it OK to use it for scraping? (I have no idea why we use dante.pglaf.org instead of www.gutenberg.org.)

If the remarks above are true, and since it is now quite easy and fast to get a list of book ids (via the new CSV file), we can imagine a new processing structure where we first get this CSV and build a list of book ids, and then directly jump to processing individual books, downloading each book's RDF and then its files.

This would mean no more need to rsync tons of stuff, no more need to untar all RDFs when we want to extract only a few books (particularly important for testing and specific languages), and overall very limited upfront time. There may be no big change in overall processing time for a full run, since we would download all RDFs one by one instead of one huge tar, but it is definitely simplified processing from my PoV, and something faster for small tests.
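
A minimal sketch of that flow. The CSV location and column names, and the per-book RDF URL pattern, are assumptions here (the RDF pattern matches the cache/epub/{id}/pg{id}.rdf path seen further down in this thread):

```python
import csv
import io
import urllib.request

# Assumed locations and column names -- to be confirmed against PG's actual feeds.
CATALOG_CSV = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv"
BOOK_RDF = "https://www.gutenberg.org/cache/epub/{id}/pg{id}.rdf"


def book_ids_for_language(lang):
    """Build the list of book IDs for one language from the catalog CSV."""
    with urllib.request.urlopen(CATALOG_CSV) as resp:
        reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
        return [int(row["Text#"]) for row in reader if row.get("Language") == lang]


def fetch_rdf(book_id):
    """Download a single book's RDF instead of untarring the whole rdf-files archive."""
    with urllib.request.urlopen(BOOK_RDF.format(id=book_id)) as resp:
        return resp.read()
```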

@rgaudin
Member

rgaudin commented Jan 9, 2023

since it is now quite easy and fast to get a list of book ids (via the new CSV file), we can imagine a new processing structure where we first get this CSV and build a list of book ids, and then directly jump to processing individual books, downloading each book's RDF and then its files.

The CSV file does include the list of languages, which is great, but it doesn't include the list of formats, so it would not be sufficient to run all the filters (languages, formats, book_ids). It would still work (you'd filter once you've parsed the RDF), but you wouldn't know in advance which books you'd need to process.

One thing that could be done is to replace the indiscriminate extraction and parsing of all RDF files, which takes a lot of time because it has to go through the filesystem.
Instead, we could (after filtering book_ids for the requested languages via the CSV) only extract the relevant books' RDFs from rdf-files.tar.bz2 (in memory) and parse them (from memory).

This would be a lot faster on small selections and probably still faster on large ones, while bundling two consecutive tasks together.
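
A sketch of that targeted in-memory extraction, assuming the member paths inside rdf-files.tar.bz2 follow the cache/epub/{id}/pg{id}.rdf layout (to be verified against the real archive):

```python
import tarfile


def extract_selected_rdfs(tar_path, book_ids):
    """Read only the wanted books' RDF members from rdf-files.tar.bz2, in memory."""
    # Assumed member layout inside the archive -- to be checked against the real file.
    wanted = {"cache/epub/{0}/pg{0}.rdf".format(i): i for i in book_ids}
    rdfs = {}
    with tarfile.open(tar_path, "r:bz2") as tar:
        for member in tar:
            book_id = wanted.get(member.name)
            if book_id is not None:
                rdfs[book_id] = tar.extractfile(member).read()
    return rdfs


# e.g. extract_selected_rdfs("rdf-files.tar.bz2", {1088, 11220})
```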

I don't see how the rsync step can be replaced, though, and it is the longest one for me (40 min today) because we use it to match filenames.
One thing we could do is save its result in the optimization cache. As the recipe runs periodically on a zimfarm worker, there would be a somewhat recent file available for developers to use. Kind of hackish, but it could be useful.

I have no idea why we use dante.pglaf.org instead of www.gutenberg.org

It's the mirror, and that's the recommended way to do it.

@eshellman
Collaborator

www.gutenberg.org has load balancers and security hardware in front of it that aren't very friendly to scrapers - it's architected for large numbers of users. dante just has bandwidth.

@benoit74
Collaborator

Instead, we could (after filtering book_ids for the requested languages via the CSV) only extract the relevant books' RDFs from rdf-files.tar.bz2 (in memory) and parse them (from memory).

This makes a lot of sense to me.

I don't see how the rsync step can be replaced, though, and it is the longest one for me (40 min today) because we use it to match filenames.

I'm not sure I get this. Why do you have to match filenames? Do you mean that the URL on dante is not the same as the one on www.gutenberg.org, which is the URL we get from the RDF?

@rgaudin
Member

rgaudin commented Jan 19, 2023

I'm not sure I get this. Why do you have to match filenames? Do you mean that the URL on dante is not the same as the one on www.gutenberg.org, which is the URL we get from the RDF?

I'm not sure exactly, but I believe many files are referenced in the RDF that are not present on the server, and testing online URLs each time was too slow, inefficient, and wasteful.

Not sure how linked this is to the fact we were using different mirrors in the past.

Anyway, now that the scraper is in better shape, I'd advise you to test whether we still need that rsync step or not.

Should be fairly easy: loop through the file entries in all of the RDF files and check whether those URLs are present in the rsync listing. If none is missing, it would mean we can trust the RDF files.
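
For illustration, a minimal sketch of that check, assuming we already have the set of relative paths from rsync and, per book, the relative paths derived from its RDF (both inputs are placeholders here):

```python
def find_missing(rdf_paths_by_book, rsync_paths):
    """Per book, list the RDF-referenced relative paths absent from the rsync listing."""
    missing = {}
    for book_id, paths in rdf_paths_by_book.items():
        absent = [p for p in paths if p not in rsync_paths]
        if absent:
            missing[book_id] = absent
    return missing


# An empty result over the whole catalog would mean the RDF URLs can be trusted
# and the rsync step is redundant.
```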

@eshellman
Collaborator

If you have examples of missing files, I can check for you what the issue is.

@kelson42
Contributor Author

Actually, if such information is welcome on the PG side, it would be better IMO to share the errors with PG so they can fix them, rather than inventing solutions to circumvent them.

@eshellman
Collaborator

without seeing examples, it's hard to know where the problem lies.

@benoit74
Collaborator

benoit74 commented Jan 19, 2023 via email

@benoit74
Collaborator

I may finally have gained a bit more understanding of why we need to rsync URLs from dante.pglaf.org.

Here is some sample data for book ID 1088:

Mime type: text/html 
URL in RDF: https://www.gutenberg.org/files/1088/1088-h.zip
File downloaded from mirror: http://dante.pglaf.org/1/0/8/1088/1088-h.zip

Mime type: application/epub+zip
URL in RDF: https://www.gutenberg.org/ebooks/1088.epub.images
File downloaded from mirror: http://dante.pglaf.org/cache/epub/1088/pg1088.epub

We see that the URL used on dante.pglaf.org is absolutely not the same relative path as the one mentioned in the RDF.
While on www.gutenberg.org we can use both path structures (/files/1088/1088-h.zip or /1/0/8/1088/1088-h.zip), only the second one works on dante.pglaf.org.

We also see that even the filename is very different for epub.

I also had a look at the RDF file present at http://dante.pglaf.org/cache/epub/1088/pg1088.rdf, but it also only mentions www.gutenberg.org URLs.

And there is almost the same issue regarding the cover image, where multiple resolutions are mentioned in the RDF but only one is available on dante.pglaf.org, with a different filename.

I'm a bit lost regarding what we should do next if we want to simplify scraping further / get rid of the rsync.

@eshellman, are you aware of all this? Any suggestion?
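
For illustration only, here is a sketch of the kind of canonical-to-mirror mapping involved, derived solely from the two book-1088 examples above; it is not the logic of the pg_archive_urls.py module that appears later in this thread, and the per-digit folder rule is a guess:

```python
import re

MIRROR = "http://dante.pglaf.org"


def to_mirror_url(canonical_url):
    """Map the two canonical URL shapes shown above for book 1088 to mirror URLs."""
    # https://www.gutenberg.org/files/1088/1088-h.zip -> .../1/0/8/1088/1088-h.zip
    m = re.match(r"https?://www\.gutenberg\.org/files/(\d+)/(.+)$", canonical_url)
    if m:
        book_id, filename = m.groups()
        digits = "/".join(book_id[:-1]) or "0"  # guessed rule for the per-digit folders
        return "{}/{}/{}/{}".format(MIRROR, digits, book_id, filename)
    # https://www.gutenberg.org/ebooks/1088.epub.images -> .../cache/epub/1088/pg1088.epub
    m = re.match(r"https?://www\.gutenberg\.org/ebooks/(\d+)\.epub(\.images|\.noimages)?$", canonical_url)
    if m:
        book_id = m.group(1)
        return "{0}/cache/epub/{1}/pg{1}.epub".format(MIRROR, book_id)
    return None


# to_mirror_url("https://www.gutenberg.org/files/1088/1088-h.zip")
#   -> "http://dante.pglaf.org/1/0/8/1088/1088-h.zip"
```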

@eshellman
Collaborator

eshellman commented Jan 31, 2023 via email

@eshellman
Collaborator

eshellman commented Jan 31, 2023 via email

@benoit74
Collaborator

benoit74 commented Feb 1, 2023

Either or both changes would be more than welcome! Thank you, keep us informed.

@eshellman
Collaborator

I have created a small module bringing together the relevant code, with a method that turns the canonical URLs into archive URLs:
https://github.com/gutenbergtools/libgutenberg/blob/master/pg_archive_urls.py

I've identified 3 files (of 1.2 million) in a legacy format that need fixing from our side

@eshellman
Collaborator

Didn't make sense to mess with the RDF.

@rgaudin
Member

rgaudin commented Feb 3, 2023

Great news! We'll test and integrate it.

@benoit74
Collaborator

benoit74 commented Feb 3, 2023

Looks very promising, thank you very much!

I've integrated it manually (i.e. copy/pasted the code until you release it) and modified the DB schema to store more info for debugging. I will perform a full run of the parser to compare the URLs we currently download (guessed from rsync results + patterns) with the URLs available in the RDF as translated by your lib. I'll let you know.

A first try on one single book is OK (i.e. we could have used the RDF URLs directly).

@benoit74
Collaborator

benoit74 commented Feb 4, 2023

I did a full run, looking for all formats (epub, html, pdf) and all languages.

Only 59251 books have at least one of those formats. I'm downloading a full archive to confirm this number is OK, but it looks plausible. By the way, is there any plan to support those additional formats (text/plain + audio books)? Or any reason not to add them?

For most books, the URLs we currently download for epub, html, pdf and the cover image (guessed from rsync results + patterns) are already present in the RDF files and are identical once converted through the small module mentioned above.

The only exceptions are below. @eshellman could you have a look and confirm this should / could be fixed on your side?

book_id|format|download_url
5031|html|http://dante.pglaf.org/5/0/3/5031/5031-h.zip
11220|html|http://dante.pglaf.org/cache/epub/11220/pg11220.html.utf8
15831|pdf|http://dante.pglaf.org/1/5/8/3/15831/15831-pdf.pdf
10802|html|http://dante.pglaf.org/cache/epub/10802/pg10802.html.utf8
28701|html|http://dante.pglaf.org/2/8/7/0/28701/28701-h.zip
28803|html|http://dante.pglaf.org/2/8/8/0/28803/28803-h.zip
28821|html|http://dante.pglaf.org/2/8/8/2/28821/28821-h.zip
28959|html|http://dante.pglaf.org/2/8/9/5/28959/28959-h.zip
28969|html|http://dante.pglaf.org/2/8/9/6/28969/28969-h.zip
31100|html|http://dante.pglaf.org/3/1/1/0/31100/31100-h.zip
29156|html|http://dante.pglaf.org/2/9/1/5/29156/29156-h.zip
29434|html|http://dante.pglaf.org/2/9/4/3/29434/29434-h.zip
29441|html|http://dante.pglaf.org/2/9/4/4/29441/29441-h.zip
29467|html|http://dante.pglaf.org/2/9/4/6/29467/29467-h.zip
30580|html|http://dante.pglaf.org/3/0/5/8/30580/30580-h.zip
41450|html|http://dante.pglaf.org/4/1/4/5/41450/41450-h.zip
51830|html|http://dante.pglaf.org/5/1/8/3/51830/51830-h.zip
66127|html|http://dante.pglaf.org/6/6/1/2/66127/66127-h.zip
69909|cover|http://dante.pglaf.org/cache/epub/69909/pg69909.cover.medium.jpg
69910|cover|http://dante.pglaf.org/cache/epub/69910/pg69910.cover.medium.jpg
69911|cover|http://dante.pglaf.org/cache/epub/69911/pg69911.cover.medium.jpg
69912|cover|http://dante.pglaf.org/cache/epub/69912/pg69912.cover.medium.jpg
69913|cover|http://dante.pglaf.org/cache/epub/69913/pg69913.cover.medium.jpg
69915|cover|http://dante.pglaf.org/cache/epub/69915/pg69915.cover.medium.jpg
69916|cover|http://dante.pglaf.org/cache/epub/69916/pg69916.cover.medium.jpg
69917|cover|http://dante.pglaf.org/cache/epub/69917/pg69917.cover.medium.jpg
69918|cover|http://dante.pglaf.org/cache/epub/69918/pg69918.cover.medium.jpg
69919|cover|http://dante.pglaf.org/cache/epub/69919/pg69919.cover.medium.jpg
69920|cover|http://dante.pglaf.org/cache/epub/69920/pg69920.cover.medium.jpg
69921|cover|http://dante.pglaf.org/cache/epub/69921/pg69921.cover.medium.jpg
69922|cover|http://dante.pglaf.org/cache/epub/69922/pg69922.cover.medium.jpg
69923|cover|http://dante.pglaf.org/cache/epub/69923/pg69923.cover.medium.jpg
69924|cover|http://dante.pglaf.org/cache/epub/69924/pg69924.cover.medium.jpg
69925|cover|http://dante.pglaf.org/cache/epub/69925/pg69925.cover.medium.jpg
69926|cover|http://dante.pglaf.org/cache/epub/69926/pg69926.cover.medium.jpg
69927|cover|http://dante.pglaf.org/cache/epub/69927/pg69927.cover.medium.jpg
69928|cover|http://dante.pglaf.org/cache/epub/69928/pg69928.cover.medium.jpg
69930|cover|http://dante.pglaf.org/cache/epub/69930/pg69930.cover.medium.jpg
69931|cover|http://dante.pglaf.org/cache/epub/69931/pg69931.cover.medium.jpg
69932|cover|http://dante.pglaf.org/cache/epub/69932/pg69932.cover.medium.jpg
69934|cover|http://dante.pglaf.org/cache/epub/69934/pg69934.cover.medium.jpg
69935|cover|http://dante.pglaf.org/cache/epub/69935/pg69935.cover.medium.jpg
69936|cover|http://dante.pglaf.org/cache/epub/69936/pg69936.cover.medium.jpg
69937|cover|http://dante.pglaf.org/cache/epub/69937/pg69937.cover.medium.jpg
69938|cover|http://dante.pglaf.org/cache/epub/69938/pg69938.cover.medium.jpg
69939|cover|http://dante.pglaf.org/cache/epub/69939/pg69939.cover.medium.jpg

Anyway, this is probably a significant first confirmation that your module works fine, and that we can probably get rid of the rsync step and maybe of other complexities around guessing the appropriate filename for the various formats. I will continue to explore this by looking at how to select the appropriate file from the RDF for the three formats we currently support.

@benoit74
Collaborator

benoit74 commented Feb 4, 2023

The number 59251 is not OK, but it looks like I'm missing some URLs from rsync... I might have messed up with messy data... I will run it once more with the whole rsync step to confirm; I might have missed some records.

@stale

stale bot commented May 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label May 26, 2023
@benoit74 benoit74 changed the title Move to OPDS catalog (Simplify Gutenberg scraping) Simplify Gutenberg scraping (no more rsync, no more fallback URLs / filenames) Mar 5, 2024
@stale stale bot removed the stale label Mar 5, 2024
@benoit74
Collaborator

benoit74 commented Mar 5, 2024

I've updated this issue title to better reflect the current state of the discussion here.

@eshellman
Collaborator

  1. I have updated https://github.com/gutenbergtools/libgutenberg/blob/master/pg_archive_urls.py
  2. PG is now generating a zip file for the HTML5 version of every book, including all of the images. I think these will be much easier to use for openZIM, as well as more efficient wrt bandwidth.
  3. I think I've mentioned this before, but I maintain a list of PG numbers that are not books (and won't have HTML5/zips available) or are not being used: https://github.com/gitenberg-dev/gitberg/blob/master/gitenberg/data/missing.tsv
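
A sketch of using that exclusion list, assuming missing.tsv can be fetched raw from GitHub and that the PG number sits in the first tab-separated column (the actual file layout should be checked):

```python
import csv
import urllib.request

# Assumed raw URL and layout (PG number in the first tab-separated column) -- to be verified.
MISSING_TSV = ("https://raw.githubusercontent.com/gitenberg-dev/gitberg/"
               "master/gitenberg/data/missing.tsv")


def load_excluded_ids():
    """PG numbers that are not books (or are unused) and can be skipped by the scraper."""
    with urllib.request.urlopen(MISSING_TSV) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    excluded = set()
    for row in csv.reader(lines, delimiter="\t"):
        if row and row[0].strip().isdigit():
            excluded.add(int(row[0]))
    return excluded


# book_ids = [i for i in book_ids if i not in load_excluded_ids()]
```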
