Simplify Gutenberg scraping (no more rsync, no more fallback URLs / filenames) #97
Today, we download http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2, which has metadata for all books (is that the same kind of data as https://www.gutenberg.org/feeds/catalog.rdf.bz2?). This process (download + parsing) takes a few hours. It looks like the RDF is not 100% reliable, because we run an rsync to list all the files on the Gutenberg server so that we have a way to check that the URLs given by the RDF really exist on the server. Only then do we have correct EPUB URLs that we can rely on to download the data.

@eshellman So the question is: OPDS is a standard and looks easier for us to deal with... but if it suffers from the same data-quality problems as the RDF we use today, then we will end up with exactly the same kind of "bad solution".
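For context, a minimal sketch of what parsing one of those per-book RDF records involves, using rdflib and assuming the usual dcterms/pgterms layout of the catalogue records; the file path and query shape are illustrative, not the scraper's actual code:

```python
# Sketch only: list file URLs and mime types from a single Project Gutenberg
# RDF record (e.g. cache/epub/1088/pg1088.rdf). Assumes the usual
# dcterms/pgterms vocabulary; not the scraper's actual parsing code.
import rdflib

g = rdflib.Graph()
g.parse("pg1088.rdf", format="xml")  # one book's RDF record, path is illustrative

QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?file ?mime WHERE {
    ?file dcterms:isFormatOf ?ebook .
    OPTIONAL { ?file dcterms:format ?fmt . ?fmt rdf:value ?mime . }
}
"""

for row in g.query(QUERY):
    print(row.file, row.mime)
```

Repeating this over all ~70,000 records extracted from the tarball is the part that adds up to hours.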
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
It seems I never commented on this. The PG OPDS is archaic. It does things like embedding covers as blobs in the XML. Probably I meant to take a look at the RDF parsing code to see what the inefficiency is. I'm pretty sure my own parsing code doesn't take nearly as long: https://github.com/gitenberg-dev/gitberg/blob/fc3da308f3ccdfe034d2e873efff9adf6a66730f/gitenberg/metadata/pg_rdf.py#L267
@eshellman From our perspective having a proper OPDS would be preferable, because this is a standard.
We'll probably go straight to OPDS 2.0 (JSON-based) sometime this year.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Has this OPDS2 effort gone through? https://m.gutenberg.org/ebooks.opds/ is still...
The best person to answer is probably @eshellman.
I would not rely on that feed, as it was coded before the OPDS standard had stabilized, and does not get much usage. Implementation of OPDS 2 will happen eventually; most of 2022 was spent on the EPUB3 and HTML5 output files.
I just had a look into some stuff around this issue, and this is what I understood (do not hesitate to correct me if I'm wrong). When the scraper starts, we now (2022) retrieve two sources: the RDF tar and the rsync file listing.

The RDF tar is opened up and every individual RDF file is used to extract the book's metadata. The rsync listing populates a table of relative URLs found on the server; when downloading a book, this table is used to find the files to fetch. These files are then optimized / cached / stored in the final ZIM.

Some remarks:

The links found in the various RDFs seem to always point to the www.gutenberg.org server; is it OK to use it for scraping? (I have no idea why we use dante.pglaf.org instead of www.gutenberg.org.)

If the remarks above are true, and since it is now quite easy / fast to get a list of book IDs (via the new CSV file), we can imagine a new processing structure: first get this CSV and build a list of book IDs, then jump directly to processing individual books, downloading each one's RDF and then its files. This would mean no more need to rsync tons of stuff, no more need to untar all the RDFs when we want to extract only a few books (particularly important for testing and for specific languages), and overall very limited upfront time. Maybe no big change in overall processing time when running a full scrape, since we would download all the RDFs one by one instead of one huge tar, but it would definitely simplify processing from my PoV and make small tests faster.
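A rough sketch of that proposed CSV-first flow; the CSV URL, its column names ("Text#", "Language") and the per-book RDF URL pattern are assumptions to be verified, not the scraper's current code:

```python
# Sketch of the proposed CSV-first flow. The CSV URL, its column names and the
# per-book RDF URL pattern are assumptions to be verified.
import csv
import io
import urllib.request

CATALOG_CSV = "https://www.gutenberg.org/cache/epub/feeds/pg_catalog.csv"  # assumed URL

def list_book_ids(languages=None):
    """Yield book IDs from the catalogue CSV, optionally filtered by language."""
    with urllib.request.urlopen(CATALOG_CSV) as resp:
        reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
        for row in reader:
            if languages and row.get("Language") not in languages:
                continue
            yield int(row["Text#"])

def rdf_url(book_id):
    """Per-book RDF record, mirroring the cache/epub layout."""
    return f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.rdf"

if __name__ == "__main__":
    ids = list(list_book_ids(languages={"fr"}))
    print(f"{len(ids)} books; first RDF: {rdf_url(ids[0])}")
```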
The CSV file does include the list of languages, which is great, but it doesn't include the list of formats, so it would not be sufficient to run all the filters (languages, formats, book_ids). It would still work (you'd filter once you've parsed the RDF), but you wouldn't know in advance which books you'd need to process. One thing that could be done is to replace the indiscriminate extraction and parsing of all RDF files; this takes a lot of time because it has to go through the filesystem. That would be a lot faster on small selections and probably still faster on large ones, by bundling two consecutive tasks together. I don't see how the rsync step can be replaced though, and this is the longest one for me (40m today), because we use it to match filenames.
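One possible shape for that bundling: read only the selected RDF records straight out of the tarball instead of untarring everything to disk first. Member paths inside the tar are assumed to follow cache/epub/&lt;id&gt;/pg&lt;id&gt;.rdf; this is a sketch, not the scraper's code:

```python
# Sketch: stream the rdf-files tarball and yield only the RDF records we care
# about, skipping the full extraction to the filesystem. Member paths are an
# assumption about the tar layout.
import tarfile

def iter_selected_rdf(tar_path, wanted_ids):
    wanted = {f"cache/epub/{i}/pg{i}.rdf" for i in wanted_ids}
    with tarfile.open(tar_path, "r:bz2") as tar:
        for member in tar:
            if member.name in wanted:
                yield member.name, tar.extractfile(member).read()

for name, rdf_bytes in iter_selected_rdf("rdf-files.tar.bz2", {1088, 1342}):
    print(name, len(rdf_bytes), "bytes")
```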
It's the mirror, and that's the recommended way to do it.
www.gutenberg.org has load balancers and security hardware in front of it that aren't very friendly to scrapers - it's architected for large numbers of users. dante just has bandwidth.
This makes a lot of sense to me.
I'm not sure I get this. Why do you have to match filenames? Do you mean that the URL on dante is not the same as the one on www.gutenberg.org, which is the URL we get from the RDF?
I'm not sure exactly, but I believe there are many files referenced in the RDF that are not actually present on the server, and testing online URLs each time was too slow, inefficient and wasteful. Not sure how linked this is to the fact that we were using different mirrors in the past. Anyway, now that the scraper is in better shape, I'd advise you to test whether we still need that rsync step or not. Should be fairly easy: loop through the file entries in all of the RDF files and check whether those URLs are present in the rsync listing. If none is missing, it would mean we can trust the RDF files.
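A minimal sketch of that check, comparing file URLs collected from the RDF records against the relative paths returned by rsync; function and variable names here are illustrative only:

```python
# Sketch of the suggested consistency check: which RDF file URLs have no
# matching relative path in the rsync listing? Inputs are plain iterables of
# URL / path strings.
from urllib.parse import urlparse

def missing_from_rsync(rdf_urls, rsync_paths):
    known = {p.lstrip("./") for p in rsync_paths}
    missing = []
    for url in rdf_urls:
        path = urlparse(url).path.lstrip("/")
        if path not in known:
            missing.append(url)
    return missing
```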
If you have examples of missing files, I can check for you what the issue is.
Actually, if such information is welcome on the PG side, it would be better IMO to share errors with PG so they can fix them, rather than inventing solutions to circumvent them.
Without seeing examples, it's hard to know where the problem lies.
I will perform a full comparison of the data sources we use and let you know if I find something unexpected, so we can decide collectively what to do. Probably this week or next.
I may finally have gained a bit more understanding of why we need to rsync URLs from dante.pglaf.org. Here is sample data for book ID 1088:

- Mime type text/html: URL in RDF is https://www.gutenberg.org/files/1088/1088-h.zip; file downloaded from mirror is http://dante.pglaf.org/1/0/8/1088/1088-h.zip
- Mime type application/epub+zip: URL in RDF is https://www.gutenberg.org/ebooks/1088.epub.images; file downloaded from mirror is http://dante.pglaf.org/cache/epub/1088/pg1088.epub

We see that the URL used on dante.pglaf.org is absolutely not the same relative path as the one mentioned in the RDF. While on www.gutenberg.org we can use both path structures (/files/1088/1088-h.zip or /1/0/8/1088/1088-h.zip), only the second one works on dante.pglaf.org. We also see that even the filename is very different for the epub.

I also had a look at the RDF file present at http://dante.pglaf.org/cache/epub/1088/pg1088.rdf, but it also only mentions www.gutenberg.org URLs. And there is almost the same issue for the cover image, where multiple resolutions are mentioned in the RDF but only one is available on dante.pglaf.org, with a different filename.

I'm a bit lost regarding what we should do next if we want to simplify scraping further / get rid of the rsync. @eshellman, are you aware of all this? Any suggestion?
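For illustration, a sketch of the kind of canonical-to-mirror translation implied by the 1088 example above; this is not the module mentioned later in the thread, and the rule for single-digit IDs is an assumption:

```python
# Sketch of the canonical-to-mirror translation illustrated above:
# /files/1088/1088-h.zip -> /1/0/8/1088/1088-h.zip on dante.pglaf.org.
# Not the actual translation module; the single-digit-ID case is assumed.
from urllib.parse import urlparse

MIRROR = "http://dante.pglaf.org"

def archive_dir(book_id):
    s = str(book_id)
    if book_id < 10:
        return f"0/{s}"                # assumed layout for IDs below 10
    return "/".join(s[:-1]) + f"/{s}"  # e.g. 1088 -> 1/0/8/1088

def files_url_to_mirror(url):
    # Only handles the /files/<id>/<name> pattern; /ebooks/<id>.epub.images
    # style URLs map to cache/epub/<id>/pg<id>.epub instead, as shown above.
    parts = urlparse(url).path.strip("/").split("/")
    if parts[0] != "files":
        raise ValueError("only /files/ URLs handled in this sketch")
    book_id, name = int(parts[1]), parts[-1]
    return f"{MIRROR}/{archive_dir(book_id)}/{name}"

print(files_url_to_mirror("https://www.gutenberg.org/files/1088/1088-h.zip"))
# -> http://dante.pglaf.org/1/0/8/1088/1088-h.zip
```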
We could update the RDF (it hasn't been touched in 5 years), or I could contribute some code that does the translation, or both. The first case is a lack of symlinks, the second is an Apache rewrite directive.
I looked at the code generating the RDF, and I could just delete all the code that does the file-url transformation! Will ask around to see if there's any reason not to.
Either or both changes would be more than welcome! Thank you, keep us informed.
I have created a small module bringing together the relevant code, with a method that will turn the canonical URLs into archive URLs. I've identified 3 files (out of 1.2 million) in a legacy format that need fixing from our side.
Didn't make sense to mess with the RDF.
Great news! We'll test and integrate it.
Looks very promising, thank you very much! I've integrated it manually (i.e. copy/pasted the code until you release it) and modified the DB schema to store more info for debugging. I will perform a full run of the parser to compare what we have downloaded with URLs guessed from rsync against the URLs available in the RDF as translated by your lib. I'll let you know. A first try on a single book is OK (i.e. we could have used the RDF URLs directly).
I did a full run, looking for all formats (epub, html, pdf) and all languages. Only 59251 books have at least one of those formats. I'm downloading a full archive to confirm this number is OK, but it looks like it is. By the way, is there any plan to support the additional formats (text/plain + audio books)? Or any reason not to add them?

For most books, the URLs we download for epub, html, pdf and the cover image (which we currently infer from rsync results + filename patterns) are already present in the RDF files and are identical once converted through the small module mentioned above. The only exceptions are below. @eshellman, could you have a look and confirm whether this should / could be fixed on your side?
Anyway, this is probably a significant first confirmation that your module works fine and that we can get rid of the rsync step, and maybe of other complexities used to guess the appropriate filename for the various formats. I will continue to explore this by looking at how to select the appropriate file from the RDF for the three formats we currently support.
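As a starting point, a sketch of what that per-format selection could look like; the suffix preference ordering is an assumption, not the scraper's current rules:

```python
# Sketch: pick one downloadable file per supported format from a book's RDF
# file URLs. Suffix preferences are an assumption, not the scraper's logic.
PREFERENCES = {
    "epub": [".epub.images", ".epub.noimages", ".epub"],
    "html": ["-h.zip", ".html.images", ".html"],
    "pdf": [".pdf"],
}

def pick_files(rdf_urls):
    chosen = {}
    for fmt, suffixes in PREFERENCES.items():
        for suffix in suffixes:
            match = next((u for u in rdf_urls if u.endswith(suffix)), None)
            if match:
                chosen[fmt] = match
                break
    return chosen

print(pick_files([
    "https://www.gutenberg.org/files/1088/1088-h.zip",
    "https://www.gutenberg.org/ebooks/1088.epub.images",
]))
# -> picks the html zip and the .epub.images variant
```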
The number 59251 is not OK; it looks like I'm missing some URLs from rsync... I might have messed up with the messy data. I will run it once more with the whole rsync step to confirm; I might have missed some records.
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
I've updated this issue title to better reflect the current state of the discussion here.
Gutenberg scraping is pretty touchy and takes time to run. We should see if it can be simplified.
A start of this discussion happened here: #93 (comment)