Enable user to use .export for PDF download #87

dev-89 · 2021-11-23T00:31:18Z

Motivation

The arxiv library queries papers through the export.arxiv.org subdomain, but downloads the PDFs directly from arxiv.org. As a result, a user who downloads too many papers can get blocked by arXiv.

Solution

A solution would be to modify the paper's PDF URL to point to the corresponding export subdomain. In my own code I simply use:

idx = paper.pdf_url.index('arxiv')
paper.pdf_url = paper.pdf_url[:idx] + 'export.' + paper.pdf_url[idx:]

where paper is a Result instance. This solution is incomplete, though, since the export subdomain is not guaranteed to exist; that would need to be checked. I would add this functionality to the _get_pdf_url method. A boolean flag user_export could be introduced for users who wish to download directly from arxiv.org, even though that is not advised according to the "Play Nice" section of https://arxiv.org/help/bulk_data.
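A slightly more defensive variant of the snippet above, as a minimal sketch: it parses the URL rather than searching for the substring 'arxiv', so an address that already points at the export host is left alone. The helper name to_export_url and the use_export flag are hypothetical and not part of the arxiv library.

```python
from urllib.parse import urlsplit, urlunsplit


def to_export_url(pdf_url: str, use_export: bool = True) -> str:
    """Return pdf_url rewritten to the export.arxiv.org host.

    Hypothetical helper: neither this function nor the use_export
    flag exists in the arxiv library.
    """
    if not use_export:
        return pdf_url
    parts = urlsplit(pdf_url)
    # Only rewrite the main site; leave export.arxiv.org and any
    # unrelated host untouched.
    if parts.hostname not in ("arxiv.org", "www.arxiv.org"):
        return pdf_url
    return urlunsplit(parts._replace(netloc="export.arxiv.org"))


print(to_export_url("http://arxiv.org/pdf/2101.00001v1"))
```

Passing use_export=False returns the URL unchanged, which matches the opt-out behaviour proposed above.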

lukasschwab · 2021-11-25T11:14:02Z

Out of curiosity, did you run into rate-limiting yourself? Do you know when it kicked in (roughly)?

There's an export.arxiv.org record for every result from the API, so it should be safe to add the export subdomain before downloading, but it might be best to manage this with an optional flag in the download_pdf/download_source arguments.

We also need to confirm the download behavior when a PDF does not already exist for the export.arxiv.org record. In the browser, there's an intermediate "we're generating this PDF from source" page, then a redirect to the PDF once it's generated.

These cases must be handled gracefully.
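One way to handle the intermediate page, as a rough sketch only: retry until the response body actually looks like a PDF. It assumes the "generating" page is not itself a PDF and so can be told apart by the %PDF- header that PDF files start with; the real behavior of export.arxiv.org outside the browser still needs to be confirmed. The function name fetch_until_pdf and its parameters are hypothetical.

```python
import time
from typing import Callable

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def fetch_until_pdf(
    fetch: Callable[[], bytes],
    attempts: int = 5,
    wait_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() until it returns PDF bytes or attempts run out.

    fetch is any zero-argument callable returning the response body;
    it is injected so the retry logic can run without network access.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        body = fetch()
        if body.startswith(PDF_MAGIC):
            return body
        # Not a PDF yet (assumed: the intermediate page); wait, then retry.
        if attempt < attempts - 1:
            sleep(wait_seconds)
    raise RuntimeError(f"no PDF received after {attempts} attempts")
```

In real use, fetch could wrap something like urllib.request.urlopen(url).read(); failing with an error after a bounded number of attempts avoids saving an HTML page to disk under a .pdf name.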

brandonrobertz · 2022-10-18T15:57:11Z

I honestly think this library should default to using export.arxiv.org for everything, with an optional flag to use the non-robots allowed live site. First thing I did using this library was accidentally fetch a query that got me blocked from using arXiv for several hours. I bet a lot of users run into this, given the default values (default page size of 300000, for example, is enough to get one blocked).

lukasschwab · 2022-10-18T16:43:04Z

@brandonrobertz this library does use export.arxiv.org for everything except download URLs:

arxiv.py/arxiv/arxiv.py

Line 513 in 678ba9f

query_url_format = 'http://export.arxiv.org/api/query?{}'

The difference is that it receives download URLs from the API instead of building them.

Digression: let's chat limits.

"default page size of 300000, for example, is enough to get one blocked"

The default (Client).page_size is 100.

If you're interpreting the max_results limit in README.md, max_results isn't a page size; it's the maximum number of results across all pages for a search. If (Search).max_results = 300000 and (Client).page_size = 100, the client will make up to 3000 requests (iff there are ≥300,000 results available).

Maybe there should be a lower default.
Maybe there's a bug in the client code around delay_seconds. That delay between requests is meant to appease arXiv's rate limits, even for large queries.
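The request arithmetic above, as a small worked sketch. The function names are hypothetical, and the 3-second delay is only an example value for delay_seconds, not a statement about the client's default.

```python
def estimate_requests(max_results: int, page_size: int) -> int:
    """Upper bound on API requests needed to page through max_results."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Integer ceiling division: a partial last page still costs one request.
    return (max_results + page_size - 1) // page_size


def minimum_wait_seconds(requests: int, delay_seconds: float) -> float:
    """Total enforced delay, assuming one pause between consecutive requests."""
    return max(requests - 1, 0) * delay_seconds


requests = estimate_requests(300_000, 100)
print(requests)                             # 3000
print(minimum_wait_seconds(requests, 3.0))  # 8997.0, roughly 2.5 hours
```

So even when the delay works as intended, a search that really returns 300,000 results keeps the client sending requests for hours.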

Did you call (Result).download_pdf or (Result).download_source 300,000 times? If no, mind opening a separate issue to discuss your use case?

brandonrobertz · 2022-10-18T17:11:38Z

Interesting, sorry about the bad assumption, I didn't realize this used the export site. That's even more perplexing, then. And no I didn't call download_pdf 300k times. I got 403 after attempting to do results = arxiv.Search(query="cat:cs.LG").results()

I can open a separate PR.

lukasschwab · 2022-10-18T19:54:58Z

@brandonrobertz No worries! Happy to advise.

dev-89 added the enhancement label (Requests for new features or improvements.) Nov 23, 2021

dev-89 assigned lukasschwab Nov 23, 2021
