Investigating: arXiv API flakiness #129

lukasschwab · 2023-10-15T23:31:17Z

Description

A clear and concise description of what the bug is.

The arXiv API seems to be degraded. I expect to see more bug reports about this until the underlying issue is resolved.

Behavior identified in #43 seems to have intensified or changed in character (e.g. increased clustering, such that retries are more likely to re-fail, perhaps because of cached bad responses).

Why can't you fix the API?
: I'm not affiliated with arXiv — I maintain a wrapper library for an API I don't administer. I've written the arxiv-api Google Group about this issue.

Why aren't you merging bug fixes?
: Some of the proposed changes here (e.g. consolidating on HTTPS, pinning a specific feedparser version, etc.) are probably good changes regardless of the API's stability. I'm hesitant to rush merging and releasing changes without having a strong sense, through integration tests, that they don't damage this library's behavior. That judgment is subject to change, esp. if this issue persists.

Steps to reproduce

Steps to reproduce the behavior; ideally, include a code snippet.

Differing results between HTTPS and HTTP calls.
Unexpectedly empty pages of results, per #43 (Unreliable results: pages from API are unexpectedly empty).
This is reflected in integration test CI instability, both on master and on various feature branches.
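One way to flag the "unexpectedly empty page" symptom is to compare the number of entries in a response with its paging metadata. This is a minimal sketch, not how this library detects the problem. It assumes the degraded response is still a well-formed Atom feed that carries OpenSearch `totalResults` and `startIndex` elements. The helper name `is_unexpectedly_empty` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Atom feeds that carry OpenSearch paging metadata.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "opensearch": "http://a9.com/-/spec/opensearch/1.1/",
}


def is_unexpectedly_empty(body: bytes) -> bool:
    """Return True if the page has no entries although more results should exist.

    Assumption: the feed reports totalResults and startIndex; if either is
    missing it is treated as 0, so the page is not flagged.
    """
    feed = ET.fromstring(body)
    entries = feed.findall("atom:entry", NS)
    total = int(feed.findtext("opensearch:totalResults", default="0", namespaces=NS))
    start = int(feed.findtext("opensearch:startIndex", default="0", namespaces=NS))
    # An empty page is only suspicious when the offset is still inside the result set.
    return len(entries) == 0 and start < total
```

A page requested past the end of the result set is legitimately empty, so it is not flagged.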

Versions

python version: independent.

arxiv.py version: 1.*.*.

Additional context

Add any other context about the problem here.

PRs directly addressing the instability:


lukasschwab · 2023-10-16T03:01:09Z

Good example of flakiness between identical versions/protocols: #132 (comment)

liyucheng09 · 2023-10-16T13:41:15Z

Good diagnosis for this issue. I guess there is not too much we can do unless they fix the backend.

liyucheng09 · 2023-10-16T16:32:36Z

BTW, I found that arXiv treats requests from programmatic clients differently from requests from real browsers. I suspect this flakiness is on purpose.

lukasschwab · 2023-10-16T16:45:07Z

@liyucheng09 can you share any details on that investigation? In #127 I tried tweaking the user-agent.

liyucheng09 · 2023-10-16T17:41:30Z

I made about 300 attempts per hour today, more than 3000 in total. 0 out of 3000 succeeded.
After passing a user-agent to feedparser, 28 out of 100 succeeded.
I suppose we could safely say arxiv is declining requests from programmatic clients.

Ar4ikov · 2023-10-17T00:04:43Z

Hello! feedparser, which the arxiv library depends on, contains this... I really can't describe my emotions when I saw it for the first time. (feedparser/__init__.py)

USER_AGENT = "feedparser/%s +https://github.com/kurtmckee/feedparser/" % __version__

Does the developer find this funny?

Instead of using a normally working application, I had to cp -r /path/to/site-packages/feedparser /path/to/my-project-dir/, change USER_AGENT to my real one, and finally the arXiv API works 100 times out of 100!

It would be much, MUCH better if feedparser used something like this:

from os import environ

# <...>
USER_AGENT = environ.get('PYTHON_FEEDPARSER_USER_AGENT', "feedparser/%s +https://github.com/kurtmckee/feedparser/" % __version__)  # Thanks for the joke: I threw away two days getting my project, which uses langchain and ArXiVLoader, to run.

lukasschwab · 2023-10-17T00:40:56Z

@Ar4ikov I believe all currently-released versions of feedparser support specifying the User-Agent header through a named parameter (agent) to feedparser.parse, but — to your point — this package neither overrides the default nor exposes a way to set it.

I think the most robust change is to make the HTTP calls from arxiv (e.g. with requests), then pass the body to feedparser for parsing.

Nonetheless, my testing hasn't shown that updating the user agent makes the tests pass 100% of the time. Still searching, but I'll investigate this angle more.
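Making the HTTP call in the wrapper and handing only the body to the parser could look roughly like the sketch below. It assumes the standard library's `urllib.request` as a stand-in for `requests`. The helpers `build_request` and `fetch_body`, the default user agent string, and the `opener` parameter are hypothetical names for illustration, not part of arxiv.py or feedparser. The endpoint URL is also an assumption about the arXiv query API.

```python
import urllib.request
from urllib.parse import urlencode

# Assumed arXiv query endpoint; adjust if the API location differs.
API_URL = "https://export.arxiv.org/api/query"


def build_request(search_query, start=0, max_results=10,
                  user_agent="my-app/0.1 (hypothetical)"):
    """Build a GET request that sends an explicit User-Agent header."""
    params = urlencode({
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    })
    return urllib.request.Request(
        f"{API_URL}?{params}",
        headers={"User-Agent": user_agent},
    )


def fetch_body(request, timeout=10.0, opener=urllib.request.urlopen):
    """Perform the HTTP call and return the raw response body as bytes.

    `opener` is injectable so the network call can be replaced in tests.
    """
    with opener(request, timeout=timeout) as response:
        return response.read()
```

The bytes returned by `fetch_body` can then be passed to `feedparser.parse`, which accepts a document body as well as a URL. That leaves headers, timeouts, and retries under the wrapper's control rather than feedparser's.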

Update: I published the major version release.

If you find any issues with the new version unrelated to the API instability, please open separate issues for those! I rolled this release in a hurry.

lukasschwab · 2023-10-18T19:10:39Z

The API seems much more stable now than it was over the weekend. CI is consistently succeeding locally.

I'm going to close this issue for the time being. I'll reopen it in the future if I see similar instability (increased rate of unexpectedly empty first pages, ConnectionReset errors).
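Callers who hit these symptoms sometimes wrap page fetches in a bounded retry. The sketch below is one hedged way to do that and is not this library's implementation. `fetch_with_retries`, `EmptyPageError`, and the `fetch_page` callable are hypothetical names. It treats an empty page as a failure, which is only appropriate when results are expected. As noted at the top of this issue, failures may be clustered, so retries are not guaranteed to help.

```python
import time


class EmptyPageError(Exception):
    """Hypothetical error: a page came back with no entries."""


def fetch_with_retries(fetch_page, max_attempts=3, delay_seconds=3.0,
                       sleep=time.sleep):
    """Call fetch_page() until it returns a non-empty list or attempts run out.

    Retries on an empty result or a ConnectionResetError, waiting
    delay_seconds * attempt between tries (linear backoff).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            entries = fetch_page()
            if entries:
                return entries
            last_error = EmptyPageError(f"attempt {attempt}: empty page")
        except ConnectionResetError as err:
            last_error = err
        if attempt < max_attempts:
            sleep(delay_seconds * attempt)
    raise last_error
```

Keeping `max_attempts` small and the delay at several seconds avoids adding load to an API that is already struggling.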

jaypantone · 2024-03-19T15:55:55Z

I know this is closed, but I just wanted to add that over the last week or two I have started to experience this issue. The API calls occasionally return empty results erroneously.

lukasschwab · 2024-03-19T16:28:01Z

@jaypantone yeah, lots of inbound issues about this. I don't work for arXiv, so I can't effect a change there directly.

Don't overload them with requests, but you might consider describing your issue on the arXiv mailing list:

The thread I linked from the initial message of this issue (#129): https://groups.google.com/g/arxiv-api/c/DYHxWrtBgbo
This more recent thread, which I assume is about the ongoing incident: https://groups.google.com/g/arxiv-api/c/cIc8LYsQY20

I've pinned this issue in the hopes that more people find it rather than creating new ones.

lukasschwab added labels Oct 15, 2023: bug (Deviations from documented behavior.), api (Issues that correspond to arXiv API behavior rather than behavior introduced by this wrapper.)

lukasschwab self-assigned this Oct 15, 2023

lukasschwab pinned this issue Oct 16, 2023

This was referenced Oct 16, 2023

Standardize on HTTPS over HTTP #131

Merged

Improve logging: standard formatting, verbose tests #132

Merged

lukasschwab mentioned this issue Oct 16, 2023

Extreme noob encountering error while fetching arxiv api #134

Closed

This was referenced Oct 17, 2023

Pin feedparser==6.0.6 #135

Merged

Separate HTTP request from feedparser.parse #136

Merged

Bump version to 2.0.0 #139

Merged

Unreliable results: pages from API are unexpectedly empty #43

Closed

lukasschwab closed this as completed Oct 18, 2023

lukasschwab unpinned this issue Oct 25, 2023

lukasschwab mentioned this issue Dec 21, 2023

arxiv.Search returns empty result #119

Closed

This was referenced Mar 18, 2024

Empty response #157

Closed

Inquiry about setting options for HTTP/HTTPS #156

Closed

Empty Response after few requests #158

Closed

lukasschwab pinned this issue Mar 19, 2024

bilalazh mentioned this issue Apr 16, 2024

bulk all download bilalazh/Arxiv-Research-Pooler#1

Closed

timsanders256 mentioned this issue May 17, 2024

[BUG]: 'arxiv' search result is unexpectedly empty ulab-uiuc/research-town#56

Closed
