Rate limiting isn't actually implemented correctly #80

glennpow · 2020-09-06T17:47:08Z

I was getting rate-limit errors, even though I'd specified a rate_limit_per_minute arg of 1, so I looked into the RateLimitCache class. I see that the new() method is actually never getting called, so this class cannot be performing as expected. I added the following two lines to the bottom of the _impose_rate_limit method, and it seems to (temporarily) fix the problem.

        if  hasattr(self, '_rlcache'):
            self._rlcache.new()

However, this is not even the proper solution. The rate-limit will still kick in, since the API is actually limited to 1 request/sec (as opposed to 60 requests per 60 seconds). This implementation will send a few requests as fast as possible, then error out. The correct implementation will limit the number of requests per second, rather than per minute.

The text was updated successfully, but these errors were encountered:

glennpow · 2020-09-06T21:37:32Z

I do have a PR to fix this now: #81

dmarx · 2020-10-02T18:51:02Z

Interesting... I'm not opposed to changing the rate limit system, but I think this needs consultation/coordination with Pushshift (i.e. Jason).

The rate limit isn't actually a fixed value. Sometimes Jason throttles requests because the server is being DDOSed. Sometimes he grants specific users higher priority access. To solve for this, PSAW hits the /meta endpoint on initialization to ask pushshift directly how it should handle throttling. Here's the response I'm seeing right now:

{
    "client_accepts_json": true,
    "client_request_headers": {
        "ACCEPT": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "ACCEPT-ENCODING": "gzip",
        "ACCEPT-LANGUAGE": "en-US,en;q=0.9",
        "CDN-LOOP": "cloudflare",
        "CF-CONNECTING-IP": "67.183.220.19",
        "CF-IPCOUNTRY": "US",
        "CF-RAY": "5dc08b0acded02b8-SEA",
        "CF-REQUEST-ID": "058c353abc000002b8b5888200000001",
        "CF-VISITOR": "{\"scheme\":\"https\"}",
        "CONNECTION": "close",
        "HOST": "api.pushshift.io",
        "SEC-FETCH-DEST": "document",
        "SEC-FETCH-MODE": "navigate",
        "SEC-FETCH-SITE": "none",
        "UPGRADE-INSECURE-REQUESTS": "1",
        "USER-AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
        "X-FORWARDED-FOR": "67.183.220.19, 67.183.220.19",
        "X-FORWARDED-PROTO": "https"
    },
    "client_user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "server_ratelimit_per_minute": 120,
    "source-ip": "67.183.220.19"
}

In particular, take note of the attribute server_ratelimit_per_minute.

PSAW sets it's rate limiting on a per-minute basis because according to the meta endpoint, so does pushshift. It sounds like there may be an additional per-second or per-k-seconds rate limit we should respect. If this is the case, I think the best way forward would be to add the per-second limit to the /meta endpoint so PSAW doesn't need to make assumption about what it's supposed to be.

That's interesting that you're observing a 60 requests/minute limit given that the server is still reporting 120. I wonder if maybe Jason changed the throttling without modifying the value reported by the /meta endpoint. If so, I'd recommend that he tie those things together so he doesn't need to manually change both separately.

@pushshift: thoughts?

glennpow · 2020-10-06T16:26:43Z

@dmarx Thanks for the response. Actually, I wasn't even able to use 60 req/min without getting a rate limit error. I'm currently using 30. I do agree that it would be ideal to use the meta endpoint to set this, but clearly the value of 120/min already doesn't match with what was posted publicly as 1/sec in this thread: https://www.reddit.com/r/pushshift/comments/g7125k/in_an_effort_to_relieve_some_of_the_strain_on_the/
As long as the meta endpoint is accurate, this lib could limit per minute, but I'm not sure how reliable that metadata is.
As far as removing the n parameter to RateLimitCache (and only having interval), you can currently achieve the same effect of something like (n=2, interval=1) by specifying just (interval=0.5), which is what would happen if you specified rate_limit_per_minute=120 to PushshiftAPI (which is also what it would use now by default if pulled from /meta).

dmarx mentioned this issue Oct 2, 2020

Refactored RateLimitCache to work on a per second basis, rather than per minute #81

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rate limiting isn't actually implemented correctly #80

Rate limiting isn't actually implemented correctly #80

glennpow commented Sep 6, 2020 •

edited

Loading

glennpow commented Sep 6, 2020

dmarx commented Oct 2, 2020 •

edited

Loading

glennpow commented Oct 6, 2020

Rate limiting isn't actually implemented correctly #80

Rate limiting isn't actually implemented correctly #80

Comments

glennpow commented Sep 6, 2020 • edited Loading

glennpow commented Sep 6, 2020

dmarx commented Oct 2, 2020 • edited Loading

glennpow commented Oct 6, 2020

glennpow commented Sep 6, 2020 •

edited

Loading

dmarx commented Oct 2, 2020 •

edited

Loading