
Cortex should tell remote write clients to slow down on rate-limiting #2037

Open
bboreham opened this issue Jan 27, 2020 · 7 comments
Labels
keepalive Skipped by stale bot


@bboreham
Contributor

Currently, when the ingest rate limit is hit for a tenant, Cortex goes immediately from accepting all data to dropping all data. The dropped data is lost permanently because the mechanism is to return a 500 error.

I propose Cortex should have an in-between state where the return value indicates that the sender should slow down, but does not cause the data to be dropped. For instance, it could return 429; current Prometheus will retry immediately, but it could be enhanced to slow down. Further, Cortex could return an indication of how close to the limit the tenant is.

This idea is not original - I have seen similar in DynamoDB, CosmosDB and GitHub.

I think this would solve #837, probably better than #879.

@csmarchbanks
Contributor

I believe current Prometheus only retries on 5XX errors, though perhaps it should also retry 429.
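The retry decision being discussed can be sketched as below. This is not Prometheus's actual code, just the shape of the change: keep 5xx retryable as today, and additionally treat 429 as retryable.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryableError sketches the proposed client-side check: 5xx errors
// are retryable (as Prometheus does today), and 429 would be added so
// rate-limited samples are resent later instead of being dropped.
func retryableError(statusCode int) bool {
	return statusCode/100 == 5 || statusCode == http.StatusTooManyRequests
}

func main() {
	fmt.Println(retryableError(500), retryableError(429), retryableError(400)) // true true false
}
```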

@stale

stale bot commented Mar 27, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 27, 2020
@gouthamve gouthamve added keepalive Skipped by stale bot and removed stale labels Mar 27, 2020
@ranton256
Contributor

I have been thinking about this and asked around for other ideas. After looking at the RFC that covers 429, I realized it has an optional “Retry-After” header, so that seems like at least one avenue we should consider.

The code that decides which errors are retryable was added in https://github.com/prometheus/prometheus/pull/2552/files and now lives at https://github.com/prometheus/prometheus/blob/4e5b1722b342948a55b3d7753f6539040db0e5f0/storage/remote/client.go#L200 .
It only treats 5xx as retryable. I think we should consider changing it to look for the "Retry-After" header, and set that header from Cortex (or other remote-write storage providers) so we could give the client a hint about when to retry without having to send a status code less appropriate than 429.

We could also consider something like the X-RateLimit headers: https://tools.ietf.org/id/draft-polli-ratelimit-headers-00.html

@bboreham
Contributor Author

Agreed. A couple more points at #3654 (comment)

I couldn't see an existing issue in the Prometheus repo, so I opened prometheus/prometheus#8418

@csmarchbanks
Contributor

csmarchbanks commented Jan 28, 2021

I left the link in the issue Bryan opened as well, but there is an open PR for adding the Retry-After functionality to Prometheus that @Harkishen-Singh has been working on: prometheus/prometheus#8237.

@bboreham
Contributor Author

bboreham commented Mar 1, 2021

The necessary part is now implemented in Prometheus prometheus/prometheus#8237
(and made optional in prometheus/prometheus#8477).

I'll leave this open until Cortex sends the Retry-After header to make the client slow down by an appropriate amount.

@mvadu

mvadu commented Jul 19, 2022

Is this still on the roadmap?
