-
Notifications
You must be signed in to change notification settings - Fork 805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cortex should tell remote write clients to slow down on rate-limiting #2037
Comments
I believe current Prometheus only retries on 5XX errors, though perhaps it should also retry 429. |
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions. |
I have been doing some thinking about this and asked around for some other ideas, then after looking at the RFC that covers 429, I realize it actually has an optional header for “Retry-After” so that seems like at least one avenue we should consider. The code to decide what errors are retryable was added in https://github.com/prometheus/prometheus/pull/2552/files and now is at https://github.com/prometheus/prometheus/blob/4e5b1722b342948a55b3d7753f6539040db0e5f0/storage/remote/client.go#L200 . We could also consider something like X-RateLimit, https://tools.ietf.org/id/draft-polli-ratelimit-headers-00.html |
Agreed. Couple more points at #3654 (comment) I couldn't see an existing issue in the Prometheus repo so opened prometheus/prometheus#8418 |
I left the link in the issue Bryan opened as well, but there is an open PR for adding the Retry-After functionality to Prometheus that @Harkishen-Singh has been working on: prometheus/prometheus#8237. |
The necessary part is now implemented in Prometheus prometheus/prometheus#8237 I'll leave this open until Cortex sends the |
Is this still in roadmap? |
Currently, when the ingest rate limit is hit for a tenant, Cortex goes immediately from accepting all data to dropping all data. The dropped data is lost permanently because the mechanism is to return a 500 error.
I propose Cortex should have an inbetween state where the return value indicates that the sender should slow down, but does not cause the data to be dropped. For instance it could return 429 - current Prometheus will retry immediately, but it could be enhanced to slow down. Further, Cortex could return an indication of how close to the limit the tenant is.
This idea is not original - I have seen similar in DynamoDB, CosmosDB and GitHub.
I think this would solve #837, probably better than #879.
The text was updated successfully, but these errors were encountered: