
Cortex should tell remote write clients to slow down on rate-limiting #2037

Open
bboreham opened this issue Jan 27, 2020 · 7 comments
Labels
keepalive Skipped by stale bot


@bboreham
Contributor

Currently, when the ingest rate limit is hit for a tenant, Cortex goes immediately from accepting all data to dropping all data. The dropped data is lost permanently because the mechanism is to return a 500 error.

I propose Cortex should have an in-between state where the return value indicates that the sender should slow down, but does not cause the data to be dropped. For instance, it could return 429; current Prometheus will retry immediately, but it could be enhanced to slow down. Further, Cortex could return an indication of how close to the limit the tenant is.

This idea is not original - I have seen similar in DynamoDB, CosmosDB and GitHub.

I think this would solve #837, probably better than #879.

@csmarchbanks
Contributor

I believe current Prometheus only retries on 5XX errors, though perhaps it should also retry 429.
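The retry decision being discussed can be sketched as below. This is not Prometheus's actual code, just the shape of the change: keep 5xx retryable as today, and additionally treat 429 as retryable.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryableError sketches the proposed client-side check: 5xx errors
// are retryable (as Prometheus does today), and 429 would be added so
// rate-limited samples are resent later instead of being dropped.
func retryableError(statusCode int) bool {
	return statusCode/100 == 5 || statusCode == http.StatusTooManyRequests
}

func main() {
	fmt.Println(retryableError(500), retryableError(429), retryableError(400)) // true true false
}
```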

@stale

stale bot commented Mar 27, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 27, 2020
@gouthamve gouthamve added keepalive Skipped by stale bot and removed stale labels Mar 27, 2020
@ranton256
Contributor

I have been thinking about this and asked around for other ideas. After looking at the RFC that covers 429, I realized it has an optional “Retry-After” header, so that seems like at least one avenue we should consider.

The code that decides which errors are retryable was added in https://github.com/prometheus/prometheus/pull/2552/files and now lives at https://github.com/prometheus/prometheus/blob/4e5b1722b342948a55b3d7753f6539040db0e5f0/storage/remote/client.go#L200 .
It only treats 5xx as retryable. I think we should consider changing it to look for the "Retry-After" header, and set that header from Cortex (or other remote-write storage providers) so we could give the client a hint about when to retry without having to send a status code less appropriate than 429.

We could also consider something like the X-RateLimit headers: https://tools.ietf.org/id/draft-polli-ratelimit-headers-00.html

@bboreham
Contributor Author

Agreed. A couple more points at #3654 (comment)

I couldn't see an existing issue in the Prometheus repo, so I opened prometheus/prometheus#8418

@csmarchbanks
Contributor

csmarchbanks commented Jan 28, 2021

I left the link in the issue Bryan opened as well, but there is an open PR for adding the Retry-After functionality to Prometheus that @Harkishen-Singh has been working on: prometheus/prometheus#8237.

@bboreham
Contributor Author

bboreham commented Mar 1, 2021

The necessary part is now implemented in Prometheus prometheus/prometheus#8237
(and made optional in prometheus/prometheus#8477).

I'll leave this open until Cortex sends the Retry-After header to make the client slow down by an appropriate amount.

@mvadu

mvadu commented Jul 19, 2022

Is this still on the roadmap?
