
Distributor ingestion rate limit increased for retries due to ingestion failure #3804

Closed
pracucci opened this issue Feb 9, 2021 · 2 comments · Fixed by #3825

Comments

@pracucci
Contributor

pracucci commented Feb 9, 2021

The distributor's ingestion rate limiter consumes tokens as soon as a request is received, before writing to ingesters:

if !d.ingestionRateLimiter.AllowN(now, userID, totalN) {

In the event of an ingester outage (e.g. 2+ ingesters are unavailable), this means that each tenant remote-write request will consume tokens from its rate limiter even if the samples have not been successfully ingested. The client (e.g. Prometheus) will retry the writes, further consuming tokens from the rate limiter, until it eventually hits the rate limit, regardless of whether any samples have actually been ingested.

The burst should protect against this, but in the event of a relatively long outage we would end up consuming the burst too (e.g. we set the burst to 10x the rate limit).

I'm wondering if a better approach would be to check whether enough tokens are still available in the rate limiter when the request is received, but to actually consume them only after the samples have been successfully written to ingesters. Due to concurrency, the actual accepted rate could be higher than the limit, but we would err in favour of the customer instead of rate limiting writes we haven't actually ingested.

Related discussions:

@bboreham
Contributor

bboreham commented Feb 9, 2021

I think the rate-limiter package has a solution for this.

Instead of calling AllowN(), call ReserveN() and check .OK() to see if the request is within the rate limit.
Then, if the operation fails before ingestion, call Cancel() on the Reservation.

@stevesg
Contributor

stevesg commented Feb 15, 2021

It appears that ReserveN+OK is not directly equivalent to AllowN, but it is close enough that we can use it. AllowN also checks that the rate limit is complied with immediately, whereas ReserveN will return OK even if the tokens will not be available until some delay has passed, so we just have to check that the delay is zero if we want the same behaviour.
