
Distributor ingestion rate limit increased for retries due to ingestion failure #3804

Closed
pracucci opened this issue Feb 9, 2021 · 2 comments · Fixed by #3825

Comments

@pracucci
Contributor

pracucci commented Feb 9, 2021

The distributor's ingestion rate limiter consumes tokens as soon as a request is received, before writing to ingesters:

if !d.ingestionRateLimiter.AllowN(now, userID, totalN) {

In the event of an ingester outage (e.g. 2+ ingesters are unavailable), this means that each tenant remote-write request will consume tokens from its rate limiter even if the samples have not been successfully ingested. The client (e.g. Prometheus) will retry the writes, further consuming tokens from the rate limiter, until it eventually hits the rate limit, regardless of whether any samples have actually been ingested.

The burst should protect against this, but in the event of a relatively long outage we would end up consuming the burst too (e.g. we set the burst to 10x the rate limit).

I'm wondering if a better approach would be to check whether enough tokens are still available in the rate limiter when the request is received, but to actually consume them only after the samples have been successfully written to ingesters. Due to concurrency, the actual accepted rate could be higher than the limit, but we would err in favour of the customer instead of rate limiting writes we haven't actually ingested.

Related discussions:

@bboreham
Contributor

bboreham commented Feb 9, 2021

I think the rate-limiter package has a solution for this.

Instead of calling AllowN(), call ReserveN() and check .OK() to see if the request is within the rate limit.
Then, if the operation fails before ingestion, call Cancel() on the Reservation.

@stevesg
Contributor

stevesg commented Feb 15, 2021

It appears that ReserveN+OK is not directly equivalent to AllowN, but it is close enough that we can use it. AllowN also checks that the rate limit is complied with immediately, whereas ReserveN will return OK even if the tokens will not be available until some delay has passed, so we just have to check that the delay is zero if we want the same behaviour.
