promtail: Retry 429 rate limit errors from Loki, increase default retry limits #1840
Conversation
… configuring multiple client sections in promtail, also increased the backoff and retry settings in promtail. Signed-off-by: Edward Welch <[email protected]>
Looks good, not sure what’s contentious here.
However, I’d probably start the minimum retry at 1s. Maybe increase the maximum backoff, too, assuming we won’t overbuffer in memory (do we have backpressure so we won’t continue reading files when we can’t push?)
We definitely seem vulnerable to network partitions, but that can't be helped without some sort of WAL and I don't want to go down that route (at least not yet).
flag.DurationVar(&c.BackoffConfig.MinBackoff, "client.min-backoff", 100*time.Millisecond, "Initial backoff time between retries.")
flag.DurationVar(&c.BackoffConfig.MaxBackoff, "client.max-backoff", 5*time.Second, "Maximum backoff time between retries.")
// Default backoff schedule: 0.5s, 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s(4.267m) For a total time of 511.5s(8.5m) before logs are lost
flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retires when sending batches.")
Suggested change:
- flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retires when sending batches.")
+ flag.IntVar(&c.BackoffConfig.MaxRetries, "client.max-retries", 10, "Maximum number of retries when sending batches.")
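The schedule quoted in the comment above is a 0.5s initial backoff doubled on each of the 10 retries (the values settled on later in this thread), with a max backoff large enough that the cap is never hit. Here is a minimal Go sketch, not promtail's actual backoff code, that reproduces the arithmetic:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed values from the discussion in this thread: 0.5s initial backoff,
	// doubled after every attempt, 10 retries, and a max backoff large enough
	// that the doubling is never capped.
	minBackoff := 500 * time.Millisecond
	maxBackoff := 5 * time.Minute
	maxRetries := 10

	total := time.Duration(0)
	backoff := minBackoff
	for i := 0; i < maxRetries; i++ {
		fmt.Printf("retry %2d: wait %v\n", i+1, backoff)
		total += backoff
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	// Prints 8m31.5s (511.5s), matching the schedule in the comment above.
	fmt.Printf("total time before the batch is dropped: %v\n", total)
}
```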
@@ -68,6 +68,11 @@ Supported contents and default values of `config.yaml`:

# Describes how Promtail connects to multiple instances
# of Loki, sending logs to each.
# WARNING: If one of the remote Loki servers fails to respond or responds
# with any error which is retriable, this will impact sending logs to any
Suggested change:
- # with any error which is retriable, this will impact sending logs to any
+ # with any error which is retryable, this will impact sending logs to any
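The warning describes the coupling between configured clients: a client stuck in its retry loop holds up delivery to the others. Below is a toy Go illustration of that effect; the `client` and `multiClient` types are hypothetical, not promtail's real implementation.

```go
package main

import (
	"fmt"
	"time"
)

// client stands in for one configured Loki client; Push blocks until the
// batch is accepted or its retries are exhausted.
type client struct {
	name  string
	delay time.Duration // simulates time spent retrying a failing server
}

func (c *client) Push(batch string) {
	time.Sleep(c.delay)
	fmt.Printf("%s: pushed %q\n", c.name, batch)
}

// multiClient hands each batch to every configured client in turn, so one
// client stuck retrying delays delivery to the healthy ones.
type multiClient struct {
	clients []*client
}

func (m *multiClient) Push(batch string) {
	for _, c := range m.clients {
		c.Push(batch)
	}
}

func main() {
	m := &multiClient{clients: []*client{
		{name: "loki-a", delay: 3 * time.Second}, // rate limited, stuck in retries
		{name: "loki-b", delay: 0},               // healthy, but has to wait
	}}
	start := time.Now()
	m.Push("batch-1")
	fmt.Println("delivering to both took", time.Since(start))
}
```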
I went with 0.5s because our batch send interval is 1s; this gives at least one retry before the next batch is ready to send, in case of minor interruptions to the internet connection, etc. With the current setting of 10 retries we would never reach the max backoff (although we get close). And yeah, basically everything on the read-to-send path is synchronous: we do use a channel from each reader to send to the batcher, but it's an unbuffered channel, so it blocks while a batch is being sent, which in turn blocks the readers from reading more from the file.
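As a rough illustration of that backpressure (toy types, not promtail's actual reader or batcher), an unbuffered channel makes the reader block whenever the downstream batcher is busy sending or backing off:

```go
package main

import (
	"fmt"
	"time"
)

// entry is a toy stand-in for a promtail log entry.
type entry struct {
	line string
}

func main() {
	// Unbuffered channel: a send blocks until the batcher receives, so a
	// batcher that is busy sending (or backing off) stalls the reader too.
	entries := make(chan entry)

	// Batcher goroutine: pretend every batch send, with retries, takes 2s.
	go func() {
		for e := range entries {
			time.Sleep(2 * time.Second)
			fmt.Println("sent:", e.line)
		}
	}()

	// Reader: cannot outrun the batcher because the send below blocks.
	for i := 0; i < 3; i++ {
		e := entry{line: fmt.Sprintf("log line %d", i)}
		fmt.Println("read:", e.line)
		entries <- e // backpressure happens here
	}
	time.Sleep(3 * time.Second) // let the last sends finish before exiting
}
```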
LGTM. However, I'm not sure whether the Docker driver and Fluent Bit will use those defaults; it's worth double-checking how they are set up.
From what I could see, they do not override these defaults anywhere, nor in any of the example config files, so I think we are good to go here.
Promtail version: latest
Currently promtail will only retry 500 errors from Loki, but we send rate limit errors as 429's.
There has been discussion about this behavior a few times, with the current implementation basically following this logic:
If a client is sending so many logs that the server is rate limiting, retrying only makes the problem worse as now you have the original volume plus retry volume.
Through discussion, other valid cases have come up which would benefit from retrying 429's, such as a rate overage lasting longer than our burst limit allows, or hitting the rate limit while recovering from a Loki server being down.
At any rate, this change mostly just moves where the logs get dropped if you hit rate limits and are never successful in sending below the threshold.
Now the behavior will be to sit and retry sending a batch while reading from the log file stalls; if the 429's clear, promtail should be able to catch up and send all logs.
If the 429's do not clear, eventually the underlying file will roll, and when promtail reads again it will miss what was in the rolled file (with some caveats: we do try to send one last time from a rolled file, however this may or may not succeed depending on the response from the server).
This PR also introduces larger backoff and retry defaults in promtail, allowing up to about 8.5 minutes of attempts before giving up on the batch and discarding it.
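A hedged sketch of the retry policy described above; the `retryable` and `sendBatch` helpers are illustrative, not promtail's actual code. 429 is treated as retryable alongside 5xx, and a batch is dropped only on a non-retryable status or once the retries (and their backoff, roughly 8.5 minutes with the new defaults) are exhausted:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// retryable reflects the behavior described above: 5xx responses were
// already retried, and 429 (rate limited) is now retried as well.
func retryable(status int) bool {
	return status == http.StatusTooManyRequests || status/100 == 5
}

// sendBatch is a hypothetical helper: sendOnce performs one push and
// returns the HTTP status; the batch is dropped only on a non-retryable
// status or once the retries are exhausted.
func sendBatch(sendOnce func() (int, error), minBackoff, maxBackoff time.Duration, maxRetries int) error {
	backoff := minBackoff
	for attempt := 0; attempt < maxRetries; attempt++ {
		status, err := sendOnce()
		if err == nil && status/100 == 2 {
			return nil // batch accepted
		}
		if err == nil && !retryable(status) {
			return fmt.Errorf("dropping batch: non-retryable status %d", status)
		}
		time.Sleep(backoff) // file reading stalls while we wait
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return fmt.Errorf("dropping batch: retries exhausted")
}

func main() {
	// Simulate a server that rate limits twice before accepting the batch.
	calls := 0
	err := sendBatch(func() (int, error) {
		calls++
		if calls < 3 {
			return http.StatusTooManyRequests, nil
		}
		return http.StatusNoContent, nil
	}, 500*time.Millisecond, 5*time.Minute, 10)
	fmt.Println("result:", err, "after", calls, "attempts")
}
```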
Signed-off-by: Edward Welch [email protected]