Increasing the threshold for a file lag and reducing the severity to warning #1749
After a year of playing with this alert and fixing race conditions, I think the races are fixed. However, the existing tolerance of 100kB is really much too small for busy promtail instances processing 5,000 lines/sec or more.
The original intent of this alert was to catch complete failures to tail a file, as well as bugs in tailing, which have since been squashed. Therefore I think it's appropriate to increase the threshold a file can fall behind to 1MB. This would still catch any new bugs in tailing or other issues, but would not be so flaky when log volume spikes on a promtail instance.
We also often find this alert is a red herring, sensitive to large bursts in log volume, and we don't think it's appropriate to page someone when it fires. However, it's still useful to know when this is happening: if nothing else, it indicates some delay in getting your logs to Loki.
The other reason to change it to warning is that it's hard to define an immediate action anyone can take when it fires, which is another bar we hold for critical alerts.
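For reference, a rough Prometheus-style sketch of what the adjusted rule could look like. The alert name, `for` duration, and annotation text here are illustrative rather than the exact diff in this PR; the expression assumes promtail's `promtail_file_bytes_total` (size of the file on disk) and `promtail_read_bytes_total` (bytes promtail has read from it) counters:

```yaml
groups:
  - name: promtail_alerts
    rules:
      - alert: PromtailFileLagging
        # Fire only when a tailed file falls more than 1MB behind
        # (previously 100kB), tolerating short bursts in log volume.
        expr: |
          abs(promtail_file_bytes_total - promtail_read_bytes_total) > 1e6
        for: 15m
        labels:
          severity: warning  # downgraded from critical
        annotations:
          message: >
            {{ $labels.instance }} {{ $labels.job }} {{ $labels.path }}
            has been lagging by more than 1MB for more than 15m.
```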
Other alerts will catch the cases where Loki is not receiving logs at all.
Signed-off-by: Edward Welch [email protected]