Promtail: Add a stream lagging metric #2618
… just use filename.
…ove this in the future if the need arises
Not super familiar with this code, but lgtm.
@@ -34,39 +37,43 @@ const (
	// Label reserved to override the tenant ID while processing
	// pipeline stages
	ReservedLabelTenantID = "__tenant_id__"

	LatencyLabel = "filename"
?
Oh I see, I think we need a better name.
Codecov Report
@@ Coverage Diff @@
## master #2618 +/- ##
==========================================
- Coverage 63.03% 62.83% -0.20%
==========================================
Files 169 169
Lines 15015 15050 +35
==========================================
- Hits 9464 9457 -7
- Misses 4792 4832 +40
- Partials 759 761 +2
…ffect sending for lower volumes, which are still driven by the BatchWait setting. Through the addition of this metric it was found that for higher-volume files (> 100kb/sec) this was much too low, causing too many batches to be sent unnecessarily.
…-date defaults (#3559)
* Doc: Remove removed --ingester.recover-from-wal option
  Removed in: 8a9b94a ("removes recover flag (#3326)")
* Doc: Fix out-of-date defaults
  0b1dbe2 ("Promtail: Add a stream lagging metric (#2618)")
  d3bf21e ("Promtail: (and also fluent-bit) change the max batch size to 1MB (#2710)")
This is similar to what we accomplish with
promtail_file_bytes_total - promtail_read_bytes_total
which gives the difference between the current file size and how far Promtail has read into it. There are problems with those metrics, however, and unfortunately there are problems with this approach too, but I think there are fewer problems with this approach and the output is more meaningful.
How it works:
Every time a batch is successfully sent to Loki, we iterate through all the streams in the batch, take the timestamp of the most recent entry in each stream, subtract it from time.Now(), and report the result as a gauge named
promtail_stream_lag_seconds
with a label for the filename.
Problems with this approach:
All that being said, I do think this helps close the loop, in normal circumstances and for normal-volume log streams, on how long it takes from when log lines are written until we get a 200 indicating Loki has stored them.
NOTE: this PR also changes the default batch size from 100KB to 1MB. It was discovered that 100KB is too small for high-volume log files, which can generate logs at a rate of over 100KB/sec; multiple batches per second were then required, causing Promtail to fall behind. Increasing the batch size to 1MB addresses this issue and does not affect slower streams, which will still be sent based on the BatchWait setting (default 1s).
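The arithmetic behind the batch size change can be sketched as below. This is a rough illustration only: the 300KB/sec write rate is a made-up example, `BatchesPerSecond` is a hypothetical helper, and 1KB is assumed to be 1024 bytes.

```go
package main

import "fmt"

// BatchesPerSecond returns how many full batches per second are needed to
// keep up with a file written at bytesPerSec, ignoring the BatchWait timer.
func BatchesPerSecond(bytesPerSec, batchSizeBytes float64) float64 {
	return bytesPerSec / batchSizeBytes
}

func main() {
	const (
		kb        = 1024.0
		oldBatch  = 100 * kb  // previous default: 100KB
		newBatch  = 1024 * kb // new default: 1MB
		writeRate = 300 * kb  // hypothetical high-volume file: 300KB/sec
	)
	fmt.Printf("100KB batches: %.2f per second\n", BatchesPerSecond(writeRate, oldBatch))
	fmt.Printf("1MB batches:   %.2f per second\n", BatchesPerSecond(writeRate, newBatch))
}
```

With the old default, this example file fills several batches every second. With 1MB it needs less than one full batch per second, so sending is driven by BatchWait instead, as it already is for slower streams.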