Promtail: Add a stream lagging metric #2618

Merged: 11 commits into master from promtail-stream-lagging on Sep 18, 2020

Conversation

@slim-bean (Collaborator) commented Sep 11, 2020

This is similar to what we accomplish with promtail_file_bytes_total - promtail_read_bytes_total, which gives the difference between the current file size and how far Promtail has read into it.

There are problems with those metrics, however, and unfortunately there are problems with this approach too, but I think this approach has fewer of them and its output is more meaningful.

How it works:

Every time a batch is successfully sent to Loki, we iterate through all the streams in the batch, take the most recent entry in each, subtract its timestamp from time.Now(), and report the result as a gauge named promtail_stream_lag_seconds with a label for the filename.
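
For illustration only, here is a minimal Go sketch of that update using client_golang; the metric name and label come from the description above, while the package name, help text, and the recordStreamLag helper are assumptions rather than the PR's actual code.

package client

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// streamLag mirrors the metric described above: promtail_stream_lag_seconds
// with a single "filename" label. promauto registers it with the default registry.
var streamLag = promauto.NewGaugeVec(prometheus.GaugeOpts{
    Namespace: "promtail",
    Name:      "stream_lag_seconds",
    Help:      "Seconds between now and the newest entry in the last successfully sent batch.",
}, []string{"filename"})

// recordStreamLag (hypothetical helper) would be called once per stream after
// Loki acknowledges a batch, with the timestamp of that stream's newest entry.
func recordStreamLag(filename string, newestEntry time.Time) {
    streamLag.WithLabelValues(filename).Set(time.Since(newestEntry).Seconds())
}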

Problems with this approach:

  • This metric adds very little value if Promtail is assigning the timestamp when the line is read; it might still be useful, but not as useful as having Promtail extract the timestamp from the log line itself.
  • If you are extracting the log line timestamp and the timestamps are somehow old or delayed, then this metric will be skewed.
  • Slowly updated log files will show a lag when none really exists. For example, if a log file writes an entry and it took 3s to send it, that metric will be present and scraped until the next log line for that stream is processed and changes the value. If no new log lines are written to that file, the metric will continue to report 3s until it expires and is automatically removed after 1m, or the file stops being tailed. This leads to some strange artifacts of 1m-long flat lines in the output; we could crank the expiration down to a much smaller value, but then you would likely lose visibility into slowly written log files which really are far behind. (A sketch of such an expiring gauge follows this list.)
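
The following is a rough sketch of how that expiry behaviour can work; it is hypothetical and not the PR's implementation (the PR touches its own metric helper in pkg/logentry/metric). The idea is simply to remember when each label value was last set and periodically drop series that have gone stale.

package client

import (
    "sync"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

// expiringGauge wraps a GaugeVec and remembers when each series was last set,
// so streams that stop updating can be dropped instead of reporting stale lag forever.
type expiringGauge struct {
    mtx     sync.Mutex
    gauge   *prometheus.GaugeVec
    updated map[string]time.Time
    maxAge  time.Duration // e.g. 1 minute, matching the expiry described above
}

func (e *expiringGauge) set(filename string, v float64) {
    e.mtx.Lock()
    defer e.mtx.Unlock()
    e.gauge.WithLabelValues(filename).Set(v)
    e.updated[filename] = time.Now()
}

// expire is meant to run periodically; it deletes series not updated within maxAge.
func (e *expiringGauge) expire() {
    e.mtx.Lock()
    defer e.mtx.Unlock()
    for filename, last := range e.updated {
        if time.Since(last) > e.maxAge {
            e.gauge.DeleteLabelValues(filename)
            delete(e.updated, filename)
        }
    }
}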

All that being said, I do think this helps close the loop, in normal circumstances and for normal-volume log streams, on how long it takes from when lines are written until we get a 200 indicating Loki has stored them.

NOTE: this PR also changes the default batch size from 100KB to 1MB. It was discovered that 100KB is too small for high-volume log files, which can generate logs at a rate of over 100KB/sec, meaning multiple batches per second were required and Promtail would fall behind. Increasing this to 1MB addresses the issue and does not affect slower streams, which will still be sent based on the BatchWait setting (default 1s).
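
As a rough illustration of the flush rule this note implies (a sketch under assumed names, not the actual promtail client code): a batch is sent as soon as it grows past the size limit, otherwise it is sent once BatchWait has elapsed.

package client

import "time"

const (
    defaultBatchSizeBytes = 1024 * 1024     // 1MB, raised from 100KB in this PR
    defaultBatchWait      = 1 * time.Second // unchanged; still flushes slow streams
)

// shouldFlush (hypothetical) captures the two triggers described in the note above:
// the batch has grown past the size limit, or it has been waiting long enough.
func shouldFlush(batchSizeBytes int, batchAge time.Duration) bool {
    return batchSizeBytes >= defaultBatchSizeBytes || batchAge >= defaultBatchWait
}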

@owen-d (Member) left a comment

Not super familiar with this code, but lgtm.

@@ -34,39 +37,43 @@ const (
// Label reserved to override the tenant ID while processing
// pipeline stages
ReservedLabelTenantID = "__tenant_id__"

LatencyLabel = "filename"
Contributor commented on LatencyLabel = "filename":
?

Contributor:

Oh, I see. I think we need a better name.

@codecov-commenter commented

Codecov Report

Merging #2618 into master will decrease coverage by 0.19%.
The diff coverage is 28.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2618      +/-   ##
==========================================
- Coverage   63.03%   62.83%   -0.20%     
==========================================
  Files         169      169              
  Lines       15015    15050      +35     
==========================================
- Hits         9464     9457       -7     
- Misses       4792     4832      +40     
- Partials      759      761       +2     
Impacted Files                             Coverage Δ
pkg/logentry/metric/metricvec.go           75.00% <0.00%> (-21.43%) ⬇️
pkg/promtail/api/types.go                  20.00% <ø> (ø)
pkg/promtail/targets/file/filetarget.go    64.40% <0.00%> (-1.88%) ⬇️
pkg/promtail/client/client.go              77.77% <40.00%> (-7.38%) ⬇️
pkg/promtail/positions/positions.go        46.49% <0.00%> (-13.16%) ⬇️

…ffect sending for lower volumes which are still driven by the BatchWait setting. Through the addition of this metric it was found that for higher volume files > 100kb/sec this was much too low causing too many batches to be unnecessarily sent.
@slim-bean slim-bean merged commit 0b1dbe2 into master Sep 18, 2020
@slim-bean slim-bean deleted the promtail-stream-lagging branch September 18, 2020 16:15
owen-d pushed a commit that referenced this pull request Jul 8, 2021
…-date defaults (#3559)

* Doc: Remove removed --ingester.recover-from-wal option

Removed in:
8a9b94a ("removes recover flag (#3326)")

* Doc: Fix out-of-date defaults

0b1dbe2 ("Promtail: Add a stream lagging metric (#2618)")
d3bf21e ("Promtail: (and also fluent-bit) change the max batch size to 1MB  (#2710)")