metrics: expose fsync latency as prometheus histogram #2014
Conversation
Force-pushed from 86f3e92 to ab2d6a8
Reviewed 8 of 10 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions
db.go
line 346 at r2 (raw file):
// Can be nil.
metrics struct {
	fsyncLatency prometheus.Histogram
Moved this outside of record.LogWriterMetrics, since LogWriterMetrics can be added and subtracted, but doing so with this histogram doesn't make sense.
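For illustration, a minimal sketch of that separation, with hypothetical struct and field names rather than Pebble's exact layout:

```go
package pebble

import "github.com/prometheus/client_golang/prometheus"

// LogWriterMetrics stays a plain value type whose fields can be added and
// subtracted when metrics from rotated WALs are merged (field illustrative).
type LogWriterMetrics struct {
	WriteThroughputBytes int64
}

// The fsync latency histogram is held separately, because a
// prometheus.Histogram is cumulative and has no meaningful add/subtract.
type logMetrics struct {
	fsyncLatency prometheus.Histogram // can be nil
}
```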
record/log_writer_test.go
line 389 at r2 (raw file):
Name: "", Help: "", ConstLabels: nil,
This needs to be deleted.
Code quote:
Namespace: "",
Subsystem: "",
Name: "",
Help: "",
ConstLabels: nil,
record/log_writer_test.go
line 431 at r2 (raw file):
}

func ValueAtQuantileWindowed(histogram *prometheusgo.Histogram, q float64) float64 {
The ValueAtQuantileWindowed helper is from cockroachdb, so I am not sure it makes sense to perform this quantile check here, given that prometheus histograms don't expose these details by default.
Code quote:
require.LessOrEqual(t, float64(syncLatency/(2*time.Microsecond)),
ValueAtQuantileWindowed(writeTo.Histogram, 90))
require.LessOrEqual(t, int64(syncLatency/2), int64(m.WriteThroughput.WorkDuration))
}
func ValueAtQuantileWindowed(histogram *prometheusgo.Histogram, q float64) float64 {
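For reference, a rough sketch of the kind of bucket interpolation such a helper performs on a prometheusgo (client_model) histogram; this only illustrates the idea and is not cockroach's actual implementation:

```go
package record

import (
	prometheusgo "github.com/prometheus/client_model/go"
)

// valueAtQuantile estimates the value at percentile q (e.g. 90) by walking
// the cumulative bucket counts and linearly interpolating within the bucket
// that contains the target rank.
func valueAtQuantile(h *prometheusgo.Histogram, q float64) float64 {
	n := float64(h.GetSampleCount())
	if n == 0 {
		return 0
	}
	rank := (q / 100) * n
	var prevCount, prevBound float64
	for _, b := range h.GetBucket() {
		count := float64(b.GetCumulativeCount())
		bound := b.GetUpperBound()
		if count >= rank {
			if count == prevCount {
				return bound
			}
			// Interpolate between the bucket's lower and upper bounds.
			return prevBound + (bound-prevBound)*(rank-prevCount)/(count-prevCount)
		}
		prevCount, prevBound = count, bound
	}
	return prevBound
}
```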
Reviewed 2 of 10 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @coolcom200)
db.go
line 346 at r2 (raw file):
Previously, coolcom200 (Leon Fattakhov) wrote…
Moved this outside of record.LogWriterMetrics, since LogWriterMetrics can be added and subtracted, but doing so with this histogram doesn't make sense.
Ack, makes sense
metrics.go
line 247 at r2 (raw file):
// FsyncLatencyBuckets are prometheus histogram buckets suitable for a histogram
// that records latencies for fsyncs.
FsyncLatencyBuckets = prometheus.LinearBuckets(0, float64(100*time.Microsecond), 50)
is there an exponential bucket division that maintains at most 100-200µs divisions between [0, 5ms], with a max of, say, ~10s?
open.go
line 428 at r2 (raw file):
logWriterConfig := record.LogWriterConfig{
	WALMinSyncInterval: d.opts.WALMinSyncInterval,
	WALFsync:           d.mu.log.metrics.fsyncLatency,
nit: how about WALFsyncLatency?
record/log_writer.go
line 455 at r2 (raw file):
f.Lock()
if synced {
	f.fsyncLatency.Observe(float64(syncLatency))
can we keep the nil check? e.g., if synced && f.fsyncLatency != nil
It's convenient for tests that don't need the histogram, and for other uses of the LogWriter outside of Pebble that also don't need the histogram; Cockroach uses the LogWriter for its encryption-at-rest file registry.
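A minimal sketch of the suggested guard (assuming fsyncLatency is a prometheus.Histogram field that callers may leave nil):

```go
f.Lock()
if synced && f.fsyncLatency != nil {
	// Only record the observation when a histogram was configured.
	f.fsyncLatency.Observe(float64(syncLatency))
}
```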
Force-pushed from 6bfa06a to a6d9a65
Reviewed 5 of 7 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @jbowens)
metrics.go
line 247 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
is there an exponential bucket division that maintains at most 100-200u divisions between [0,5ms]? with a max of say ~10s.
This is possible as there are the following helpful functions:
ExponentialBucketsRange(min, max float64, count int)
ExponentialBuckets(start, factor float64, count int)
With a bit of fiddling we can fine-tune the number of buckets and their sizes. The main question here is what the start/min value should be. 100µs? (as in, the first bucket is 0-100µs and the rest follow the distribution)
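For example, one of the options being discussed might look like the following (start, max, and count are assumptions, not settled values):

```go
// 50 exponentially spaced buckets covering 100µs up to ~10s.
buckets := prometheus.ExponentialBucketsRange(
	float64(100*time.Microsecond), float64(10*time.Second), 50,
)
```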
record/log_writer.go
line 455 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
can we keep the nil check? e.g., if synced && f.fsyncLatency != nil
It's convenient for tests that don't need the histogram, and for other uses of the LogWriter outside of Pebble that also don't need the histogram; Cockroach uses the LogWriter for its encryption-at-rest file registry.
Done.
record/log_writer_test.go
line 389 at r2 (raw file):
Previously, coolcom200 (Leon Fattakhov) wrote…
This needs to be deleted.
Done.
record/log_writer_test.go
line 431 at r2 (raw file):
Previously, coolcom200 (Leon Fattakhov) wrote…
The ValueAtQuantileWindowed helper is from cockroachdb, so I am not sure it makes sense to perform this quantile check here, given that prometheus histograms don't expose these details by default.
@jbowens Do you have any thoughts on this?
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @coolcom200)
metrics.go
line 247 at r2 (raw file):
Previously, coolcom200 (Leon Fattakhov) wrote…
This is possible as there are the following helpful functions:
ExponentialBucketsRange(min, max float64, count int)
ExponentialBuckets(start, factor float64, count int)
With a bit of fiddling we can fine-tune the number of buckets and their sizes. The main question here is what the start/min value should be. 100µs? (as in, the first bucket is 0-100µs and the rest follow the distribution)
100u is a good starting bucket, but I imagine we'll need to experiment with adjusting start/factor values to get the right distribution.
Force-pushed from a6d9a65 to 058a32c
Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jbowens)
metrics.go
line 247 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
100u is a good starting bucket, but I imagine we'll need to experiment with adjusting start/factor values to get the right distribution.
Decided to go with linear buckets for the first 0-5ms and then exponential buckets from 5ms onwards.
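A sketch of how such a combined layout could be assembled with the prometheus bucket helpers; the step width, upper bound, and bucket counts here are illustrative, not necessarily the exact values that landed:

```go
package pebble

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// FsyncLatencyBuckets: linear 100µs-wide buckets up to 5ms, then
// exponentially spaced buckets from 5ms up to ~10s.
var FsyncLatencyBuckets = append(
	prometheus.LinearBuckets(0, float64(100*time.Microsecond), 50),
	prometheus.ExponentialBucketsRange(float64(5*time.Millisecond), float64(10*time.Second), 50)...,
)
```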
Previously, the fsync latency was computed using an HDR histogram; however, this histogram generated too many buckets of various widths. As a result, a prometheus histogram was chosen instead. In addition, the metrics were being accumulated and stored on each `LogWriter` and then merged into the cumulative metrics on each WAL rotation. With the new prometheus histogram, writes go to a single histogram whose lifetime is independent of any individual WAL. Additionally, remove all the callback listeners and configuration for the callback-based approach introduced in: 5fdb3ea
Force-pushed from 058a32c to fbe8309
Reviewed 2 of 10 files at r1, 4 of 7 files at r3, 1 of 2 files at r4, 1 of 2 files at r5, all commit messages.
Reviewable status: 9 of 10 files reviewed, 1 unresolved discussion (waiting on @coolcom200)
TFTR!
90082: metrics: expose pebble fsync latency r=jbowens,tbg a=coolcom200

Given that pebble produces fsync latency as a prometheus histogram, create a new histogram `ManualWindowHistogram` that implements windowing to ensure that the data can be exported in Cockroach DB. This histogram does not collect values over time and expose them; instead, it allows the cumulative and windowed metrics to be replaced. This means that the client is responsible for managing the replacement of the data to ensure that it is exported properly.

Additionally, introduce a new struct `MetricsForInterval` to store the previous metrics. In order to perform subtraction between histograms they need to be converted to `prometheusgo.Metric`s, which means that they cannot be stored on the `pebble.Metrics`. Hence either these metrics need to be added to `storage.Metrics`, which is confusing since there would then be two metrics that represent the same data in different formats:

```go
storage.pebble.Metrics.FsyncLatency prometheus.Histogram
storage.FsyncLatency                prometheusgo.Metric
```

Or a new struct is created that will contain the metrics needed to compute the metrics over an interval. The new struct was chosen since it was easier to understand.

Release note: None

Depends on: #89459 cockroachdb/pebble#2014

Epic: CRDB-17515

Co-authored-by: Leon Fattakhov <[email protected]>
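For intuition, a rough sketch of the windowed-subtraction idea described above, computing a per-interval histogram from two cumulative prometheusgo snapshots; the names and structure are illustrative and this is not cockroach's ManualWindowHistogram:

```go
package metricsutil

import (
	prometheusgo "github.com/prometheus/client_model/go"
)

func uint64p(v uint64) *uint64    { return &v }
func float64p(v float64) *float64 { return &v }

// subtractHistogram returns cur minus prev, bucket by bucket, yielding the
// distribution observed during the interval between the two snapshots.
// It assumes both snapshots use identical bucket boundaries.
func subtractHistogram(cur, prev *prometheusgo.Histogram) *prometheusgo.Histogram {
	out := &prometheusgo.Histogram{
		SampleCount: uint64p(cur.GetSampleCount() - prev.GetSampleCount()),
		SampleSum:   float64p(cur.GetSampleSum() - prev.GetSampleSum()),
	}
	for i, b := range cur.GetBucket() {
		out.Bucket = append(out.Bucket, &prometheusgo.Bucket{
			UpperBound:      float64p(b.GetUpperBound()),
			CumulativeCount: uint64p(b.GetCumulativeCount() - prev.GetBucket()[i].GetCumulativeCount()),
		})
	}
	return out
}
```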
Previously, the fsync latency was computed using an HDR histogram; however, this histogram generated too many buckets of various widths. As a result, a prometheus histogram was chosen instead. In addition, the metrics were being accumulated and stored on each LogWriter and then merged into the cumulative metrics on each WAL rotation. With the new prometheus histogram, writes go to a single histogram whose lifetime is independent of any individual WAL.

Additionally, remove all the callback listeners and configuration for the callback-based approach introduced in: 5fdb3ea