
metrics: expose fsync latency as prometheus histogram #2014

Merged

Conversation

@coolcom200 (Contributor) commented Oct 17, 2022

Previously, the fsync latency was computed using an HDR histogram; however,
that histogram generated too many buckets of varying widths. As a result, a
prometheus histogram was chosen instead. In addition, the metrics used to be
accumulated and stored on each LogWriter and then merged into the cumulative
metrics on each WAL rotation. With the new prometheus histogram, all writes go
to a single histogram whose lifetime is independent of the WAL.

Additionally, remove all the callback listeners and configurations for
the callback-based approach introduced in: 5fdb3ea
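For orientation, here is a minimal, hypothetical sketch of the approach described above: one long-lived prometheus histogram that every LogWriter observes fsync durations into. The metric name and help text are made up, and the buckets and nanosecond units simply mirror the values quoted later in this review; this is not the PR's actual code.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical metric; name and help text are illustrative only.
	// Buckets and observations are in nanoseconds, mirroring the
	// float64(time.Duration) values used in the quoted diff.
	fsyncLatency := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "example_wal_fsync_latency",
		Help:    "Illustrative fsync latency histogram (nanoseconds).",
		Buckets: prometheus.LinearBuckets(0, float64(100*time.Microsecond), 50),
	})
	prometheus.MustRegister(fsyncLatency)

	start := time.Now()
	// ... fsync the WAL file here ...
	fsyncLatency.Observe(float64(time.Since(start)))
}
```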

@cockroach-teamcity (Member) commented

This change is Reviewable

@coolcom200 force-pushed the 85755-fsync-latency-prometheus branch from 86f3e92 to ab2d6a8 on October 17, 2022 17:40
@coolcom200 (Contributor, Author) left a comment

Reviewed 8 of 10 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions


db.go line 346 at r2 (raw file):

			// Can be nil.
			metrics struct {
				fsyncLatency prometheus.Histogram

Moved this outside of record.LogWriterMetrics, since LogWriterMetrics values can be added and subtracted, and doing that with this histogram doesn't make sense.


record/log_writer_test.go line 389 at r2 (raw file):

		Name:        "",
		Help:        "",
		ConstLabels: nil,

This needs to be deleted.

Code quote:

		Namespace:   "",
		Subsystem:   "",
		Name:        "",
		Help:        "",
		ConstLabels: nil,

record/log_writer_test.go line 431 at r2 (raw file):

}

func ValueAtQuantileWindowed(histogram *prometheusgo.Histogram, q float64) float64 {

ValueAtQuantileWindowed is from cockroachdb, so I'm not sure it makes sense to keep a test here that checks for valid quantiles, given that prometheus histograms don't support pulling these details by default.

Code quote:

	require.LessOrEqual(t, float64(syncLatency/(2*time.Microsecond)),
		ValueAtQuantileWindowed(writeTo.Histogram, 90))
	require.LessOrEqual(t, int64(syncLatency/2), int64(m.WriteThroughput.WorkDuration))
}

func ValueAtQuantileWindowed(histogram *prometheusgo.Histogram, q float64) float64 {
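For readers unfamiliar with the helper: prometheus histograms only expose cumulative bucket counts, so a quantile has to be approximated by interpolating within the bucket that contains the target rank. The sketch below shows that idea over the prometheusgo data model; quantileFromBuckets is a hypothetical name, and this is not CockroachDB's ValueAtQuantileWindowed implementation.

```go
package example

import prometheusgo "github.com/prometheus/client_model/go"

// quantileFromBuckets approximates the value at quantile q (0-100) by linearly
// interpolating within the histogram bucket containing the target rank.
// Illustrative only.
func quantileFromBuckets(h *prometheusgo.Histogram, q float64) float64 {
	total := float64(h.GetSampleCount())
	if total == 0 {
		return 0
	}
	rank := q / 100 * total
	var prevCount, prevBound float64
	for _, b := range h.GetBucket() {
		count := float64(b.GetCumulativeCount())
		bound := b.GetUpperBound()
		if count >= rank {
			if count == prevCount {
				return bound
			}
			// Linear interpolation between the previous and current bucket bounds.
			return prevBound + (bound-prevBound)*(rank-prevCount)/(count-prevCount)
		}
		prevCount, prevBound = count, bound
	}
	// Rank falls in the implicit +Inf bucket; fall back to the largest finite bound.
	return prevBound
}
```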

@coolcom200 coolcom200 marked this pull request as ready for review October 17, 2022 19:07
@coolcom200 coolcom200 requested review from a team and jbowens October 17, 2022 19:08
@jbowens (Collaborator) left a comment

Reviewed 2 of 10 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @coolcom200)


db.go line 346 at r2 (raw file):

Previously, coolcom200 (Leon Fattakhov) wrote…

Moved this outside of record.LogWriterMetrics, since LogWriterMetrics values can be added and subtracted, and doing that with this histogram doesn't make sense.

Ack, makes sense


metrics.go line 247 at r2 (raw file):

	// FsyncLatencyBuckets are prometheus histogram buckets suitable for a histogram
	// that records latencies for fsyncs.
	FsyncLatencyBuckets = prometheus.LinearBuckets(0, float64(100*time.Microsecond), 50)

Is there an exponential bucket division that maintains at most 100-200µs-wide divisions between [0, 5ms], with a max of, say, ~10s?


open.go line 428 at r2 (raw file):

		logWriterConfig := record.LogWriterConfig{
			WALMinSyncInterval: d.opts.WALMinSyncInterval,
			WALFsync:           d.mu.log.metrics.fsyncLatency,

nit: how about WALFsyncLatency


record/log_writer.go line 455 at r2 (raw file):

		f.Lock()
		if synced {
			f.fsyncLatency.Observe(float64(syncLatency))

can we keep the nil check? eg, if synced && f.fsyncLatency != nil

it's convenient for tests that don't need the histogram, and for other uses of the LogWriter outside of Pebble that also don't need the histogram. Cockroach uses the LogWriter for its encryption-at-rest file registry
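A tiny sketch of the guarded observation being suggested, assuming the field and variable names from the quoted diff; the type and method names here stand in for Pebble's actual flusher code and are not its real API.

```go
package example

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// flusherSketch stands in for the LogWriter's flusher state; names are illustrative.
type flusherSketch struct {
	sync.Mutex
	fsyncLatency prometheus.Histogram // may be nil
}

func (f *flusherSketch) recordSync(synced bool, syncLatency time.Duration) {
	f.Lock()
	defer f.Unlock()
	// Guarding on nil keeps the histogram optional, which is convenient for
	// tests and for LogWriter users outside Pebble.
	if synced && f.fsyncLatency != nil {
		f.fsyncLatency.Observe(float64(syncLatency))
	}
}
```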

@coolcom200 force-pushed the 85755-fsync-latency-prometheus branch 3 times, most recently from 6bfa06a to a6d9a65 on October 17, 2022 22:56
@coolcom200 (Contributor, Author) left a comment

Reviewed 5 of 7 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @jbowens)


metrics.go line 247 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Is there an exponential bucket division that maintains at most 100-200µs-wide divisions between [0, 5ms], with a max of, say, ~10s?

This is possible as there are the following helpful functions:

  1. ExponentialBucketsRange(min, max float64, count int)
  2. ExponentialBuckets(start, factor float64, count int)

With a bit of fiddling we can fine-tune the number of buckets and their sizes. The main question here is what the start/min value should be. 100µs? (as in, the first bucket is 0 - 100µs and the rest follow the distribution)
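To make those two helpers concrete, a small sketch of what they produce; the 100µs start, ~10s max, 1.25 factor, and 50-bucket counts are placeholders for the values still being discussed, not a decision.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// 50 bucket boundaries spaced exponentially between 100µs and 10s
	// (values in nanoseconds, matching the durations observed in the diff).
	byRange := prometheus.ExponentialBucketsRange(
		float64(100*time.Microsecond), float64(10*time.Second), 50)

	// 50 bucket boundaries starting at 100µs, each 1.25x the previous one.
	byFactor := prometheus.ExponentialBuckets(float64(100*time.Microsecond), 1.25, 50)

	fmt.Println(byRange[:3])  // first few boundaries from the range-based helper
	fmt.Println(byFactor[:3]) // first few boundaries from the factor-based helper
}
```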


record/log_writer.go line 455 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

can we keep the nil check? eg, if synced && f.fsyncLatency != nil

it's convenient for tests that don't need the histogram, and for other uses of the LogWriter outside of Pebble that also don't need the histogram. Cockroach uses the LogWriter for its encryption-at-rest file registry

Done.


record/log_writer_test.go line 389 at r2 (raw file):

Previously, coolcom200 (Leon Fattakhov) wrote…

This needs to be deleted.

Done.


record/log_writer_test.go line 431 at r2 (raw file):

Previously, coolcom200 (Leon Fattakhov) wrote…

ValueAtQuantileWindowed is from cockroachdb, so I'm not sure it makes sense to keep a test here that checks for valid quantiles, given that prometheus histograms don't support pulling these details by default.

@jbowens Do you have any thoughts on this?

@jbowens (Collaborator) left a comment

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @coolcom200)


metrics.go line 247 at r2 (raw file):

Previously, coolcom200 (Leon Fattakhov) wrote…

This is possible as there are the following helpful functions:

  1. ExponentialBucketsRange(min, max float64, count int)
  2. ExponentialBuckets(start, factor float64, count int)

With a bit of fiddling we can fine-tune the number of buckets and their sizes. The main question here is what the start/min value should be. 100µs? (as in, the first bucket is 0 - 100µs and the rest follow the distribution)

100µs is a good starting bucket, but I imagine we'll need to experiment with adjusting start/factor values to get the right distribution.

@coolcom200 force-pushed the 85755-fsync-latency-prometheus branch from a6d9a65 to 058a32c on October 18, 2022 19:14
@coolcom200 (Contributor, Author) left a comment

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jbowens)


metrics.go line 247 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

100µs is a good starting bucket, but I imagine we'll need to experiment with adjusting start/factor values to get the right distribution.

Decided to go with linear buckets for the first 0 - 5ms and then exponential buckets from 5ms onwards.
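A sketch of what that combined layout could look like using the client_golang helpers; the 50/50 bucket counts and the ~10s upper bound are assumptions for illustration, not necessarily what was merged.

```go
package main

import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Linear 100µs-wide buckets covering [0, 5ms), then exponential buckets
	// from 5ms up to ~10s. All values are in nanoseconds.
	buckets := append(
		prometheus.LinearBuckets(0, float64(100*time.Microsecond), 50),
		prometheus.ExponentialBucketsRange(float64(5*time.Millisecond), float64(10*time.Second), 50)...,
	)
	fmt.Println(len(buckets)) // 100 bucket boundaries
}
```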

@coolcom200 force-pushed the 85755-fsync-latency-prometheus branch from 058a32c to fbe8309 on October 18, 2022 19:17
@jbowens (Collaborator) left a comment

:lgtm_strong:

Reviewed 2 of 10 files at r1, 4 of 7 files at r3, 1 of 2 files at r4, 1 of 2 files at r5, all commit messages.
Reviewable status: 9 of 10 files reviewed, 1 unresolved discussion (waiting on @coolcom200)

@coolcom200 (Contributor, Author) commented

TFTR!

@coolcom200 coolcom200 merged commit 0090519 into cockroachdb:master Oct 19, 2022
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Oct 24, 2022
90082: metrics: expose pebble fsync latency r=jbowens,tbg a=coolcom200

Given that pebble produces fsync latency as a prometheus histogram,
create a new histogram `ManualWindowHistogram` that implements windowing
to ensure that the data can be exported in CockroachDB. This histogram
does not collect values over time and expose them; instead, it allows
the cumulative and windowed metrics to be replaced. This means that
the client is responsible for managing the replacement of the data to
ensure that it is exported properly.

Additionally, introduce a new struct `MetricsForInterval` to store the
previous metrics. In order to perform subtraction between histograms,
they need to be converted to `prometheusgo.Metric`s, which means they
cannot be stored on `pebble.Metrics`. Hence either these metrics need
to be added to `storage.Metrics`, which is confusing since that means
there are now two metrics that represent the same data in different
formats:

```go
storage.pebble.Metrics.FsyncLatency prometheus.Histogram
storage.FsyncLatency prometheusgo.Metric
```
Or a new struct is created that will contain the metrics needed to
compute the metrics over an interval. The new struct was chosen since it
was easier to understand.


Release note: None
Depends on: #89459 cockroachdb/pebble#2014
Epic: CRDB-17515


Co-authored-by: Leon Fattakhov <[email protected]>
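As a rough illustration of the windowing idea described in that commit message (not the actual ManualWindowHistogram code), a windowed view can be produced by snapshotting the cumulative histogram at interval boundaries and subtracting the previous snapshot's bucket counts; windowedCounts below is a hypothetical helper.

```go
package example

import prometheusgo "github.com/prometheus/client_model/go"

// windowedCounts returns the per-bucket counts observed during the interval
// between the previous snapshot and the current one. Both snapshots must use
// the same bucket layout. Illustrative only.
func windowedCounts(cur, prev *prometheusgo.Histogram) []uint64 {
	out := make([]uint64, len(cur.GetBucket()))
	for i, b := range cur.GetBucket() {
		out[i] = b.GetCumulativeCount()
		if prev != nil && i < len(prev.GetBucket()) {
			out[i] -= prev.GetBucket()[i].GetCumulativeCount()
		}
	}
	return out
}
```

A prometheus.Histogram can be dumped into this prometheusgo form via its Write method, which is how the cumulative snapshot would be captured at each interval.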