Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metric: expose pebble fsync latency with a callback #2001

Merged

Conversation

coolcom200
Copy link
Contributor

The fsync latency was computed using an HDR histogram however this
histogram generates too many buckets of various widths. In order to
allow clients to store the fsync latency in their preferred format, a
callback was introduced. This callback is defined on the
pebble.Options which enables clients to provide a callback function
which will be triggered each time pebble produces the fsync latency
metrics.

Additionally the record.NewLogWriter method was updated to take in a
configuration struct which configures the LogWriter with the specified
callbacks.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@coolcom200 coolcom200 changed the title metrics: remove deprecated InternalIntervalMetrics method metric: expose pebble fsync latency with a callback Oct 11, 2022
Copy link
Contributor Author

@coolcom200 coolcom200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 4 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions


options.go line 762 at r2 (raw file):

	WALMinSyncInterval func() time.Duration

	OnMetrics MetricCallbacks

This needs a comment


record/log_writer.go line 307 at r2 (raw file):

// NewLogWriter returns a new LogWriter.
func NewLogWriter(w io.Writer, logNum base.FileNum, logWriterConfig LogWriterConfig) *LogWriter {

Should we move logNum into the logWriterConfig struct?


record/log_writer.go line 334 at r2 (raw file):

	f := &r.flusher
	f.minSyncInterval = logWriterConfig.WALMinSyncIntervalCallback
	f.onFsyncLatencyMetric = logWriterConfig.LogWriterFsyncLatencyCallback

This assignment used to be gaurded with a lock however it doesn't seem like this is needed in the NewLogWriter func

Code quote:

	f := &r.flusher
	f.minSyncInterval = logWriterConfig.WALMinSyncIntervalCallback
	f.onFsyncLatencyMetric = logWriterConfig.LogWriterFsyncLatencyCallback

@coolcom200 coolcom200 marked this pull request as ready for review October 11, 2022 13:56
@coolcom200 coolcom200 requested a review from jbowens October 11, 2022 13:57
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking great! some nits here and there, mostly around naming

Reviewed 4 of 4 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @coolcom200)


options.go line 762 at r2 (raw file):

Previously, coolcom200 (Leon Fattakhov) wrote…

This needs a comment

And how about MetricEventListener for the field name as well. The fields of the struct are sorted alphabetically, so we should also slot it in appropriately


options.go line 801 at r2 (raw file):

// MetricCallbacks is a struct that contains the callbacks that pebble will call
// in order to update clients.
type MetricCallbacks struct {

how about MetricEventListener to mirror EventListener?


options.go line 802 at r2 (raw file):

// in order to update clients.
type MetricCallbacks struct {
	LogWriterFsyncLatency func(duration int64)

let's call this WALFysncLatency, and let's add a comment explaining that since this event is called on every WAL fsync, implementations must have very low overhead to avoid impacting performance.


record/log_writer.go line 230 at r2 (raw file):

type durationFunc func() time.Duration
type recordValueFunc func(value int64)

let's make this take a time.Duration now. then if the user subscribing to fsync events wants to truncate precision to the microsecond, they can call .Microsecond() themselves.


record/log_writer.go line 299 at r2 (raw file):

type LogWriterConfig struct {
	WALMinSyncIntervalCallback    durationFunc
	LogWriterFsyncLatencyCallback recordValueFunc

nit: can we drop the Callback suffix from these fields? and rename LogWriterFsyncLatencyCallback to OnFsync?


record/log_writer.go line 307 at r2 (raw file):

Previously, coolcom200 (Leon Fattakhov) wrote…

Should we move logNum into the logWriterConfig struct?

Let's keep the required parameters like logNum here, and LogWriterConfig can hold the optional parameters.


record/log_writer.go line 334 at r2 (raw file):

Previously, coolcom200 (Leon Fattakhov) wrote…

This assignment used to be gaurded with a lock however it doesn't seem like this is needed in the NewLogWriter func

Yeah, that's right 👍


record/log_writer_test.go line 299 at r2 (raw file):

			return minSyncInterval
		},
		LogWriterFsyncLatencyCallback: nil,

nit: you can omit this line (unspecified fields in a struct value default to their zero value, in this case nil)


record/log_writer_test.go line 384 at r2 (raw file):

			err := syncLatencyMicros.RecordValue(duration)
			if err != nil {
				t.Fatal("Ran into error when recording values to histogram")

nit: no requirement to use it, but just an fyi, there's a require.NoError(t, err) func we use frequently in tests. both forms are totally fine, but you may find you prefer the shorthand of require.NoError

@coolcom200 coolcom200 force-pushed the 85755-expose-fsync-with-callback branch 2 times, most recently from e88f67e to 11c50de Compare October 11, 2022 17:28
Copy link
Contributor Author

@coolcom200 coolcom200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @jbowens)


options.go line 801 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

how about MetricEventListener to mirror EventListener?

Done.


options.go line 802 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

let's call this WALFysncLatency, and let's add a comment explaining that since this event is called on every WAL fsync, implementations must have very low overhead to avoid impacting performance.

Done.


record/log_writer.go line 230 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

let's make this take a time.Duration now. then if the user subscribing to fsync events wants to truncate precision to the microsecond, they can call .Microsecond() themselves.

Done.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 5 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @coolcom200)


options.go line 802 at r4 (raw file):

// MetricEventListener is a struct that contains the callbacks that pebble will
// call when metric related events occur

nitiest nit: missing a period


options.go line 804 at r4 (raw file):

// call when metric related events occur
type MetricEventListener struct {
	// WALFysncLatency is invoked after each WAL Fsync. As a result, this function

How about

// WALFysncLatency is invoked with the duration of every WAL Fsync. It is invoked
// synchronously within the WAL's flush loop, so this function should not run for an
// excessive amount of time to avoid stalling writes.

The fsync latency was computed using an HDR histogram however this
histogram generates too many buckets of various widths.  In order to
allow clients to store the fsync latency in their preferred format,  a
callback was introduced. This callback is defined on the
`pebble.Options` which enables clients to provide a callback function
which will be triggered each time pebble produces the fsync latency
metrics.

Additionally the `record.NewLogWriter` method was updated to take in a
configuration struct which configures the `LogWriter` with the specified
callbacks.
@coolcom200 coolcom200 force-pushed the 85755-expose-fsync-with-callback branch from 11c50de to f011973 Compare October 12, 2022 13:41
Copy link
Contributor Author

@coolcom200 coolcom200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!!!

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @jbowens)


options.go line 802 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nitiest nit: missing a period

Done.


options.go line 804 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

How about

// WALFysncLatency is invoked with the duration of every WAL Fsync. It is invoked
// synchronously within the WAL's flush loop, so this function should not run for an
// excessive amount of time to avoid stalling writes.

Done.

@coolcom200 coolcom200 merged commit 5fdb3ea into cockroachdb:master Oct 12, 2022
@coolcom200 coolcom200 deleted the 85755-expose-fsync-with-callback branch October 12, 2022 14:02
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Oct 17, 2022
88974: sql: add support for `DELETE FROM ... USING` r=faizaanmadhani a=faizaanmadhani

See commit messages for details.

Resolves: #40963

89459: metrics: expose pebble flush utilization r=jbowens a=coolcom200

Create a new `GaugeFloat64` metric for pebble’s flush utilization. This
metric is not cumulative, rather, it is the metric over an interval.
This interval is determined by the `interval` parameter of the
`Node.startComputePeriodicMetrics` method.

In order to compute the metric over an interval the previous value of
the metric must be stored. As a result, a map is constructed that takes
a pointer to a store and maps it to a pointer to storage metrics:
`make(map[*kvserver.Store]*storage.Metrics)`. This map is passed to
`node.computeMetricsPeriodically` which gets the store to calculate its
metrics and then updates the previous metrics in the map.

Refactor `store.go`'s metric calculation by separating
`ComputeMetrics(ctx context.Context, tick int) error` into two methods:

* `ComputeMetrics(ctx context.Context) error`
* `ComputeMetricsPeriodically(ctx context.Context, prevMetrics
  *storage.Metrics, tick int) (m storage.Metrics, err error)`

Both methods call the `computeMetrics` which contains the common code
between the two calls. Before this, the process for retrieving metrics
instantaneous was to pass a tick value such as `-1` or `0` to the
`ComputeMetrics(ctx context.Context, tick int)` however it can be
done with a call to `ComputeMetrics(ctx context.Context)`

The `store.ComputeMetricsPeriodically` method will also return the
latest storage metrics. These metrics are used to update the mapping
between stores and metrics used for computing the metric delta over an
interval.

Release Note: None

Resolves part of #85755
Depends on #88972, cockroachdb/pebble#2001
Epic: CRDB-17515


89656: roachtest: introduce admission-control/elastic-cdc r=irfansharif a=irfansharif

Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of its CPU/scan-heavy nature -- it can elevate CPU scheduling latencies which in turn translates to an increase in foreground latency.

Also in this commit: routing std{err,out} from prometheus/grafana setup that roachtests do to the logger in scope.

Release note: None

Co-authored-by: Faizaan Madhani <[email protected]>
Co-authored-by: Leon Fattakhov <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants