metrics: compute `InternalIntervalMetrics` without making calls to `GetInternalIntervalMetrics` #88972

coolcom200 · 2022-09-28T22:21:54Z

Remove dependence on the GetInternalIntervalMetrics as this method resets the interval following each call. This results in issues if two clients interact with GetInternalIntervalMetrics as they will reset the other client’s interval. Ex: Client A calls every 5 seconds and Client B calls every 10 seconds. Client B will end up getting metrics at a 5-second interval since GetInternalIntervalMetrics resets the interval on each call.

The admission.StoreMetrics are updated to remove the InternalIntervalField as these metrics need to be computed based off the current metrics admission.StoreMetrics.Metrics and the previous metrics. The storage of the previous metrics is left up to the client. Removing the field was chosen instead of setting it to a nil value as it would be confusing to return the StoreMetrics and require the client to populate the field when they compute the necessary metrics.

For the admission control use case, the flush metrics are computed as part of ioLoadListener.adjustTokens with the previous metrics being initialized in ioLoadListener.pebbleMetricsTick and stored on the ioLoadListenerState as cumFlushWriteThroughput.

Additionally, make a copy of the pebble.InternalIntervalMetrics in order to move away from using the pebble version of the interval metrics as these metrics are now the responsibility of the caller to compute.

Release note: None
Epic: CRDB-17515

cockroach-teamcity · 2022-09-28T22:22:05Z

This change is

sumeerbhola

Reviewed 1 of 6 files at r1, 6 of 8 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @coolcom200 and @irfansharif)

-- commits line 18 at r2:
nit: necessary

-- commits line 23 at r2:
nit: goroutine

pkg/util/admission/granter.go line 651 at r2 (raw file):

}

// IntervalMetrics exposes metrics about internal subsystems, that can

Let's change the name to storeInternalIntervalMetrics (also making it package private).
Also simplify the comment to
// storeInternalIntervalMetrics exposes metrics about internal subsystems in the store,
// that are useful for admission control. These represent the metrics over ...

pkg/util/admission/granter.go line 659 at r2 (raw file):

type IntervalMetrics struct {
	// LogWriter metrics.
	LogWriter struct {

make all of these package private

pkg/util/admission/granter.go line 686 at r2 (raw file):

// StoreAndIntervalMetrics is a wrapper around StoreMetrics that also contains
// the IntervalMetrics
type StoreAndIntervalMetrics struct {

storeAndIntervalMetrics since it should be package private.

pkg/util/admission/granter.go line 687 at r2 (raw file):

// the IntervalMetrics
type StoreAndIntervalMetrics struct {
	*StoreMetrics

StoreMetrics is being passed by value to pebbleMetricsTick. We should keep it as value and not a pointer.

But let's step back for a minute. The diff computation in StoreGrantCoordinators isn't a good place since it has to keep state about each store and all the slice index complexity (and not quite correct, since the index is not required to be stable every time we fetch the metrics).

We could move it to GrantCoordinator.pebbleMetricsTick since there is a GrantCoordinator per store. But this isn't really the role of GrantCoordinator, which isn't supposed to know about the details about store metrics. The ioLoadListener already converts cumulative to deltas for all the stuff it computes over, so that is the right place. Currently it does it in a somewhat adhoc manner and I would suggest something simpler for your change.

Define the following in io_load_listener.go (only the flush throughput for now, since that's all we use in admission control adjustTokensInner. We can add more later if needed)

// storeInternalMetrics can be used to represent both cumulative and interval state for internal store subsystems.
type storeInternalMetrics {
	// Flush loop metrics.
	Flush struct {
		// WriteThroughput is the flushing throughput.
		WriteThroughput pebble.ThroughputMetric
	}
}

type ioLoadListenerState {
     ...

     // Cumulative
     cumStoreInternalMetrics storeInternalMetrics
}

Put this early in ioLoadListernerState with all the other cumulative fields.

In ioLoadListener.pebbleMetricsTick in the initialization if-block copy this out. In the non-initialization path, do the subtraction and pass the delta as an explicit parameter to ioLoadListener.adjustTokens.

There are various tests that are currently calling pebbleMetricsTick and those tests are assuming they should pass a delta for this flush throughput, but now the callee assumes it is being given a cumulative. Luckily most of these tests were passing an empty delta, so they are not affected. The only non-superficial test change will be in TestIOLoadListener. See where that test has the following code

				var flushBytes, flushWorkSec, flushIdleSec int
				if d.HasArg("flush-bytes") {
					d.ScanArgs(t, "flush-bytes", &flushBytes)
					d.ScanArgs(t, "flush-work-sec", &flushWorkSec)
					d.ScanArgs(t, "flush-idle-sec", &flushIdleSec)
				}
				flushMetric := pebble.ThroughputMetric{
					Bytes:        int64(flushBytes),
					WorkDuration: time.Duration(flushWorkSec) * time.Second,
					IdleDuration: time.Duration(flushIdleSec) * time.Second,
				}
				im := &pebble.InternalIntervalMetrics{}
				im.Flush.WriteThroughput = flushMetric

Let's keep the testdata files unchanged, so that the testdata is still providing us interval values. So the test code needs to convert these interval values to cumulative. It can do so easily by keeping additional 3 variables cumFlushBytes, cumFlushWorkSec, cumFlushIdleSec declared above the call to datadriven.RunTest. Then add what the testcase provided to the current cumulative values and pass those in. Whenever the testcase does "init", to create a new ioLoadListener, set these cumulative values back to 0.

coolcom200

Reviewed 2 of 6 files at r1, 8 of 8 files at r2, 7 of 7 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)

pkg/util/admission/granter.go line 651 at r2 (raw file):

Previously, sumeerbhola wrote…

Let's change the name to storeInternalIntervalMetrics (also making it package private).
Also simplify the comment to
// storeInternalIntervalMetrics exposes metrics about internal subsystems in the store,
// that are useful for admission control. These represent the metrics over ...

Done.

pkg/util/admission/granter.go line 659 at r2 (raw file):

Previously, sumeerbhola wrote…

make all of these package private

Done.

pkg/util/admission/granter.go line 686 at r2 (raw file):

Previously, sumeerbhola wrote…

storeAndIntervalMetrics since it should be package private.

Done.

pkg/util/admission/granter.go line 687 at r2 (raw file):

Previously, sumeerbhola wrote…

StoreMetrics is being passed by value to pebbleMetricsTick. We should keep it as value and not a pointer.

But let's step back for a minute. The diff computation in StoreGrantCoordinators isn't a good place since it has to keep state about each store and all the slice index complexity (and not quite correct, since the index is not required to be stable every time we fetch the metrics).

We could move it to GrantCoordinator.pebbleMetricsTick since there is a GrantCoordinator per store. But this isn't really the role of GrantCoordinator, which isn't supposed to know about the details about store metrics. The ioLoadListener already converts cumulative to deltas for all the stuff it computes over, so that is the right place. Currently it does it in a somewhat adhoc manner and I would suggest something simpler for your change.

Define the following in io_load_listener.go (only the flush throughput for now, since that's all we use in admission control adjustTokensInner. We can add more later if needed)
// storeInternalMetrics can be used to represent both cumulative and interval state for internal store subsystems.
type storeInternalMetrics {
	// Flush loop metrics.
	Flush struct {
		// WriteThroughput is the flushing throughput.
		WriteThroughput pebble.ThroughputMetric
	}
}
type ioLoadListenerState {
     ...

     // Cumulative
     cumStoreInternalMetrics storeInternalMetrics
}
Put this early in ioLoadListernerState with all the other cumulative fields.

In ioLoadListener.pebbleMetricsTick in the initialization if-block copy this out. In the non-initialization path, do the subtraction and pass the delta as an explicit parameter to ioLoadListener.adjustTokens.

There are various tests that are currently calling pebbleMetricsTick and those tests are assuming they should pass a delta for this flush throughput, but now the callee assumes it is being given a cumulative. Luckily most of these tests were passing an empty delta, so they are not affected. The only non-superficial test change will be in TestIOLoadListener. See where that test has the following code
				var flushBytes, flushWorkSec, flushIdleSec int
				if d.HasArg("flush-bytes") {
					d.ScanArgs(t, "flush-bytes", &flushBytes)
					d.ScanArgs(t, "flush-work-sec", &flushWorkSec)
					d.ScanArgs(t, "flush-idle-sec", &flushIdleSec)
				}
				flushMetric := pebble.ThroughputMetric{
					Bytes:        int64(flushBytes),
					WorkDuration: time.Duration(flushWorkSec) * time.Second,
					IdleDuration: time.Duration(flushIdleSec) * time.Second,
				}
				im := &pebble.InternalIntervalMetrics{}
				im.Flush.WriteThroughput = flushMetric
Let's keep the testdata files unchanged, so that the testdata is still providing us interval values. So the test code needs to convert these interval values to cumulative. It can do so easily by keeping additional 3 variables cumFlushBytes, cumFlushWorkSec, cumFlushIdleSec declared above the call to datadriven.RunTest. Then add what the testcase provided to the current cumulative values and pass those in. Whenever the testcase does "init", to create a new ioLoadListener, set these cumulative values back to 0.

Oh this is a way better way to organize the code! Thank you for bringing this up I have updated it!

sumeerbhola

Reviewed 2 of 7 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @coolcom200, @irfansharif, and @sumeerbhola)

-- commits line 23 at r4:
Reminder to update the PR description to be the same as this commit.

pkg/util/admission/io_load_listener.go line 175 at r4 (raw file):

// store, that are useful for admission control. These represent the metrics
// over the interval of time from the last call to retrieve these metrics. These
// are not cumulative, unlike Metrics.

s/storeInternalIntervalMetrics/storeInternalMetrics

And change the comment to say something like:
These can represent either cumulative metrics, or over an interval of time. The latter is possible since the data representation here allows for subtraction between two cumulative storeInternalMetrics.

Actually, since you are calling adjustTokensInner with pebble.ThroughputMetric, just remove this wrapper struct. It is not adding any value, and we are not preserving the wrapping through the method calls. Then we simply have cumFlushWriteThroughput as a field in ioLoadListenerState.

sumeerbhola

Reviewed all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @coolcom200, @irfansharif, and @sumeerbhola)

pkg/util/admission/io_load_listener.go line 313 at r5 (raw file):

		io.diskBW.bytesWritten = metrics.DiskStats.BytesWritten
		io.diskBW.incomingLSMBytes = cumLSMIncomingBytes
		io.setCumulativeStoreInternalIntervalMetrics(metrics)

isn't this simply io.cumFlushWriteThroughput = metrics.Flush.WriteThroughput?

Remove dependence on the `GetInternalIntervalMetrics` as this method resets the interval following each call. This results in issues if two clients interact with `GetInternalIntervalMetrics` as they will reset the other client’s interval. Ex: Client A calls every 5 seconds and Client B calls every 10 seconds. Client B will end up getting metrics at a 5-second interval since `GetInternalIntervalMetrics` resets the interval on each call. The `admission.StoreMetrics` are updated to remove the `InternalIntervalField` as these metrics need to be computed based off the current metrics `admission.StoreMetrics.Metrics` and the previous metrics. The storage of the previous metrics is left up to the client. Removing the field was chosen instead of setting it to a `nil` value as it would be confusing to return the `StoreMetrics` and require the client to populate the field when they compute the necessary metrics. For the admission control use case, the flush metrics are computed as part of `ioLoadListener.adjustTokens` with the previous metrics being initialized in `ioLoadListener.pebbleMetricsTick` and stored on the `ioLoadListenerState` as `cumFlushWriteThroughput`. Additionally, make a copy of the `pebble.InternalIntervalMetrics` in order to move away from using the `pebble` version of the interval metrics as these metrics are now the responsibility of the caller to compute. Release note: None

coolcom200

Reviewed 1 of 2 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @sumeerbhola)

-- commits line 23 at r4:

Previously, sumeerbhola wrote…

Reminder to update the PR description to be the same as this commit.

Done.

pkg/util/admission/io_load_listener.go line 175 at r4 (raw file):

Previously, sumeerbhola wrote…

s/storeInternalIntervalMetrics/storeInternalMetrics

And change the comment to say something like:
These can represent either cumulative metrics, or over an interval of time. The latter is possible since the data representation here allows for subtraction between two cumulative storeInternalMetrics.

Actually, since you are calling adjustTokensInner with pebble.ThroughputMetric, just remove this wrapper struct. It is not adding any value, and we are not preserving the wrapping through the method calls. Then we simply have cumFlushWriteThroughput as a field in ioLoadListenerState.

Done.

pkg/util/admission/io_load_listener.go line 313 at r5 (raw file):

Previously, sumeerbhola wrote…

isn't this simply io.cumFlushWriteThroughput = metrics.Flush.WriteThroughput?

Yes it is thank you for pointing this out! Updated!

sumeerbhola

Reviewed 1 of 2 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)

coolcom200 · 2022-10-11T13:23:45Z

bors r+

craig · 2022-10-11T14:31:13Z

Build succeeded:

Bazel Essential CI (Cockroach)

88974: sql: add support for `DELETE FROM ... USING` r=faizaanmadhani a=faizaanmadhani See commit messages for details. Resolves: #40963 89459: metrics: expose pebble flush utilization r=jbowens a=coolcom200 Create a new `GaugeFloat64` metric for pebble’s flush utilization. This metric is not cumulative, rather, it is the metric over an interval. This interval is determined by the `interval` parameter of the `Node.startComputePeriodicMetrics` method. In order to compute the metric over an interval the previous value of the metric must be stored. As a result, a map is constructed that takes a pointer to a store and maps it to a pointer to storage metrics: `make(map[*kvserver.Store]*storage.Metrics)`. This map is passed to `node.computeMetricsPeriodically` which gets the store to calculate its metrics and then updates the previous metrics in the map. Refactor `store.go`'s metric calculation by separating `ComputeMetrics(ctx context.Context, tick int) error` into two methods: * `ComputeMetrics(ctx context.Context) error` * `ComputeMetricsPeriodically(ctx context.Context, prevMetrics *storage.Metrics, tick int) (m storage.Metrics, err error)` Both methods call the `computeMetrics` which contains the common code between the two calls. Before this, the process for retrieving metrics instantaneous was to pass a tick value such as `-1` or `0` to the `ComputeMetrics(ctx context.Context, tick int)` however it can be done with a call to `ComputeMetrics(ctx context.Context)` The `store.ComputeMetricsPeriodically` method will also return the latest storage metrics. These metrics are used to update the mapping between stores and metrics used for computing the metric delta over an interval. Release Note: None Resolves part of #85755 Depends on #88972, cockroachdb/pebble#2001 Epic: CRDB-17515 89656: roachtest: introduce admission-control/elastic-cdc r=irfansharif a=irfansharif Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of its CPU/scan-heavy nature -- it can elevate CPU scheduling latencies which in turn translates to an increase in foreground latency. Also in this commit: routing std{err,out} from prometheus/grafana setup that roachtests do to the logger in scope. Release note: None Co-authored-by: Faizaan Madhani <[email protected]> Co-authored-by: Leon Fattakhov <[email protected]> Co-authored-by: irfan sharif <[email protected]>

coolcom200 force-pushed the migrate-off-internalintervalmetrics branch 3 times, most recently from a9d9a7a to c72f41d Compare October 4, 2022 13:46

coolcom200 marked this pull request as ready for review October 4, 2022 13:49

coolcom200 requested review from a team as code owners October 4, 2022 13:49

irfansharif self-requested a review October 4, 2022 21:43

sumeerbhola requested changes Oct 5, 2022

View reviewed changes

coolcom200 force-pushed the migrate-off-internalintervalmetrics branch from c72f41d to ae1cd0f Compare October 5, 2022 21:47

coolcom200 commented Oct 5, 2022

View reviewed changes

coolcom200 force-pushed the migrate-off-internalintervalmetrics branch from ae1cd0f to a6b9572 Compare October 5, 2022 22:00

coolcom200 mentioned this pull request Oct 6, 2022

metrics: expose pebble flush utilization #89459

Merged

sumeerbhola approved these changes Oct 6, 2022

View reviewed changes

coolcom200 force-pushed the migrate-off-internalintervalmetrics branch 2 times, most recently from 800310e to 1e07034 Compare October 6, 2022 18:20

sumeerbhola requested changes Oct 6, 2022

View reviewed changes

coolcom200 force-pushed the migrate-off-internalintervalmetrics branch from 1e07034 to ce3aff4 Compare October 6, 2022 20:21

coolcom200 commented Oct 10, 2022

View reviewed changes

sumeerbhola approved these changes Oct 10, 2022

View reviewed changes

craig bot merged commit b0267f6 into cockroachdb:master Oct 11, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metrics: compute `InternalIntervalMetrics` without making calls to `GetInternalIntervalMetrics` #88972

metrics: compute `InternalIntervalMetrics` without making calls to `GetInternalIntervalMetrics` #88972

coolcom200 commented Sep 28, 2022 •

edited

Loading

cockroach-teamcity commented Sep 28, 2022

sumeerbhola left a comment

coolcom200 left a comment

sumeerbhola left a comment

sumeerbhola left a comment

coolcom200 left a comment

sumeerbhola left a comment

coolcom200 commented Oct 11, 2022

craig bot commented Oct 11, 2022

metrics: compute InternalIntervalMetrics without making calls to GetInternalIntervalMetrics #88972

metrics: compute InternalIntervalMetrics without making calls to GetInternalIntervalMetrics #88972

Conversation

coolcom200 commented Sep 28, 2022 • edited Loading

cockroach-teamcity commented Sep 28, 2022

sumeerbhola left a comment

Choose a reason for hiding this comment

coolcom200 left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

coolcom200 left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

coolcom200 commented Oct 11, 2022

craig bot commented Oct 11, 2022

metrics: compute `InternalIntervalMetrics` without making calls to `GetInternalIntervalMetrics` #88972

metrics: compute `InternalIntervalMetrics` without making calls to `GetInternalIntervalMetrics` #88972

coolcom200 commented Sep 28, 2022 •

edited

Loading