Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: compute InternalIntervalMetrics without making calls to GetInternalIntervalMetrics #88972

Merged

Conversation

coolcom200
Copy link
Contributor

@coolcom200 coolcom200 commented Sep 28, 2022

Remove dependence on the GetInternalIntervalMetrics as this method resets the interval following each call. This results in issues if two clients interact with GetInternalIntervalMetrics as they will reset the other client’s interval. Ex: Client A calls every 5 seconds and Client B calls every 10 seconds. Client B will end up getting metrics at a 5-second interval since GetInternalIntervalMetrics resets the interval on each call.

The admission.StoreMetrics are updated to remove the InternalIntervalField as these metrics need to be computed based off the current metrics admission.StoreMetrics.Metrics and the previous metrics. The storage of the previous metrics is left up to the client. Removing the field was chosen instead of setting it to a nil value as it would be confusing to return the StoreMetrics and require the client to populate the field when they compute the necessary metrics.

For the admission control use case, the flush metrics are computed as part of ioLoadListener.adjustTokens with the previous metrics being initialized in ioLoadListener.pebbleMetricsTick and stored on the ioLoadListenerState as cumFlushWriteThroughput.

Additionally, make a copy of the pebble.InternalIntervalMetrics in order to move away from using the pebble version of the interval metrics as these metrics are now the responsibility of the caller to compute.

Release note: None
Epic: CRDB-17515

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@coolcom200 coolcom200 force-pushed the migrate-off-internalintervalmetrics branch 3 times, most recently from a9d9a7a to c72f41d Compare October 4, 2022 13:46
@coolcom200 coolcom200 marked this pull request as ready for review October 4, 2022 13:49
@coolcom200 coolcom200 requested review from a team as code owners October 4, 2022 13:49
@irfansharif irfansharif self-requested a review October 4, 2022 21:43
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r1, 6 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @coolcom200 and @irfansharif)


-- commits line 18 at r2:
nit: necessary


-- commits line 23 at r2:
nit: goroutine


pkg/util/admission/granter.go line 651 at r2 (raw file):

}

// IntervalMetrics exposes metrics about internal subsystems, that can

Let's change the name to storeInternalIntervalMetrics (also making it package private).
Also simplify the comment to
// storeInternalIntervalMetrics exposes metrics about internal subsystems in the store,
// that are useful for admission control. These represent the metrics over ...


pkg/util/admission/granter.go line 659 at r2 (raw file):

type IntervalMetrics struct {
	// LogWriter metrics.
	LogWriter struct {

make all of these package private


pkg/util/admission/granter.go line 686 at r2 (raw file):

// StoreAndIntervalMetrics is a wrapper around StoreMetrics that also contains
// the IntervalMetrics
type StoreAndIntervalMetrics struct {

storeAndIntervalMetrics since it should be package private.


pkg/util/admission/granter.go line 687 at r2 (raw file):

// the IntervalMetrics
type StoreAndIntervalMetrics struct {
	*StoreMetrics

StoreMetrics is being passed by value to pebbleMetricsTick. We should keep it as value and not a pointer.

But let's step back for a minute. The diff computation in StoreGrantCoordinators isn't a good place since it has to keep state about each store and all the slice index complexity (and not quite correct, since the index is not required to be stable every time we fetch the metrics).

We could move it to GrantCoordinator.pebbleMetricsTick since there is a GrantCoordinator per store. But this isn't really the role of GrantCoordinator, which isn't supposed to know about the details about store metrics. The ioLoadListener already converts cumulative to deltas for all the stuff it computes over, so that is the right place. Currently it does it in a somewhat adhoc manner and I would suggest something simpler for your change.

Define the following in io_load_listener.go (only the flush throughput for now, since that's all we use in admission control adjustTokensInner. We can add more later if needed)

// storeInternalMetrics can be used to represent both cumulative and interval state for internal store subsystems.
type storeInternalMetrics {
	// Flush loop metrics.
	Flush struct {
		// WriteThroughput is the flushing throughput.
		WriteThroughput pebble.ThroughputMetric
	}
}
type ioLoadListenerState {
     ...

     // Cumulative
     cumStoreInternalMetrics storeInternalMetrics
}

Put this early in ioLoadListernerState with all the other cumulative fields.

In ioLoadListener.pebbleMetricsTick in the initialization if-block copy this out. In the non-initialization path, do the subtraction and pass the delta as an explicit parameter to ioLoadListener.adjustTokens.

There are various tests that are currently calling pebbleMetricsTick and those tests are assuming they should pass a delta for this flush throughput, but now the callee assumes it is being given a cumulative. Luckily most of these tests were passing an empty delta, so they are not affected. The only non-superficial test change will be in TestIOLoadListener. See where that test has the following code

				var flushBytes, flushWorkSec, flushIdleSec int
				if d.HasArg("flush-bytes") {
					d.ScanArgs(t, "flush-bytes", &flushBytes)
					d.ScanArgs(t, "flush-work-sec", &flushWorkSec)
					d.ScanArgs(t, "flush-idle-sec", &flushIdleSec)
				}
				flushMetric := pebble.ThroughputMetric{
					Bytes:        int64(flushBytes),
					WorkDuration: time.Duration(flushWorkSec) * time.Second,
					IdleDuration: time.Duration(flushIdleSec) * time.Second,
				}
				im := &pebble.InternalIntervalMetrics{}
				im.Flush.WriteThroughput = flushMetric

Let's keep the testdata files unchanged, so that the testdata is still providing us interval values. So the test code needs to convert these interval values to cumulative. It can do so easily by keeping additional 3 variables cumFlushBytes, cumFlushWorkSec, cumFlushIdleSec declared above the call to datadriven.RunTest. Then add what the testcase provided to the current cumulative values and pass those in. Whenever the testcase does "init", to create a new ioLoadListener, set these cumulative values back to 0.

@coolcom200 coolcom200 force-pushed the migrate-off-internalintervalmetrics branch from c72f41d to ae1cd0f Compare October 5, 2022 21:47
Copy link
Contributor Author

@coolcom200 coolcom200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 6 files at r1, 8 of 8 files at r2, 7 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @sumeerbhola)


pkg/util/admission/granter.go line 651 at r2 (raw file):

Previously, sumeerbhola wrote…

Let's change the name to storeInternalIntervalMetrics (also making it package private).
Also simplify the comment to
// storeInternalIntervalMetrics exposes metrics about internal subsystems in the store,
// that are useful for admission control. These represent the metrics over ...

Done.


pkg/util/admission/granter.go line 659 at r2 (raw file):

Previously, sumeerbhola wrote…

make all of these package private

Done.


pkg/util/admission/granter.go line 686 at r2 (raw file):

Previously, sumeerbhola wrote…

storeAndIntervalMetrics since it should be package private.

Done.


pkg/util/admission/granter.go line 687 at r2 (raw file):

Previously, sumeerbhola wrote…

StoreMetrics is being passed by value to pebbleMetricsTick. We should keep it as value and not a pointer.

But let's step back for a minute. The diff computation in StoreGrantCoordinators isn't a good place since it has to keep state about each store and all the slice index complexity (and not quite correct, since the index is not required to be stable every time we fetch the metrics).

We could move it to GrantCoordinator.pebbleMetricsTick since there is a GrantCoordinator per store. But this isn't really the role of GrantCoordinator, which isn't supposed to know about the details about store metrics. The ioLoadListener already converts cumulative to deltas for all the stuff it computes over, so that is the right place. Currently it does it in a somewhat adhoc manner and I would suggest something simpler for your change.

Define the following in io_load_listener.go (only the flush throughput for now, since that's all we use in admission control adjustTokensInner. We can add more later if needed)

// storeInternalMetrics can be used to represent both cumulative and interval state for internal store subsystems.
type storeInternalMetrics {
	// Flush loop metrics.
	Flush struct {
		// WriteThroughput is the flushing throughput.
		WriteThroughput pebble.ThroughputMetric
	}
}
type ioLoadListenerState {
     ...

     // Cumulative
     cumStoreInternalMetrics storeInternalMetrics
}

Put this early in ioLoadListernerState with all the other cumulative fields.

In ioLoadListener.pebbleMetricsTick in the initialization if-block copy this out. In the non-initialization path, do the subtraction and pass the delta as an explicit parameter to ioLoadListener.adjustTokens.

There are various tests that are currently calling pebbleMetricsTick and those tests are assuming they should pass a delta for this flush throughput, but now the callee assumes it is being given a cumulative. Luckily most of these tests were passing an empty delta, so they are not affected. The only non-superficial test change will be in TestIOLoadListener. See where that test has the following code

				var flushBytes, flushWorkSec, flushIdleSec int
				if d.HasArg("flush-bytes") {
					d.ScanArgs(t, "flush-bytes", &flushBytes)
					d.ScanArgs(t, "flush-work-sec", &flushWorkSec)
					d.ScanArgs(t, "flush-idle-sec", &flushIdleSec)
				}
				flushMetric := pebble.ThroughputMetric{
					Bytes:        int64(flushBytes),
					WorkDuration: time.Duration(flushWorkSec) * time.Second,
					IdleDuration: time.Duration(flushIdleSec) * time.Second,
				}
				im := &pebble.InternalIntervalMetrics{}
				im.Flush.WriteThroughput = flushMetric

Let's keep the testdata files unchanged, so that the testdata is still providing us interval values. So the test code needs to convert these interval values to cumulative. It can do so easily by keeping additional 3 variables cumFlushBytes, cumFlushWorkSec, cumFlushIdleSec declared above the call to datadriven.RunTest. Then add what the testcase provided to the current cumulative values and pass those in. Whenever the testcase does "init", to create a new ioLoadListener, set these cumulative values back to 0.

Oh this is a way better way to organize the code! Thank you for bringing this up I have updated it!

@coolcom200 coolcom200 force-pushed the migrate-off-internalintervalmetrics branch from ae1cd0f to a6b9572 Compare October 5, 2022 22:00
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 7 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @coolcom200, @irfansharif, and @sumeerbhola)


-- commits line 23 at r4:
Reminder to update the PR description to be the same as this commit.


pkg/util/admission/io_load_listener.go line 175 at r4 (raw file):

// store, that are useful for admission control. These represent the metrics
// over the interval of time from the last call to retrieve these metrics. These
// are not cumulative, unlike Metrics.

s/storeInternalIntervalMetrics/storeInternalMetrics

And change the comment to say something like:
These can represent either cumulative metrics, or over an interval of time. The latter is possible since the data representation here allows for subtraction between two cumulative storeInternalMetrics.

Actually, since you are calling adjustTokensInner with pebble.ThroughputMetric, just remove this wrapper struct. It is not adding any value, and we are not preserving the wrapping through the method calls. Then we simply have cumFlushWriteThroughput as a field in ioLoadListenerState.

@coolcom200 coolcom200 force-pushed the migrate-off-internalintervalmetrics branch 2 times, most recently from 800310e to 1e07034 Compare October 6, 2022 18:20
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @coolcom200, @irfansharif, and @sumeerbhola)


pkg/util/admission/io_load_listener.go line 313 at r5 (raw file):

		io.diskBW.bytesWritten = metrics.DiskStats.BytesWritten
		io.diskBW.incomingLSMBytes = cumLSMIncomingBytes
		io.setCumulativeStoreInternalIntervalMetrics(metrics)

isn't this simply io.cumFlushWriteThroughput = metrics.Flush.WriteThroughput?

Remove dependence on the `GetInternalIntervalMetrics` as this method
resets the interval following each call. This results in issues if two
clients interact with `GetInternalIntervalMetrics` as they will reset
the other client’s interval. Ex: Client A calls every 5 seconds and
Client B calls every 10 seconds. Client B will end up getting metrics at
a 5-second interval since `GetInternalIntervalMetrics` resets the
interval on each call.

The `admission.StoreMetrics` are updated to remove the
`InternalIntervalField` as these metrics need to be computed based off
the current metrics `admission.StoreMetrics.Metrics` and the previous
metrics. The storage of the previous metrics is left up to the client.
Removing the field was chosen instead of setting it to a `nil` value as
it would be confusing to return the `StoreMetrics` and require the
client to populate the field when they compute the necessary metrics.

For the admission control use case, the flush metrics are computed as
part of `ioLoadListener.adjustTokens` with the previous metrics being
initialized in `ioLoadListener.pebbleMetricsTick` and stored on the
`ioLoadListenerState` as `cumFlushWriteThroughput`.

Additionally, make a copy of the `pebble.InternalIntervalMetrics` in
order to move away from using the `pebble` version of the interval
metrics as these metrics are now the responsibility of the caller to
compute.

Release note: None
@coolcom200 coolcom200 force-pushed the migrate-off-internalintervalmetrics branch from 1e07034 to ce3aff4 Compare October 6, 2022 20:21
Copy link
Contributor Author

@coolcom200 coolcom200 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @sumeerbhola)


-- commits line 23 at r4:

Previously, sumeerbhola wrote…

Reminder to update the PR description to be the same as this commit.

Done.


pkg/util/admission/io_load_listener.go line 175 at r4 (raw file):

Previously, sumeerbhola wrote…

s/storeInternalIntervalMetrics/storeInternalMetrics

And change the comment to say something like:
These can represent either cumulative metrics, or over an interval of time. The latter is possible since the data representation here allows for subtraction between two cumulative storeInternalMetrics.

Actually, since you are calling adjustTokensInner with pebble.ThroughputMetric, just remove this wrapper struct. It is not adding any value, and we are not preserving the wrapping through the method calls. Then we simply have cumFlushWriteThroughput as a field in ioLoadListenerState.

Done.


pkg/util/admission/io_load_listener.go line 313 at r5 (raw file):

Previously, sumeerbhola wrote…

isn't this simply io.cumFlushWriteThroughput = metrics.Flush.WriteThroughput?

Yes it is thank you for pointing this out! Updated!

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)

@coolcom200
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Oct 11, 2022

Build succeeded:

@craig craig bot merged commit b0267f6 into cockroachdb:master Oct 11, 2022
craig bot pushed a commit that referenced this pull request Oct 17, 2022
88974: sql: add support for `DELETE FROM ... USING` r=faizaanmadhani a=faizaanmadhani

See commit messages for details.

Resolves: #40963

89459: metrics: expose pebble flush utilization r=jbowens a=coolcom200

Create a new `GaugeFloat64` metric for pebble’s flush utilization. This
metric is not cumulative, rather, it is the metric over an interval.
This interval is determined by the `interval` parameter of the
`Node.startComputePeriodicMetrics` method.

In order to compute the metric over an interval the previous value of
the metric must be stored. As a result, a map is constructed that takes
a pointer to a store and maps it to a pointer to storage metrics:
`make(map[*kvserver.Store]*storage.Metrics)`. This map is passed to
`node.computeMetricsPeriodically` which gets the store to calculate its
metrics and then updates the previous metrics in the map.

Refactor `store.go`'s metric calculation by separating
`ComputeMetrics(ctx context.Context, tick int) error` into two methods:

* `ComputeMetrics(ctx context.Context) error`
* `ComputeMetricsPeriodically(ctx context.Context, prevMetrics
  *storage.Metrics, tick int) (m storage.Metrics, err error)`

Both methods call the `computeMetrics` which contains the common code
between the two calls. Before this, the process for retrieving metrics
instantaneous was to pass a tick value such as `-1` or `0` to the
`ComputeMetrics(ctx context.Context, tick int)` however it can be
done with a call to `ComputeMetrics(ctx context.Context)`

The `store.ComputeMetricsPeriodically` method will also return the
latest storage metrics. These metrics are used to update the mapping
between stores and metrics used for computing the metric delta over an
interval.

Release Note: None

Resolves part of #85755
Depends on #88972, cockroachdb/pebble#2001
Epic: CRDB-17515


89656: roachtest: introduce admission-control/elastic-cdc r=irfansharif a=irfansharif

Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of its CPU/scan-heavy nature -- it can elevate CPU scheduling latencies which in turn translates to an increase in foreground latency.

Also in this commit: routing std{err,out} from prometheus/grafana setup that roachtests do to the logger in scope.

Release note: None

Co-authored-by: Faizaan Madhani <[email protected]>
Co-authored-by: Leon Fattakhov <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants