storage: export info in pebble.InternalIntervalMetrics as CockroachDB metrics #85755

Closed
sumeerbhola opened this issue Aug 8, 2022 · 2 comments
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-starter Might be suitable for a starter project for new employees or team members. T-storage Storage Team

Comments

@sumeerbhola
Collaborator

sumeerbhola commented Aug 8, 2022

These would help us understand bottlenecks and slowness inside Pebble. Specifically, I think it would be useful to have

  • fsync latency: our only current metric is higher-level (raft.process.logcommit.latency) and does not give us a direct view into the fsync latency Pebble is observing.
  • Flush utilization: This can be computed as ThroughputMetric's WorkDuration/(WorkDuration+IdleDuration) and will give us a better understanding of write stalls due to high memtable count.

There is additional info in InternalIntervalMetrics that may also be helpful, like the various LogWriter utilization values, though I don't know of any tickets where they could have been helpful.

```go
type InternalIntervalMetrics struct {
	// LogWriter metrics.
	LogWriter struct {
		// WriteThroughput is the WAL throughput.
		WriteThroughput ThroughputMetric
		// PendingBufferUtilization is the utilization of the WAL writer's
		// finite-sized pending blocks buffer. It provides an additional signal
		// regarding how close to "full" the WAL writer is. The value is in the
		// interval [0,1].
		PendingBufferUtilization float64
		// SyncQueueUtilization is the utilization of the WAL writer's
		// finite-sized queue of work that is waiting to sync. The value is in the
		// interval [0,1].
		SyncQueueUtilization float64
		// SyncLatencyMicros is a distribution of the fsync latency observed by
		// the WAL writer. It can be nil if there were no fsyncs.
		SyncLatencyMicros *hdrhistogram.Histogram
	}
	// Flush loop metrics.
	Flush struct {
		// WriteThroughput is the flushing throughput.
		WriteThroughput ThroughputMetric
	}
	// NB: the LogWriter throughput and the Flush throughput are not directly
	// comparable because the former does not compress, unlike the latter.
}
```
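
Since `SyncLatencyMicros` can be nil when there were no fsyncs, any exporter would need a nil check before reading it. A rough sketch of that guard follows. The `latencyHistogram` type and its `valueAtQuantile` method are hypothetical stand-ins that keep raw samples; they are not the `hdrhistogram` API, and the nearest-rank quantile is only there to keep the example self-contained.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// latencyHistogram is a hypothetical stand-in for the real histogram type.
// It stores raw samples in microseconds rather than HDR buckets.
type latencyHistogram struct {
	samplesMicros []int64
}

// valueAtQuantile returns the sample at quantile q (0 to 100) using the
// nearest-rank method.
func (h *latencyHistogram) valueAtQuantile(q float64) int64 {
	if len(h.samplesMicros) == 0 {
		return 0
	}
	sorted := append([]int64(nil), h.samplesMicros...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(math.Ceil(q/100*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

// p99SyncLatencyMicros returns the 99th percentile fsync latency and false
// when the histogram is nil, i.e. no fsyncs happened in the interval.
func p99SyncLatencyMicros(h *latencyHistogram) (int64, bool) {
	if h == nil {
		return 0, false
	}
	return h.valueAtQuantile(99), true
}

func main() {
	h := &latencyHistogram{samplesMicros: []int64{300, 100, 5000, 200}}
	if p99, ok := p99SyncLatencyMicros(h); ok {
		fmt.Printf("p99 fsync latency: %dµs\n", p99)
	}
	if _, ok := p99SyncLatencyMicros(nil); !ok {
		fmt.Println("no fsyncs in this interval")
	}
}
```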

Jira issue: CRDB-18424

Epic CRDB-20293

@sumeerbhola sumeerbhola added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Aug 8, 2022
@jbowens jbowens self-assigned this Aug 29, 2022
@jbowens
Collaborator

jbowens commented Aug 29, 2022

Eyeing this as a starter project for Leon.

@jbowens jbowens added the E-starter Might be suitable for a starter project for new employees or team members. label Aug 29, 2022
@jbowens jbowens assigned coolcom200 and unassigned jbowens Sep 12, 2022
craig bot pushed a commit that referenced this issue Oct 17, 2022
88974: sql: add support for `DELETE FROM ... USING` r=faizaanmadhani a=faizaanmadhani

See commit messages for details.

Resolves: #40963

89459: metrics: expose pebble flush utilization r=jbowens a=coolcom200

Create a new `GaugeFloat64` metric for pebble’s flush utilization. This
metric is not cumulative; rather, it is computed over an interval.
This interval is determined by the `interval` parameter of the
`Node.startComputePeriodicMetrics` method.

To compute the metric over an interval, the previous value of the
metric must be stored. To that end, a map is constructed from a pointer
to a store to a pointer to storage metrics:
`make(map[*kvserver.Store]*storage.Metrics)`. This map is passed to
`node.computeMetricsPeriodically`, which has each store calculate its
metrics and then updates the previous metrics in the map.

Refactor `store.go`'s metric calculation by separating
`ComputeMetrics(ctx context.Context, tick int) error` into two methods:

* `ComputeMetrics(ctx context.Context) error`
* `ComputeMetricsPeriodically(ctx context.Context, prevMetrics
  *storage.Metrics, tick int) (m storage.Metrics, err error)`

Both methods call `computeMetrics`, which contains the code common to
the two calls. Previously, retrieving metrics instantaneously required
passing a tick value such as `-1` or `0` to
`ComputeMetrics(ctx context.Context, tick int)`; now it can be
done with a call to `ComputeMetrics(ctx context.Context)`.

The `store.ComputeMetricsPeriodically` method will also return the
latest storage metrics. These metrics are used to update the mapping
between stores and metrics used for computing the metric delta over an
interval.

Release Note: None

Resolves part of #85755
Depends on #88972, cockroachdb/pebble#2001
Epic: CRDB-17515


89656: roachtest: introduce admission-control/elastic-cdc r=irfansharif a=irfansharif

Part of #89208. This test sets up a 3-node CRDB cluster on 8vCPU machines running 1000-warehouse TPC-C, and kicks off a few changefeed backfills concurrently. We've observed latency spikes during backfills because of their CPU/scan-heavy nature: they can elevate CPU scheduling latencies, which in turn translates to an increase in foreground latency.

Also in this commit: route std{err,out} from the prometheus/grafana setup that roachtests perform to the logger in scope.

Release note: None

Co-authored-by: Faizaan Madhani
Co-authored-by: Leon Fattakhov
Co-authored-by: irfan sharif
@coolcom200
Contributor

Closing as #90082 and #89459 are both merged.

3 participants