storage: aggregate iterator stats #99726
Conversation
Force-pushed from a9f4953 to d725ca9.
Looks good! Had some comments; the most significant ones are:
- lock contention, which we might be catching a lot of here, and
- whether these metrics should all be counters.
An additional expert review of the smaller code changes for passing the stats around wouldn't hurt, though they looked straightforward enough.
@@ -2421,6 +2470,13 @@ func newStoreMetrics(histogramWindow time.Duration) *StoreMetrics {
RdbLevelScore: rdbLevelScore,
RdbWriteStalls: metric.NewGauge(metaRdbWriteStalls),
RdbWriteStallNanos: metric.NewGauge(metaRdbWriteStallNanos),
IterBlockBytes: metric.NewGauge(metaBlockBytes),
I was expecting them all to be counters. They are all gauges. What's going on?
// AggregatedIteratorStats holds cumulative stats, collected and summed over all
// of an engine's iterators.
type AggregatedIteratorStats struct {
It would be helpful to reference this type in the help text for each of the metrics as a pointer for more detailed semantics of what's tracked within each.
// directions performed on pebble.Iterators.
ExternalSteps int
// InternalSeeks is the total count of seeks in forward and backward
// directions performed on Pebble's internal iterator. If this is high
"If this is high..." could you copy this to the metric help text?
// Nexts, Prevs, NextPrefix, etc) in forward and backward directions
// performed on pebble's internal iterator. If this is high relative to
// ExternalSteps, it's a good indication that there's an accumulation of
// garbage within the LSM (NOT MVCC garbage).
ditto about "If this is high..."
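For reference, here is a rough consolidation of the struct quoted in fragments above. It is a sketch only: fields not visible in the quoted diff are inferred from the seven metric names listed in the commit message, so names and types are approximate.

```go
package storage // sketch; the authoritative definition lives in pkg/storage

import "time"

// AggregatedIteratorStats holds cumulative stats, collected and summed over
// all of an engine's iterators.
type AggregatedIteratorStats struct {
	// BlockBytes, BlockBytesInCache and BlockReadDuration describe the
	// data-block loads performed on behalf of iterators.
	BlockBytes        uint64
	BlockBytesInCache uint64
	BlockReadDuration time.Duration
	// ExternalSeeks and ExternalSteps count seeks and steps in forward and
	// backward directions performed on pebble.Iterators.
	ExternalSeeks int
	ExternalSteps int
	// InternalSeeks and InternalSteps count the corresponding operations on
	// Pebble's internal iterators. If these are high relative to the
	// external counts, it's a good indication that there's an accumulation
	// of garbage within the LSM (NOT MVCC garbage).
	InternalSeeks int
	InternalSteps int
}
```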
pkg/storage/pebble_iterator.go (outdated)
// reportStats is a function that should be invoked when the iterator is
// closed or its stats reset. It's intended to be used with
// Pebble.aggregateIterStats to sum iterator stats across all the iterators
// during the lifetime of the Engine.
and it can't be nil, right?
Force-pushed from d725ca9 to f71a4db.
> lock contention, which we might be catching a lot of here, and
> whether these metrics should all be counters.
Yeah, these two are related. I believe using a mutex rather than standard metric.Counters (which internally use atomic intrinsics) will be faster and have less contention. The critical section here is so tiny that the cost of the synchronization itself should dominate.
A single atomic increment intrinsic is faster than a mutex lock, add, and mutex unlock, but my understanding is that atomic intrinsics pay the cost of synchronization (eg, write barriers, cache line invalidation) per increment. Since we have seven counters to increment, the mutex overhead gets amortized over the seven increments.
eg, running this benchmark on my gceworker:
goos: linux
goarch: amd64
pkg: github.com/mutex
cpu: Intel(R) Xeon(R) CPU @ 2.30GHz
BenchmarkInc
BenchmarkInc/atomic
BenchmarkInc/atomic/parallelism=1
BenchmarkInc/atomic/parallelism=1-24 25120298 47.57 ns/op
BenchmarkInc/atomic/parallelism=2
BenchmarkInc/atomic/parallelism=2-24 4960356 255.0 ns/op
BenchmarkInc/atomic/parallelism=4
BenchmarkInc/atomic/parallelism=4-24 2159403 554.2 ns/op
BenchmarkInc/atomic/parallelism=8
BenchmarkInc/atomic/parallelism=8-24 1000000 1140 ns/op
BenchmarkInc/atomic/parallelism=16
BenchmarkInc/atomic/parallelism=16-24 476132 2577 ns/op
BenchmarkInc/atomic/parallelism=32
BenchmarkInc/atomic/parallelism=32-24 229970 5171 ns/op
BenchmarkInc/atomic/parallelism=64
BenchmarkInc/atomic/parallelism=64-24 117676 11069 ns/op
BenchmarkInc/mutex
BenchmarkInc/mutex/parallelism=1
BenchmarkInc/mutex/parallelism=1-24 49485908 24.13 ns/op
BenchmarkInc/mutex/parallelism=2
BenchmarkInc/mutex/parallelism=2-24 19753113 62.67 ns/op
BenchmarkInc/mutex/parallelism=4
BenchmarkInc/mutex/parallelism=4-24 5321437 229.2 ns/op
BenchmarkInc/mutex/parallelism=8
BenchmarkInc/mutex/parallelism=8-24 1689697 701.5 ns/op
BenchmarkInc/mutex/parallelism=16
BenchmarkInc/mutex/parallelism=16-24 693480 1733 ns/op
BenchmarkInc/mutex/parallelism=32
BenchmarkInc/mutex/parallelism=32-24 313561 4045 ns/op
BenchmarkInc/mutex/parallelism=64
BenchmarkInc/mutex/parallelism=64-24 144046 8279 ns/op
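To make the trade-off concrete, here is a minimal sketch of the two aggregation strategies being compared above (illustrative only; the type and field names are stand-ins rather than the PR's actual code):

```go
package iterstats

import (
	"sync"
	"sync/atomic"
)

// iterDeltas stands in for the per-iterator stats gathered when an iterator
// is closed.
type iterDeltas struct {
	blockBytes, blockBytesInCache, blockReadNanos int64
	externalSeeks, externalSteps                  int64
	internalSeeks, internalSteps                  int64
}

// atomicAgg pays the synchronization cost (write barrier, cache-line
// traffic) on each of the seven increments.
type atomicAgg struct{ d iterDeltas }

func (a *atomicAgg) aggregate(d iterDeltas) {
	atomic.AddInt64(&a.d.blockBytes, d.blockBytes)
	atomic.AddInt64(&a.d.blockBytesInCache, d.blockBytesInCache)
	atomic.AddInt64(&a.d.blockReadNanos, d.blockReadNanos)
	atomic.AddInt64(&a.d.externalSeeks, d.externalSeeks)
	atomic.AddInt64(&a.d.externalSteps, d.externalSteps)
	atomic.AddInt64(&a.d.internalSeeks, d.internalSeeks)
	atomic.AddInt64(&a.d.internalSteps, d.internalSteps)
}

// mutexAgg pays for a single Lock/Unlock, amortized over seven plain
// additions inside a tiny critical section.
type mutexAgg struct {
	mu sync.Mutex
	d  iterDeltas
}

func (m *mutexAgg) aggregate(d iterDeltas) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.d.blockBytes += d.blockBytes
	m.d.blockBytesInCache += d.blockBytesInCache
	m.d.blockReadNanos += d.blockReadNanos
	m.d.externalSeeks += d.externalSeeks
	m.d.externalSteps += d.externalSteps
	m.d.internalSeeks += d.internalSeeks
	m.d.internalSteps += d.internalSteps
}
```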
Also, today all metrics returned from Engine.GetMetrics() are gauges because all the metrics reported by Pebble are cumulative, not deltas, even if internally they're monotonically increasing counters.
I goofed and wasn't propagating the metrics in Engine.GetMetrics(), so I made that change too.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick and @tbg)
pkg/storage/engine.go
line 1047 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It would be helpful to reference this type in the help text for each of the metrics as a pointer for more detailed semantics of what's tracked within each.
Done.
pkg/storage/engine.go
line 1074 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"If this is high..." could you copy this to the metric help text?
Done.
pkg/storage/engine.go
line 1082 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto about "If this is high..."
Done.
pkg/storage/pebble_iterator.go
line 52 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
and it can't be nil, right?
Done.
Thoughts on adding some/all of these metrics to the telemetry event pipeline? Would they be helpful for observing long-term trends or identifying areas for investigation/improvement? Could certainly be added in a follow-up.
Force-pushed from f71a4db to d901bc0.
> eg, running this benchmark on my gceworker:
I dug a little deeper on this benchmark. I don't think a parallelism of 64 is sufficient - since this is in the read hot path, we have to think about 1000s of ops/sec, so I extended to high concurrencies (and thinned it out a bit).
Second, I think the benchmark was doing it wrong? It was running parallelism*b.N operations, but the benchmark harness assumed it was b.N, so we'd really have to divide all reported ops/sec numbers by parallelism. I cleaned that all up a bit (using b.RunParallel in the benchmark and running only b.N ops each). I also added a sharded mutex (GOMAXPROCS shards). The resulting numbers (gceworker) are... surprising: the standard mutex actually beats the other options and actually gets... better? It's almost as though queuing up behind a single mutex is more efficient than shorter queues across multiple mutexes. I wonder if there's something smart going on in the scheduler, or something dumb in my code.
https://github.com/tbg/goplay/blob/master/mutex/mutex_test.go
https://docs.google.com/spreadsheets/d/1B9g1ZfzjhpZefJr9RlZWKJ3XvNiokHW10M8q-Eb7--A/edit?usp=sharing
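For readers following along, a minimal sketch of the corrected accounting (the shape of the fix, not the exact code in the linked mutex_test.go): b.RunParallel hands out exactly b.N iterations in total across its goroutines, so the reported ns/op no longer needs to be rescaled by the parallelism.

```go
package mutexbench

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkMutexInc(b *testing.B) {
	for _, parallelism := range []int{1, 4, 16, 64} {
		b.Run(fmt.Sprintf("parallelism=%d", parallelism), func(b *testing.B) {
			var (
				mu sync.Mutex
				n  int64
			)
			// Note: RunParallel starts parallelism*GOMAXPROCS goroutines
			// (see the follow-up comment below).
			b.SetParallelism(parallelism)
			b.RunParallel(func(pb *testing.PB) {
				// The b.N iterations are shared across all goroutines, so
				// ns/op reflects the cost of a single locked increment.
				for pb.Next() {
					mu.Lock()
					n++
					mu.Unlock()
				}
			})
			_ = n
		})
	}
}
```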
I suppose if this new mutex does cause an issue, it is best observed in roachperf rather than by embarking on a discussion too focused on microbenchmarks.
Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick and @jbowens)
pkg/kv/kvserver/metrics.go
line 2473 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I was expecting them all to be counters. They are all gauges. What's going on?
Could you add a comment that these are cumulative values, but tracked as gauges because the cumulative counters mostly live in pebble (though this isn't always true, it's more convenient to stick to the pattern throughout), at the top of this block (i.e. not on each metric)?
I also wonder (not for this PR) if we can't use counters here regardless. A gauge and a counter get different prometheus metric types, and there are consequences to this choice. For example, when plotting a counter on grafana, it will remind you that this is a counter and propose looking at the rate(). Having a gauge that is really a counter is unintuitive.
I think the only reason we're using gauges here is because we don't have a .Update method on Counter. I think this could be added and all of these metrics ported. Filed #99922; perhaps mention it in the comment requested above.
pkg/storage/pebble.go (outdated)
@@ -1338,13 +1357,13 @@ func (p *Pebble) NewMVCCIterator(iterKind MVCCIterKind, opts IterOptions) MVCCIt
return maybeWrapInUnsafeIter(iter)
}

iter := newPebbleIterator(p.db, opts, StandardDurability)
iter := newPebbleIterator(p.db, opts, StandardDurability, p.aggregateIterStats)
Drive-by comment: does passing this method reference to newPebbleIterator result in a new heap allocation? If so, consider replacing func(IteratorStats) with an interface that is implemented by Pebble. p *Pebble is already on the heap, so that approach would not require a heap allocation.
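A rough sketch of the allocation concern, with simplified stand-ins for the real pkg/storage types (iterStatsReporter is a hypothetical name, not necessarily the identifier the PR ends up using): the method value p.aggregateIterStats is a closure that escapes to the heap when stored on the iterator, whereas an interface value wrapping the already-heap-allocated *Pebble adds no new allocation.

```go
package storage // sketch; simplified stand-ins for the real pkg/storage types

type IteratorStats struct{ externalSeeks, externalSteps int }

type Pebble struct{ aggregated IteratorStats }

func (p *Pebble) aggregateIterStats(s IteratorStats) {
	p.aggregated.externalSeeks += s.externalSeeks
	p.aggregated.externalSteps += s.externalSteps
}

// iterStatsReporter is a hypothetical interface of the kind suggested above.
type iterStatsReporter interface {
	aggregateIterStats(IteratorStats)
}

type pebbleIterator struct {
	// Before: storing the method value p.aggregateIterStats allocates a
	// bound-method closure per iterator.
	reportStatsFn func(IteratorStats)
	// After: storing p behind an interface reuses the existing *Pebble
	// pointer and avoids the extra allocation.
	reportStatsTo iterStatsReporter
}

func newPebbleIterator(p *Pebble) *pebbleIterator {
	return &pebbleIterator{
		reportStatsFn: p.aggregateIterStats, // closure allocation
		reportStatsTo: p,                    // no additional allocation
	}
}
```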
Note that SetParallelism sets the number of goroutines used by RunParallel to p*GOMAXPROCS, so the parallelism you are looking at is way too high.
I have a feeling that the atomic would fare relatively worse if the values were not packed close together in 1-2 cachelines.
It might be more intuitive to show a graph of throughput (total number of ops per second). That one should grow ~linearly up to the number of CPUs then level off and start going down a bit. I think it would also show that the number is so much higher than what we're actually doing in CRDB that this is mostly academic 😂
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @jbowens, and @tbg)
Force-pushed from d901bc0 to 5b6f6f8.
> It was running parallelism*b.N operations but the benchmark harness assumed it was b.N. So we'd really have to divide all reported ops/sec numbers by parallelism.
Makes sense, but it doesn't impact the relative numbers between mutex and atomics.
> I don't think a parallelism of 64 is sufficient - since this is in the read hot path, we have to think about 1000s of ops/sec
There would need to be millions of iterators closing in parallel to see 1000s of iterators aggregating stats in parallel. Since the critical section is tiny, my expectation is that higher parallelism beyond the expected number of CPU threads won't matter. The probability of any individual aggregateIterStats call being context switched away while in the critical section is incredibly small, and IIRC the golang mutex has an optimistic spin-loop fast path for very short critical sections like this.
I think we see the effect of the cache line invalidation on every individual atomic increment. At the extreme parallelism counts, every increment can expect a cache miss. In contrast, the mutex suffers the cost of the write barrier and cache miss once, on mutex acquisition.
FWIW, iterator construction also acquires an Engine-wide RWMutex in order to acquire the current version of the LSM. We've never seen contention there. (We typically see it on the block cache shards, which see much higher ops/sec and are sharded 2*num procs: cockroachdb/pebble#1997).
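For completeness, a sharded-mutex counter of the general shape tbg experimented with (and the kind of sharding the block cache uses) might look like the following. This is a sketch under assumptions: the shard-selection strategy, padding, and shard count are illustrative, not the benchmark's or Pebble's actual code.

```go
package shardedagg

import (
	"math/rand"
	"runtime"
	"sync"
)

// shard is padded so neighboring shards don't share a cache line (assuming
// 64-byte lines; sync.Mutex and int64 are 8 bytes each).
type shard struct {
	mu sync.Mutex
	n  int64
	_  [48]byte
}

// ShardedCounter spreads contention across 2*GOMAXPROCS mutexes, mirroring
// the block cache's sharding factor mentioned above.
type ShardedCounter struct{ shards []shard }

func NewShardedCounter() *ShardedCounter {
	return &ShardedCounter{shards: make([]shard, 2*runtime.GOMAXPROCS(0))}
}

// Add locks only one randomly chosen shard, so concurrent callers usually
// don't contend with each other.
func (c *ShardedCounter) Add(delta int64) {
	s := &c.shards[rand.Intn(len(c.shards))]
	s.mu.Lock()
	s.n += delta
	s.mu.Unlock()
}

// Sum walks all shards; good enough for a periodic metrics scrape.
func (c *ShardedCounter) Sum() int64 {
	var total int64
	for i := range c.shards {
		c.shards[i].mu.Lock()
		total += c.shards[i].n
		c.shards[i].mu.Unlock()
	}
	return total
}
```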
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/metrics.go
line 2473 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you add a comment that these are cumulative values, but tracked as gauges because the cumulative counters mostly live in pebble (though this isn't always true, it's more convenient to stick to the pattern throughout), at the top of this block (i.e. not on each metric)?
I also wonder (not for this PR) if we can't use counters here regardless. A gauge and a counter get different prometheus metric types, and there are consequences to this choice. For example, when plotting a counter on grafana, it will remind you that this is a counter and propose looking at the rate(). Having a gauge that is really a counter is unintuitive.
I think the only reason we're using gauges here is because we don't have a .Update method on Counter. I think this could be added and all of these metrics ported. Filed #99922; perhaps mention it in the comment requested above.
Added a comment here, and updated the existing comment above the struct comments to reference #99922.
pkg/storage/pebble.go
line 1360 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Drive-by comment: does passing this method reference to newPebbleIterator result in a new heap allocation? If so, consider replacing func(IteratorStats) with an interface that is implemented by Pebble. p *Pebble is already on the heap, so that approach would not require a heap allocation.
Good catch, it does. Fixed.
Force-pushed from 5b6f6f8 to 0758cc7.
This comment was marked as outdated.
^-- hiding the above, @RaduBerinde pointed out the numbers are still garbage - it's hard to get something useful out of the go bench harness, @jbowens' initial attempt was still the best shot. I went looking upstream too but they only have a very basic mutex benchmark. I suppose it doesn't matter, since we're confident there isn't going to be noticeable contention. Sorry about the noise!
LGTM (but note the eng-wide ask to not bors anything right now)
Reviewed 10 of 10 files at r5, 11 of 11 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick and @nvanbenschoten)
Reviewed 1 of 5 files at r1, 8 of 10 files at r5, 11 of 11 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @bananabrick and @nvanbenschoten)
Force-pushed from 0758cc7 to 14574ff.
TFTRs! bors r+
Build failed (retrying...):
Build failed (retrying...):
Build failed (retrying...):
Build failed (retrying...):
Build failed:
Previously, the EngineIterator interface exposed a GetRawIter method to retrieve the underlying Pebble iterator. This existed only to facilitate iterator clones and was intended to only ever be consumed by the storage package itself. This commit replaces GetRawIter with a CloneContext method that returns an opaque CloneContext type that contains the raw iterator. This prevents external packages from directly using the raw Pebble iterator. It also prepares for the introduction of aggregation of iterator stats, providing an envelope for propagating information on where to aggregate iterator stats when the clonee iterator closes.

Epic: None
Release note: None
Aggregate the iterator stats across all of an engine's iterators. Expose seven new timeseries metrics for visibility into the behavior of storage engine iterators:
- storage.iterator.block-load.bytes
- storage.iterator.block-load.cached-bytes
- storage.iterator.block-load.read-duration
- storage.iterator.external.seeks
- storage.iterator.external.steps
- storage.iterator.internal.seeks
- storage.iterator.internal.steps

Close cockroachdb#95790.

Epic: None
Release note (ops change): Introduces seven new timeseries metrics for better visibility into the behavior of storage engine iterators and their internals.
Force-pushed from 14574ff to e1254a8.
bors r+
Build failed:
Build failed:
Build failed:
bors retry
Build succeeded:
@@ -223,11 +223,11 @@ func BenchmarkMVCCGet_Pebble(b *testing.B) {
ctx := context.Background()
for _, batch := range []bool{false, true} {
b.Run(fmt.Sprintf("batch=%t", batch), func(b *testing.B) {
for _, numVersions := range []int{1, 10, 100} {
for _, numVersions := range []int{10} {
@jbowens did you intend to merge this change to the benchmark?