rac2: add token counter and stream metrics #129350

kvoli · 2024-08-20T19:23:54Z

This commit introduces metrics related to stream eval tokens and stream
send tokens. Hooking up these metrics to the registry will be in a
subsequent commit.

There are two separate metric structs used:

1. tokenCounterMetrics, which only contains counters and is shared
   among all tokenCounters on the same node. Each tokenCounter
   updates the shared counters after adjust is called.
2. tokenStreamMetrics, which is updated periodically by calling
   UpdateMetricGauges via the StreamTokenCounterProvider, which is
   one per node.

Metrics related to WaitForEval (as well as blocked stream logging) are
also deferred to a subsequent commit.

Part of: #128031
Release note: None

blathers-crl · 2024-08-20T19:23:58Z

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl · 2024-08-22T14:54:54Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

kvoli · 2024-08-26T22:11:29Z

Open to suggestions for the migration. I was planning on just hooking these up and having both metrics be enabled at once. We could reuse the existing metrics with some heavy refactoring, if that were desirable.

sumeerbhola

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)

pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go line 161 at r2 (raw file):

	ElasticFlowTokensDeducted    *metric.Counter
	ElasticFlowTokensReturned    *metric.Counter
	ElasticFlowTokensUnaccounted *metric.Counter

is there a reason these are not arrays too?

btw, I am not a fan of the callback based pattern of adjustment that was used in the RACv1 code -- it is error prone and breaks abstraction boundaries. The two patterns I prefer:

1. The kvserver metrics pattern, where something calls the various components periodically, extracts stats structs and updates both gauges and counters. It does require more work to implement, since the various components need to provide their cumulative stats. In cases like Pebble this was already the case.
2. The data-structures are responsible for updating the (sometimes shared) cumulative metrics when they update their state. This distributes the metrics update to where the state updates are happening which IMO is desirable. Less abstraction leakage.

The other minor nit with the RACv1 structuring is that it prefixed everything with kvadmission.flow_controller. RACv1 was not in the kvadmission package, and some metrics relate to token counters and some relate to waiting etc. I think we need to address this since we now have both eval and send tokens and concepts like WaitDuration aren't necessarily relevant for send tokens, and they are relevant to WaitForEval. Also, since we have both eval and send tokens, I withdraw my earlier position that we should share metrics with RACv1. There was a valid point made yesterday by Andrew about not naming these v2.

So I'd suggest a naming scheme that segments out the TokenCounter metrics, such as
kvflowcontrol.tokens.<eval|send>.<regular|elastic>.<deducted|returned|unaccounted|available>

Then have a struct like

type tokenCounterMetrics struct {
   deducted [NumWorkClasses]*metric.Counter
   returned ...
   ...
}

containing all the cumulative metrics that a TokenCounter needs to adjust. Provide the TokenCounter the struct when creating it in StreamTokenCounterProvider. StreamTokenCounterProvider is the one that will interface with a registry since there is only one provider per node.
Additionally, StreamTokenCounterProvider will have a separate struct for gauge metrics (all functional gauges) that require iteration over the TokenCounters (e.g. stream count, available tokens).

All the StreamTokenCounterProvider and TokenCounter changes belong in one PR.

The remaining are eval wait metrics, which could all be named with the kvflowcontrol.eval_wait prefix. We could possibly update them in RangeController.WaitForEval. RangeController would expect a metrics struct to be provided to it as part of its options.
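
The shared cumulative metrics struct described above could look roughly like the following. This is a minimal sketch and not the code in this PR: `counter` is a hypothetical stand-in for `*metric.Counter`, the token counter is reduced to a single `adjust` method, and `newTokenCounterMetrics` is an assumed constructor name.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// workClass distinguishes regular from elastic work (hypothetical type).
type workClass int

const (
	regular workClass = iota
	elastic
	numWorkClasses
)

// counter is a stand-in for *metric.Counter: a cumulative, thread-safe value.
type counter struct{ v atomic.Int64 }

func (c *counter) Inc(delta int64) { c.v.Add(delta) }
func (c *counter) Count() int64    { return c.v.Load() }

// tokenCounterMetrics holds the cumulative metrics, indexed by work class.
// A single instance is shared by every tokenCounter on the node.
type tokenCounterMetrics struct {
	deducted [numWorkClasses]*counter
	returned [numWorkClasses]*counter
}

func newTokenCounterMetrics() *tokenCounterMetrics {
	m := &tokenCounterMetrics{}
	for wc := workClass(0); wc < numWorkClasses; wc++ {
		m.deducted[wc] = &counter{}
		m.returned[wc] = &counter{}
	}
	return m
}

// tokenCounter tracks the tokens of one stream and reports every state
// change to the shared metrics itself, with no callback involved.
type tokenCounter struct {
	mu      sync.Mutex
	tokens  [numWorkClasses]int64
	metrics *tokenCounterMetrics
}

// adjust applies delta to the given work class and updates the shared
// cumulative counters right where the state changes.
func (t *tokenCounter) adjust(wc workClass, delta int64) {
	t.mu.Lock()
	t.tokens[wc] += delta
	t.mu.Unlock()
	if delta < 0 {
		t.metrics.deducted[wc].Inc(-delta)
	} else if delta > 0 {
		t.metrics.returned[wc].Inc(delta)
	}
}

func main() {
	m := newTokenCounterMetrics()
	a := &tokenCounter{metrics: m}
	b := &tokenCounter{metrics: m}
	a.adjust(regular, -10)
	b.adjust(regular, -5)
	a.adjust(regular, 4)
	fmt.Printf("regular deducted=%d returned=%d\n",
		m.deducted[regular].Count(), m.returned[regular].Count())
}
```

Because both token counters hold the same `*tokenCounterMetrics`, the node-level totals need no aggregation step; only the gauges would need iteration over the counters.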

pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 37 at r1 (raw file):

// Eval returns the evaluation token counter for the given stream.
func (p *StreamTokenCounterProvider) Eval(stream kvflowcontrol.Stream) *TokenCounter {
	t, _ := p.evalCounters.LoadOrStore(stream, NewTokenCounter(p.settings))

We shouldn't be using LoadOrStore directly since NewTokenCounter is expensive (it allocates). The usual pattern is to first call Load and if it is not found, call LoadOrStore. And if we expect a high concurrency in those that fail the Load, and the new is very expensive, do a creation mutex acquisition, call Load a second time, and if it fails call LoadOrStore while holding the mutex. In this case I think that is not necessary.
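
A sketch of the two lookup patterns described above, using the standard library `sync.Map` as a stand-in for `syncutil.Map`. The `provider`, `stream`, and `newTokenCounter` names are hypothetical, and the `created` counter exists only to make the allocations observable.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stream is a hypothetical map key standing in for kvflowcontrol.Stream.
type stream struct{ tenantID, storeID uint64 }

type tokenCounter struct{ tokens int64 }

// created counts constructor calls so the allocation savings can be seen.
var created atomic.Int64

func newTokenCounter() *tokenCounter {
	created.Add(1)
	return &tokenCounter{}
}

type provider struct {
	counters sync.Map   // stream -> *tokenCounter
	createMu sync.Mutex // only used by getSerialized
}

// get calls Load first, so the common hit path never allocates. Concurrent
// misses may each allocate, but LoadOrStore keeps exactly one value.
func (p *provider) get(s stream) *tokenCounter {
	if t, ok := p.counters.Load(s); ok {
		return t.(*tokenCounter)
	}
	t, _ := p.counters.LoadOrStore(s, newTokenCounter())
	return t.(*tokenCounter)
}

// getSerialized is the heavier variant: misses take a creation mutex and
// re-check with Load, so at most one caller constructs a value per key.
func (p *provider) getSerialized(s stream) *tokenCounter {
	if t, ok := p.counters.Load(s); ok {
		return t.(*tokenCounter)
	}
	p.createMu.Lock()
	defer p.createMu.Unlock()
	if t, ok := p.counters.Load(s); ok {
		return t.(*tokenCounter)
	}
	t, _ := p.counters.LoadOrStore(s, newTokenCounter())
	return t.(*tokenCounter)
}

func main() {
	p := &provider{}
	s := stream{tenantID: 1, storeID: 1}
	first := p.get(s)
	second := p.get(s)
	fmt.Println("same counter:", first == second)
	fmt.Println("constructor calls:", created.Load())
}
```

The first variant is usually enough when construction is a plain allocation; the mutex variant trades some contention on misses for a guarantee of a single construction per key.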

pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 43 at r1 (raw file):

// Send returns the send token counter for the given stream.
func (p *StreamTokenCounterProvider) Send(stream kvflowcontrol.Stream) *TokenCounter {
	t, _ := p.sendCounters.LoadOrStore(stream, NewTokenCounter(p.settings))

ditto

pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 144 at r1 (raw file):

// kvflowcontrol.Stream. It's used to synchronize handoff between threads
// returning and waiting for flow tokens.
type TokenCounter struct {

is this exported since it is getting called outside the package (or will be)?

kvoli

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)

pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go line 161 at r2 (raw file):

Previously, sumeerbhola wrote…

is there a reason these are not arrays too?

btw, I am not a fan of the callback based pattern of adjustment that was used in the RACv1 code -- it is error prone and breaks abstraction boundaries. The two patterns I prefer:

The kvserver metrics pattern, where something calls the various components periodically, extracts stats structs and updates both gauges and counters. It does require more work to implement, since the various components need to provide their cumulative stats. In cases like Pebble this was already the case.

The data-structures are responsible for updating the (sometimes shared) cumulative metrics when they update their state. This distributes the metrics update to where the state updates are happening which IMO is desirable. Less abstraction leakage.

The other minor nit with the RACv1 structuring is that it prefixed everything with kvadmission.flow_controller. RACv1 was not in the kvadmission package, and some metrics relate to token counters and some relate to waiting etc. I think we need to address this since we now have both eval and send tokens and concepts like WaitDuration aren't necessarily relevant for send tokens, and they are relevant to WaitForEval. Also, since we have both eval and send tokens, I withdraw my earlier position that we should share metrics with RACv1. There was a valid point made yesterday by Andrew about not naming these v2.

So I'd suggest a naming scheme that segments out the TokenCounter metrics, such as
kvflowcontrol.tokens.<eval|send>.<regular|elastic>.<deducted|returned|unaccounted|available>

Then have a struct like
type tokenCounterMetrics struct {
   deducted [NumWorkClasses]*metric.Counter
   returned ...
   ...
}
containing all the cumulative metrics that a TokenCounter needs to adjust. Provide the TokenCounter the struct when creating it in StreamTokenCounterProvider. StreamTokenCounterProvider is the one that will interface with a registry since there is only one provider per node.
Additionally, StreamTokenCounterProvider will have a separate struct for gauge metrics (all functional gauges) that require iteration over the TokenCounters (e.g. stream count, available tokens).

All the StreamTokenCounterProvider and TokenCounter changes belong in one PR.

The remaining are eval wait metrics, which could all be named with the kvflowcontrol.eval_wait prefix. We could possibly update them in RangeController.WaitForEval. RangeController would expect a metrics struct to be provided to it as part of its options.

TYFTR

I'll go through and do some refactoring. Agree on most of these points, these diffs were taken partially as is from the prototype which did the same thing with the existing v1 metrics.

kvoli

Should be good for another round @sumeerbhola.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)

pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go line 161 at r2 (raw file):

Previously, kvoli (Austen) wrote…

TYFTR

I'll go through and do some refactoring. Agree on most of these points, these diffs were taken partially as is from the prototype which did the same thing with the existing v1 metrics.

Updated to slim down this PR to just the token/stream metrics. The logging, hookup and eval metrics will be in separate PRs, on top of this.

pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 37 at r1 (raw file):

Previously, sumeerbhola wrote…

We shouldn't be using LoadOrStore directly since NewTokenCounter is expensive (it allocates). The usual pattern is to first call Load and if it is not found, call LoadOrStore. And if we expect a high concurrency in those that fail the Load, and the new is very expensive, do a creation mutex acquisition, call Load a second time, and if it fails call LoadOrStore while holding the mutex. In this case I think that is not necessary.

Done.

pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 43 at r1 (raw file):

Previously, sumeerbhola wrote…

ditto

Done.

pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 144 at r1 (raw file):

Previously, sumeerbhola wrote…

is this exported since it is getting called outside the package (or will be)?

No reason it should be, changed to private.

sumeerbhola

Reviewed 1 of 8 files at r3, 7 of 7 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)

pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 81 at r4 (raw file):

		return func(stream kvflowcontrol.Stream, t *tokenCounter) bool {
			count[metricType][regular]++
			count[flowControlEvalMetricType][elastic]++

why is this not indexed by metricType. Same question below.

pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 97 at r4 (raw file):

	p.evalCounters.Range(gaugeUpdateFn(flowControlEvalMetricType))
	p.sendCounters.Range(gaugeUpdateFn(flowControlEvalMetricType))

shouldn't this be flowControlSendMetricType?

Worth adding some test cases that would find bugs.
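
To illustrate the two comments above, here is a sketch of a gauge update in which every write is indexed by the metric type passed to the closure, and the send counters are ranged over with the send metric type. The types are simplified assumptions (`sync.Map` instead of `syncutil.Map`, plain arrays instead of gauges) and `snapshot` is a hypothetical name, not the code under review.

```go
package main

import (
	"fmt"
	"sync"
)

type metricType int

const (
	evalMetricType metricType = iota
	sendMetricType
	numMetricTypes
)

type workClass int

const (
	regular workClass = iota
	elastic
	numWorkClasses
)

// tokenCounter is reduced to the available tokens per work class.
type tokenCounter struct{ available [numWorkClasses]int64 }

// gaugeSnapshot is a stand-in for the gauge metrics struct.
type gaugeSnapshot struct {
	streamCount [numMetricTypes][numWorkClasses]int64
	available   [numMetricTypes][numWorkClasses]int64
}

type provider struct {
	evalCounters, sendCounters sync.Map // stream key -> *tokenCounter
}

// snapshot iterates over both maps and fills the gauges. Every index uses
// the closure's metric type, never a hard-coded constant.
func (p *provider) snapshot() gaugeSnapshot {
	var s gaugeSnapshot
	visit := func(mt metricType) func(key, value any) bool {
		return func(_, value any) bool {
			t := value.(*tokenCounter)
			for wc := workClass(0); wc < numWorkClasses; wc++ {
				s.streamCount[mt][wc]++
				s.available[mt][wc] += t.available[wc]
			}
			return true // keep iterating
		}
	}
	p.evalCounters.Range(visit(evalMetricType))
	p.sendCounters.Range(visit(sendMetricType))
	return s
}

func main() {
	p := &provider{}
	p.evalCounters.Store("s1", &tokenCounter{available: [numWorkClasses]int64{16, 8}})
	p.evalCounters.Store("s2", &tokenCounter{available: [numWorkClasses]int64{4, 2}})
	p.sendCounters.Store("s1", &tokenCounter{available: [numWorkClasses]int64{1, 1}})
	s := p.snapshot()
	fmt.Println("eval streams:", s.streamCount[evalMetricType][regular])
	fmt.Println("send streams:", s.streamCount[sendMetricType][regular])
}
```

A test that stores a different number of eval and send streams, as in `main` above, would catch both mistakes, since a hard-coded eval index leaves the send gauges at zero.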

pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 93 at r4 (raw file):

type deltaStats struct {
	noTokenDuration                time.Duration
	tokensDeducted, tokensReturned kvflowcontrol.Tokens

we don't seem to use these tokensDeducted and tokensReturned stats. Was this carried over from v1 because of the different way it was doing cumulative metrics?

pkg/kv/kvserver/kvflowcontrol/rac2/token_counter_test.go line 44 at r4 (raw file):

}

func TestTokenAdjustment(t *testing.T) {

is it easy to check the metrics values in a few places in the tests in this file?

kvoli

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)

pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 81 at r4 (raw file):

Previously, sumeerbhola wrote…

why is this not indexed by metricType. Same question below.

Bug, fixed.

pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 97 at r4 (raw file):

Previously, sumeerbhola wrote…

shouldn't this be flowControlSendMetricType?

Worth adding some test cases that would find bugs.

It should be. Added some tests.

pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 93 at r4 (raw file):

Previously, sumeerbhola wrote…

we don't seem to use these tokensDeducted and tokensReturned stats. Was this carried over from v1 because of the different way it was doing cumulative metrics?

These will be used for blocked stream logging, I added them in as part of the deltaStats but only the noTokenDuration is being used atm. I can remove these from the PR and move to follow up w/ the logging but would prefer to keep them otherwise.

pkg/kv/kvserver/kvflowcontrol/rac2/token_counter_test.go line 44 at r4 (raw file):

Previously, sumeerbhola wrote…

is it easy to check the metrics values in a few places in the tests in this file?

Yeah pretty easy. Added.

kvoli · 2024-08-29T17:40:44Z

Should be ready for another look @sumeerbhola.

Prior to this change, `TokenCounter` provided an interface implemented by `*tokenCounter`. As there is only one implementation, de-interface `TokenCounter`. Also, store the TokenCounter in a `syncutil.Map`, as opposed to a native mutex protected map.

Epic: CRDB-37515
Release note: None

This commit introduces metrics related to stream eval tokens and stream send tokens. Hooking up these metrics to the registry will be in a subsequent commit.

There are two separate metric structs used:

1. `tokenCounterMetrics`, which only contains counters and is shared among all `tokenCounter`s on the same node. Each `tokenCounter` updates the shared counters after `adjust` is called.
2. `tokenStreamMetrics`, which is updated periodically by calling `UpdateMetricGauges` via the `StreamTokenCounterProvider`, which is one per node.

Metrics related to `WaitForEval` (as well as blocked stream logging) are also deferred to a subsequent commit.

Part of: cockroachdb#128031
Release note: None

sumeerbhola

Reviewed 1 of 8 files at r7, 7 of 7 files at r8, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)

pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 93 at r4 (raw file):

Previously, kvoli (Austen) wrote…

These will be used for blocked stream logging, I added them in as part of the deltaStats but only the noTokenDuration is being used atm. I can remove these from the PR and move to follow up w/ the logging but would prefer to keep them otherwise.

Fine to keep them

kvoli · 2024-08-30T15:46:22Z

TYFTR!

bors r=sumeerbhola

craig · 2024-08-30T16:14:17Z

Build succeeded:

kvoli force-pushed the 240820.rac-metrics branch from 9effb64 to c60b8e8 Compare August 20, 2024 19:25

kvoli changed the title ~~rac2: implement wait for eval~~ .*: [dnm] add flow control v2 metrics Aug 20, 2024

kvoli force-pushed the 240820.rac-metrics branch from c60b8e8 to 10409e4 Compare August 22, 2024 14:54

kvoli force-pushed the 240820.rac-metrics branch from 10409e4 to 9e46ab4 Compare August 26, 2024 20:44

kvoli changed the title ~~.*: [dnm] add flow control v2 metrics~~ rac2: add metrics Aug 26, 2024

kvoli force-pushed the 240820.rac-metrics branch 2 times, most recently from f6cb7f9 to c39768b Compare August 26, 2024 22:02

kvoli self-assigned this Aug 26, 2024

kvoli force-pushed the 240820.rac-metrics branch from c39768b to 4ae6f67 Compare August 26, 2024 22:04

kvoli marked this pull request as ready for review August 26, 2024 22:28

kvoli requested a review from a team as a code owner August 26, 2024 22:28

kvoli requested review from pav-kv and sumeerbhola August 26, 2024 22:28

sumeerbhola requested changes Aug 28, 2024

kvoli commented Aug 28, 2024

kvoli force-pushed the 240820.rac-metrics branch 2 times, most recently from 4b332f6 to 3af02c8 Compare August 28, 2024 18:28

kvoli changed the title ~~rac2: add metrics~~ rac2: add token counter and stream metrics Aug 28, 2024

kvoli requested a review from sumeerbhola August 28, 2024 18:32

kvoli commented Aug 28, 2024

kvoli force-pushed the 240820.rac-metrics branch from 3af02c8 to ede81ad Compare August 28, 2024 19:29

sumeerbhola requested changes Aug 28, 2024

kvoli force-pushed the 240820.rac-metrics branch from ede81ad to 447c3fc Compare August 29, 2024 17:34

kvoli commented Aug 29, 2024

kvoli requested a review from sumeerbhola August 29, 2024 17:35

kvoli force-pushed the 240820.rac-metrics branch 2 times, most recently from 7e4977e to a99fcaf Compare August 29, 2024 20:49

kvoli added 2 commits August 29, 2024 18:21

kvoli force-pushed the 240820.rac-metrics branch from a99fcaf to 5d555ec Compare August 29, 2024 22:26

kvoli mentioned this pull request Aug 29, 2024

rac2: add eval wait metrics #129911

Merged

sumeerbhola approved these changes Aug 30, 2024

craig bot merged commit 3373e5d into cockroachdb:master Aug 30, 2024
23 checks passed
