Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rac2: add token counter and stream metrics #129350

Merged
merged 2 commits into from
Aug 30, 2024

Conversation

kvoli
Copy link
Collaborator

@kvoli kvoli commented Aug 20, 2024

This commit introduces metrics related to stream eval tokens and stream
send tokens. Hooking up these metrics to the registry will be in a
subsequent commit.

There are two separate metric structs used:

  1. tokenCounterMetrics, which only contains counter and is shared
    among all tokenCounters on the same node. Each tokenCounter
    updates the shared counters after adjust is called.
  2. tokenStreamMetrics, which is updated periodically by calling
    UpdateMetricGauges via the StreamTokenCounterProvider, which is
    one per node.

Metrics related to WaitForEval (as well as blocked stream logging) are
also deferred to a subsequent commit.

Part of: #128031
Release note: None

Copy link

blathers-crl bot commented Aug 20, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@kvoli kvoli force-pushed the 240820.rac-metrics branch from 9effb64 to c60b8e8 Compare August 20, 2024 19:25
@kvoli kvoli changed the title rac2: implement wait for eval .*: [dnm] add flow control v2 metrics Aug 20, 2024
@kvoli kvoli force-pushed the 240820.rac-metrics branch from c60b8e8 to 10409e4 Compare August 22, 2024 14:54
Copy link

blathers-crl bot commented Aug 22, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kvoli kvoli force-pushed the 240820.rac-metrics branch from 10409e4 to 9e46ab4 Compare August 26, 2024 20:44
@kvoli kvoli changed the title .*: [dnm] add flow control v2 metrics rac2: add metrics Aug 26, 2024
@kvoli kvoli force-pushed the 240820.rac-metrics branch 2 times, most recently from f6cb7f9 to c39768b Compare August 26, 2024 22:02
@kvoli kvoli self-assigned this Aug 26, 2024
@kvoli kvoli force-pushed the 240820.rac-metrics branch from c39768b to 4ae6f67 Compare August 26, 2024 22:04
@kvoli
Copy link
Collaborator Author

kvoli commented Aug 26, 2024

Open to suggestions for the migration. I was planning on just hooking these up and having both metrics be enabled at once. We could reuse the existing metrics with some heavy refactoring, if that were desirable.

@kvoli kvoli marked this pull request as ready for review August 26, 2024 22:28
@kvoli kvoli requested a review from a team as a code owner August 26, 2024 22:28
@kvoli kvoli requested review from pav-kv and sumeerbhola August 26, 2024 22:28
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go line 161 at r2 (raw file):

	ElasticFlowTokensDeducted    *metric.Counter
	ElasticFlowTokensReturned    *metric.Counter
	ElasticFlowTokensUnaccounted *metric.Counter

is there a reason these are not arrays too?

btw, I am not a fan of the callback based pattern of adjustment that was used in the RACv1 code -- it is error prone and breaks abstraction boundaries. The two patterns I prefer:

  • The kvserver metrics pattern, where something calls the various components periodically, extracts stats structs and updates both gauges and counters. It does require more work to implement, since the various components need to provide their cumulative stats. In cases like Pebble this was already the case.
  • The data-structures are responsible for updating the (sometimes shared) cumulative metrics when they update their state. This distributes the metrics update to where the state updates are happening which IMO is desirable. Less abstraction leakage.

The other minor nit with the RACv1 structuring is that it prefixed everything with kvadmission.flow_controller. RACv1 was not in the kvadmission package, and some metrics relate to token counters and some relate to waiting etc. I think we need to address this since we now have both eval and send tokens and concepts like WaitDuration aren't necessarily relevant for send tokens, and they are relevant to WaitForEval. Also, since we have both eval and send tokens, I withdraw my earlier position that we should share metrics with RACv1. There was a valid point made yesterday by Andrew about not naming these v2.

So I'd suggest a naming scheme that segments out the TokenCounter metrics, such as
kvflowcontrol.tokens.<eval|send>.<regular|elastic>.<deducted|returned|unaccounted|available>

Then have a struct like

type tokenCounterMetrics {
   deducted [NumWorkClasses]*metric.Counter
   returned ...
   ...
}

containing all the cumulative metrics that a TokenCounter needs to adjust. Provide the TokenCounter the struct when creating it in StreamTokenCounterProvider. StreamTokenCounterProvider is the one that will interface with a registry since there is only one provider is a node.
Additionally StreamTokenCounterProvider will have a separate struct for gauge metrics (all functional gauges) that require iteration over the TokenCounters (e.g. stream count, available tokens).

All the StreamTokenCounterProvider and TokenCounter changes belong in one PR.

The remaining are eval wait metrics, which could all be named with the kvflowcontrol.eval_wait prefix. We could possibly update them in RangeController.WaitForEval. RangeController would expect a metrics struct to be provided to it as part of its options.


pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 37 at r1 (raw file):

// Eval returns the evaluation token counter for the given stream.
func (p *StreamTokenCounterProvider) Eval(stream kvflowcontrol.Stream) *TokenCounter {
	t, _ := p.evalCounters.LoadOrStore(stream, NewTokenCounter(p.settings))

We shouldn't be using LoadOrStore directly since NewTokenCounter is expensive (it allocates). The usual pattern is to first call Load and if it is not found, call LoadOrStore. And if we expect a high concurrency in those that fail the Load, and the new is very expensive, do a creation mutex acquisition, call Load a second time, and if it fails call LoadOrStore while holding the mutex. In this case I think that is not necessary.


pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 43 at r1 (raw file):

// Send returns the send token counter for the given stream.
func (p *StreamTokenCounterProvider) Send(stream kvflowcontrol.Stream) *TokenCounter {
	t, _ := p.sendCounters.LoadOrStore(stream, NewTokenCounter(p.settings))

ditto


pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 144 at r1 (raw file):

// kvflowcontrol.Stream. It's used to synchronize handoff between threads
// returning and waiting for flow tokens.
type TokenCounter struct {

is this exported since it is getting called outside the package (or will be)?

Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go line 161 at r2 (raw file):

Previously, sumeerbhola wrote…

is there a reason these are not arrays too?

btw, I am not a fan of the callback based pattern of adjustment that was used in the RACv1 code -- it is error prone and breaks abstraction boundaries. The two patterns I prefer:

  • The kvserver metrics pattern, where something calls the various components periodically, extracts stats structs and updates both gauges and counters. It does require more work to implement, since the various components need to provide their cumulative stats. In cases like Pebble this was already the case.
  • The data-structures are responsible for updating the (sometimes shared) cumulative metrics when they update their state. This distributes the metrics update to where the state updates are happening which IMO is desirable. Less abstraction leakage.

The other minor nit with the RACv1 structuring is that it prefixed everything with kvadmission.flow_controller. RACv1 was not in the kvadmission package, and some metrics relate to token counters and some relate to waiting etc. I think we need to address this since we now have both eval and send tokens and concepts like WaitDuration aren't necessarily relevant for send tokens, and they are relevant to WaitForEval. Also, since we have both eval and send tokens, I withdraw my earlier position that we should share metrics with RACv1. There was a valid point made yesterday by Andrew about not naming these v2.

So I'd suggest a naming scheme that segments out the TokenCounter metrics, such as
kvflowcontrol.tokens.<eval|send>.<regular|elastic>.<deducted|returned|unaccounted|available>

Then have a struct like

type tokenCounterMetrics {
   deducted [NumWorkClasses]*metric.Counter
   returned ...
   ...
}

containing all the cumulative metrics that a TokenCounter needs to adjust. Provide the TokenCounter the struct when creating it in StreamTokenCounterProvider. StreamTokenCounterProvider is the one that will interface with a registry since there is only one provider is a node.
Additionally StreamTokenCounterProvider will have a separate struct for gauge metrics (all functional gauges) that require iteration over the TokenCounters (e.g. stream count, available tokens).

All the StreamTokenCounterProvider and TokenCounter changes belong in one PR.

The remaining are eval wait metrics, which could all be named with the kvflowcontrol.eval_wait prefix. We could possibly update them in RangeController.WaitForEval. RangeController would expect a metrics struct to be provided to it as part of its options.

TYFTR

I'll go through and do some refactoring. Agree on most of these points, these diffs were taken partially as is from the prototype which did the same thing with the existing v1 metrics.

@kvoli kvoli force-pushed the 240820.rac-metrics branch 2 times, most recently from 4b332f6 to 3af02c8 Compare August 28, 2024 18:28
@kvoli kvoli changed the title rac2: add metrics rac2: add token counter and stream metrics Aug 28, 2024
@kvoli kvoli requested a review from sumeerbhola August 28, 2024 18:32
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be good for another round @sumeerbhola.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go line 161 at r2 (raw file):

Previously, kvoli (Austen) wrote…

TYFTR

I'll go through and do some refactoring. Agree on most of these points, these diffs were taken partially as is from the prototype which did the same thing with the existing v1 metrics.

Updated to slim down this PR to just the token/stream metrics. The logging, hookup and eval metrics will be in separate PRs, ontop of this.


pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 37 at r1 (raw file):

Previously, sumeerbhola wrote…

We shouldn't be using LoadOrStore directly since NewTokenCounter is expensive (it allocates). The usual pattern is to first call Load and if it is not found, call LoadOrStore. And if we expect a high concurrency in those that fail the Load, and the new is very expensive, do a creation mutex acquisition, call Load a second time, and if it fails call LoadOrStore while holding the mutex. In this case I think that is not necessary.

Done.


pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 43 at r1 (raw file):

Previously, sumeerbhola wrote…

ditto

Done.


pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 144 at r1 (raw file):

Previously, sumeerbhola wrote…

is this exported since it is getting called outside the package (or will be)?

No reason it should be, changed to private.

@kvoli kvoli force-pushed the 240820.rac-metrics branch from 3af02c8 to ede81ad Compare August 28, 2024 19:29
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 8 files at r3, 7 of 7 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 81 at r4 (raw file):

		return func(stream kvflowcontrol.Stream, t *tokenCounter) bool {
			count[metricType][regular]++
			count[flowControlEvalMetricType][elastic]++

why is this not indexed by metricType. Same question below.


pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 97 at r4 (raw file):

	p.evalCounters.Range(gaugeUpdateFn(flowControlEvalMetricType))
	p.sendCounters.Range(gaugeUpdateFn(flowControlEvalMetricType))

shouldn't this be flowControlSendMetricType?

Worth adding some test cases that would find bugs.


pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 93 at r4 (raw file):

type deltaStats struct {
	noTokenDuration                time.Duration
	tokensDeducted, tokensReturned kvflowcontrol.Tokens

we don't seem to use these tokensDeducted and tokensReturned stats. Was this carried over from v1 because of the different way it was doing cumulative metrics?


pkg/kv/kvserver/kvflowcontrol/rac2/token_counter_test.go line 44 at r4 (raw file):

}

func TestTokenAdjustment(t *testing.T) {

is it easy to check the metrics values in a few places in the tests in this file?

@kvoli kvoli force-pushed the 240820.rac-metrics branch from ede81ad to 447c3fc Compare August 29, 2024 17:34
Copy link
Collaborator Author

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)


pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 81 at r4 (raw file):

Previously, sumeerbhola wrote…

why is this not indexed by metricType. Same question below.

Bug, fixed.


pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go line 97 at r4 (raw file):

Previously, sumeerbhola wrote…

shouldn't this be flowControlSendMetricType?

Worth adding some test cases that would find bugs.

It should be. Added some tests.


pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 93 at r4 (raw file):

Previously, sumeerbhola wrote…

we don't seem to use these tokensDeducted and tokensReturned stats. Was this carried over from v1 because of the different way it was doing cumulative metrics?

These will be used for blocked stream logging, I added them in as part of the deltaStats but only the noTokenDuration is being used atm. I can remove these from the PR and move to follow up w/ the logging but would prefer to keep them otherwise.


pkg/kv/kvserver/kvflowcontrol/rac2/token_counter_test.go line 44 at r4 (raw file):

Previously, sumeerbhola wrote…

is it easy to check the metrics values in a few places in the tests in this file?

Yeah pretty easy. Added.

@kvoli kvoli requested a review from sumeerbhola August 29, 2024 17:35
@kvoli
Copy link
Collaborator Author

kvoli commented Aug 29, 2024

Should be ready for another look @sumeerbhola.

@kvoli kvoli force-pushed the 240820.rac-metrics branch 2 times, most recently from 7e4977e to a99fcaf Compare August 29, 2024 20:49
kvoli added 2 commits August 29, 2024 18:21
Prior to this change, `TokenCounter` provided an interface implemented
by `*tokenCounter`. As there is only one implementation, de-interface
`TokenCounter`. Also, store the TokenCounter in a `syncutil.Map`, as opposed
to a native mutex protected map.

Epic: CRDB-37515
Release note: None
This commit introduces metrics related to stream eval tokens and stream
send tokens. Hooking up these metrics to the registry will be in a
subsequent commit.

There are two separate metric structs used:

1. `tokenCounterMetrics`, which only contains counter and is shared
   among all `tokenCounter`s on the same node. Each `tokenCounter`
   updates the shared counters after `adjust` is called.
2. `tokenStreamMetrics`, which is updated periodically by calling
   `UpdateMetricGauges` via the `StreamTokenCounterProvider`, which is
   one per node.

Metrics related to `WaitForEval` (as well as blocked stream logging) are
also deferred to a subsequent commit.

Part of: cockroachdb#128031
Release note: None
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 8 files at r7, 7 of 7 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go line 93 at r4 (raw file):

Previously, kvoli (Austen) wrote…

These will be used for blocked stream logging, I added them in as part of the deltaStats but only the noTokenDuration is being used atm. I can remove these from the PR and move to follow up w/ the logging but would prefer to keep them otherwise.

Fine to keep them

@kvoli
Copy link
Collaborator Author

kvoli commented Aug 30, 2024

TYFTR!

bors r=sumeerbhola

@craig craig bot merged commit 3373e5d into cockroachdb:master Aug 30, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants