-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rac2: add token counter and stream metrics #129350
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
9effb64
to
c60b8e8
Compare
c60b8e8
to
10409e4
Compare
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
10409e4
to
9e46ab4
Compare
f6cb7f9
to
c39768b
Compare
c39768b
to
4ae6f67
Compare
Open to suggestions for the migration. I was planning on just hooking these up and having both metrics be enabled at once. We could reuse the existing metrics with some heavy refactoring, if that were desirable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go
line 161 at r2 (raw file):
ElasticFlowTokensDeducted *metric.Counter ElasticFlowTokensReturned *metric.Counter ElasticFlowTokensUnaccounted *metric.Counter
is there a reason these are not arrays too?
btw, I am not a fan of the callback based pattern of adjustment that was used in the RACv1 code -- it is error prone and breaks abstraction boundaries. The two patterns I prefer:
- The kvserver metrics pattern, where something calls the various components periodically, extracts stats structs and updates both gauges and counters. It does require more work to implement, since the various components need to provide their cumulative stats. In cases like Pebble this was already the case.
- The data-structures are responsible for updating the (sometimes shared) cumulative metrics when they update their state. This distributes the metrics update to where the state updates are happening which IMO is desirable. Less abstraction leakage.
The other minor nit with the RACv1 structuring is that it prefixed everything with kvadmission.flow_controller
. RACv1 was not in the kvadmission package, and some metrics relate to token counters and some relate to waiting etc. I think we need to address this since we now have both eval and send tokens and concepts like WaitDuration
aren't necessarily relevant for send tokens, and they are relevant to WaitForEval. Also, since we have both eval and send tokens, I withdraw my earlier position that we should share metrics with RACv1. There was a valid point made yesterday by Andrew about not naming these v2.
So I'd suggest a naming scheme that segments out the TokenCounter metrics, such as
kvflowcontrol.tokens.<eval|send>.<regular|elastic>.<deducted|returned|unaccounted|available>
Then have a struct like
type tokenCounterMetrics {
deducted [NumWorkClasses]*metric.Counter
returned ...
...
}
containing all the cumulative metrics that a TokenCounter
needs to adjust. Provide the TokenCounter
the struct when creating it in StreamTokenCounterProvider
. StreamTokenCounterProvider
is the one that will interface with a registry since there is only one provider is a node.
Additionally StreamTokenCounterProvider
will have a separate struct for gauge metrics (all functional gauges) that require iteration over the TokenCounters (e.g. stream count, available tokens).
All the StreamTokenCounterProvider
and TokenCounter
changes belong in one PR.
The remaining are eval wait metrics, which could all be named with the kvflowcontrol.eval_wait prefix. We could possibly update them in RangeController.WaitForEval
. RangeController
would expect a metrics struct to be provided to it as part of its options.
pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go
line 37 at r1 (raw file):
// Eval returns the evaluation token counter for the given stream. func (p *StreamTokenCounterProvider) Eval(stream kvflowcontrol.Stream) *TokenCounter { t, _ := p.evalCounters.LoadOrStore(stream, NewTokenCounter(p.settings))
We shouldn't be using LoadOrStore
directly since NewTokenCounter
is expensive (it allocates). The usual pattern is to first call Load
and if it is not found, call LoadOrStore
. And if we expect a high concurrency in those that fail the Load
, and the new is very expensive, do a creation mutex acquisition, call Load
a second time, and if it fails call LoadOrStore
while holding the mutex. In this case I think that is not necessary.
pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go
line 43 at r1 (raw file):
// Send returns the send token counter for the given stream. func (p *StreamTokenCounterProvider) Send(stream kvflowcontrol.Stream) *TokenCounter { t, _ := p.sendCounters.LoadOrStore(stream, NewTokenCounter(p.settings))
ditto
pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go
line 144 at r1 (raw file):
// kvflowcontrol.Stream. It's used to synchronize handoff between threads // returning and waiting for flow tokens. type TokenCounter struct {
is this exported since it is getting called outside the package (or will be)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)
pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go
line 161 at r2 (raw file):
Previously, sumeerbhola wrote…
is there a reason these are not arrays too?
btw, I am not a fan of the callback based pattern of adjustment that was used in the RACv1 code -- it is error prone and breaks abstraction boundaries. The two patterns I prefer:
- The kvserver metrics pattern, where something calls the various components periodically, extracts stats structs and updates both gauges and counters. It does require more work to implement, since the various components need to provide their cumulative stats. In cases like Pebble this was already the case.
- The data-structures are responsible for updating the (sometimes shared) cumulative metrics when they update their state. This distributes the metrics update to where the state updates are happening which IMO is desirable. Less abstraction leakage.
The other minor nit with the RACv1 structuring is that it prefixed everything with
kvadmission.flow_controller
. RACv1 was not in the kvadmission package, and some metrics relate to token counters and some relate to waiting etc. I think we need to address this since we now have both eval and send tokens and concepts likeWaitDuration
aren't necessarily relevant for send tokens, and they are relevant to WaitForEval. Also, since we have both eval and send tokens, I withdraw my earlier position that we should share metrics with RACv1. There was a valid point made yesterday by Andrew about not naming these v2.So I'd suggest a naming scheme that segments out the TokenCounter metrics, such as
kvflowcontrol.tokens.<eval|send>.<regular|elastic>.<deducted|returned|unaccounted|available>Then have a struct like
type tokenCounterMetrics { deducted [NumWorkClasses]*metric.Counter returned ... ... }
containing all the cumulative metrics that a
TokenCounter
needs to adjust. Provide theTokenCounter
the struct when creating it inStreamTokenCounterProvider
.StreamTokenCounterProvider
is the one that will interface with a registry since there is only one provider is a node.
AdditionallyStreamTokenCounterProvider
will have a separate struct for gauge metrics (all functional gauges) that require iteration over the TokenCounters (e.g. stream count, available tokens).All the
StreamTokenCounterProvider
andTokenCounter
changes belong in one PR.The remaining are eval wait metrics, which could all be named with the kvflowcontrol.eval_wait prefix. We could possibly update them in
RangeController.WaitForEval
.RangeController
would expect a metrics struct to be provided to it as part of its options.
TYFTR
I'll go through and do some refactoring. Agree on most of these points, these diffs were taken partially as is from the prototype which did the same thing with the existing v1 metrics.
4b332f6
to
3af02c8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be good for another round @sumeerbhola.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)
pkg/kv/kvserver/kvflowcontrol/rac2/metrics.go
line 161 at r2 (raw file):
Previously, kvoli (Austen) wrote…
TYFTR
I'll go through and do some refactoring. Agree on most of these points, these diffs were taken partially as is from the prototype which did the same thing with the existing v1 metrics.
Updated to slim down this PR to just the token/stream metrics. The logging, hookup and eval metrics will be in separate PRs, ontop of this.
pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go
line 37 at r1 (raw file):
Previously, sumeerbhola wrote…
We shouldn't be using
LoadOrStore
directly sinceNewTokenCounter
is expensive (it allocates). The usual pattern is to first callLoad
and if it is not found, callLoadOrStore
. And if we expect a high concurrency in those that fail theLoad
, and the new is very expensive, do a creation mutex acquisition, callLoad
a second time, and if it fails callLoadOrStore
while holding the mutex. In this case I think that is not necessary.
Done.
pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go
line 43 at r1 (raw file):
Previously, sumeerbhola wrote…
ditto
Done.
pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go
line 144 at r1 (raw file):
Previously, sumeerbhola wrote…
is this exported since it is getting called outside the package (or will be)?
No reason it should be, changed to private.
3af02c8
to
ede81ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r3, 7 of 7 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go
line 81 at r4 (raw file):
return func(stream kvflowcontrol.Stream, t *tokenCounter) bool { count[metricType][regular]++ count[flowControlEvalMetricType][elastic]++
why is this not indexed by metricType
. Same question below.
pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go
line 97 at r4 (raw file):
p.evalCounters.Range(gaugeUpdateFn(flowControlEvalMetricType)) p.sendCounters.Range(gaugeUpdateFn(flowControlEvalMetricType))
shouldn't this be flowControlSendMetricType?
Worth adding some test cases that would find bugs.
pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go
line 93 at r4 (raw file):
type deltaStats struct { noTokenDuration time.Duration tokensDeducted, tokensReturned kvflowcontrol.Tokens
we don't seem to use these tokensDeducted
and tokensReturned
stats. Was this carried over from v1 because of the different way it was doing cumulative metrics?
pkg/kv/kvserver/kvflowcontrol/rac2/token_counter_test.go
line 44 at r4 (raw file):
} func TestTokenAdjustment(t *testing.T) {
is it easy to check the metrics values in a few places in the tests in this file?
ede81ad
to
447c3fc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pav-kv and @sumeerbhola)
pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go
line 81 at r4 (raw file):
Previously, sumeerbhola wrote…
why is this not indexed by
metricType
. Same question below.
Bug, fixed.
pkg/kv/kvserver/kvflowcontrol/rac2/store_stream.go
line 97 at r4 (raw file):
Previously, sumeerbhola wrote…
shouldn't this be flowControlSendMetricType?
Worth adding some test cases that would find bugs.
It should be. Added some tests.
pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go
line 93 at r4 (raw file):
Previously, sumeerbhola wrote…
we don't seem to use these
tokensDeducted
andtokensReturned
stats. Was this carried over from v1 because of the different way it was doing cumulative metrics?
These will be used for blocked stream logging, I added them in as part of the deltaStats but only the noTokenDuration is being used atm. I can remove these from the PR and move to follow up w/ the logging but would prefer to keep them otherwise.
pkg/kv/kvserver/kvflowcontrol/rac2/token_counter_test.go
line 44 at r4 (raw file):
Previously, sumeerbhola wrote…
is it easy to check the metrics values in a few places in the tests in this file?
Yeah pretty easy. Added.
Should be ready for another look @sumeerbhola. |
7e4977e
to
a99fcaf
Compare
Prior to this change, `TokenCounter` provided an interface implemented by `*tokenCounter`. As there is only one implementation, de-interface `TokenCounter`. Also, store the TokenCounter in a `syncutil.Map`, as opposed to a native mutex protected map. Epic: CRDB-37515 Release note: None
This commit introduces metrics related to stream eval tokens and stream send tokens. Hooking up these metrics to the registry will be in a subsequent commit. There are two separate metric structs used: 1. `tokenCounterMetrics`, which only contains counter and is shared among all `tokenCounter`s on the same node. Each `tokenCounter` updates the shared counters after `adjust` is called. 2. `tokenStreamMetrics`, which is updated periodically by calling `UpdateMetricGauges` via the `StreamTokenCounterProvider`, which is one per node. Metrics related to `WaitForEval` (as well as blocked stream logging) are also deferred to a subsequent commit. Part of: cockroachdb#128031 Release note: None
a99fcaf
to
5d555ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r7, 7 of 7 files at r8, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/kv/kvserver/kvflowcontrol/rac2/token_counter.go
line 93 at r4 (raw file):
Previously, kvoli (Austen) wrote…
These will be used for blocked stream logging, I added them in as part of the deltaStats but only the noTokenDuration is being used atm. I can remove these from the PR and move to follow up w/ the logging but would prefer to keep them otherwise.
Fine to keep them
TYFTR! bors r=sumeerbhola |
This commit introduces metrics related to stream eval tokens and stream
send tokens. Hooking up these metrics to the registry will be in a
subsequent commit.
There are two separate metric structs used:
tokenCounterMetrics
, which only contains counter and is sharedamong all
tokenCounter
s on the same node. EachtokenCounter
updates the shared counters after
adjust
is called.tokenStreamMetrics
, which is updated periodically by callingUpdateMetricGauges
via theStreamTokenCounterProvider
, which isone per node.
Metrics related to
WaitForEval
(as well as blocked stream logging) arealso deferred to a subsequent commit.
Part of: #128031
Release note: None