kvcoord: Pace rangefeed client goroutine creation #109346
Conversation
Force-pushed from 81e2e98 to 11d6556.
I'm flushing some initial comments, but I have to follow up on a high-priority L2 escalation. Will resume the review later.
Reviewed 6 of 8 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @ericharmeling, and @miretskiy)
-- commits
line 26 at r1:
Could use a release note, both for the pacing and observability.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 214 at r1 (raw file):
}
metaDistSenderRangefeedPostCatchupRanges = metric.Metadata{
    Name: "distsender.rangefeed.post_catchup_ranges",
Why do we need this when we already have distsender.rangefeed.catchup_ranges? Do we expect this to be any different than distsender.rangefeed.total_ranges - distsender.rangefeed.catchup_ranges? If so, why, and maybe we should fix that instead?
pkg/kv/kvclient/kvcoord/dist_sender.go
line 370 at r1 (raw file):
var retryCounters [kvpb.NumRangeFeedRetryErrors]*metric.Counter
for idx, name := range kvpb.RangeFeedRetryError_Reason_name {
    retryCounters[idx] = metric.NewCounter(metric.Metadata{
This will break when we remove a reason. Consider using a map instead, which also removes the need for `NumRangeFeedRetryErrors` and associated logic.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 371 at r1 (raw file):
for idx, name := range kvpb.RangeFeedRetryError_Reason_name {
    retryCounters[idx] = metric.NewCounter(metric.Metadata{
        Name: fmt.Sprintf("distsender.rangefeed.retry.%s", strings.ToLower(name)),
Should we strip the `reason_` prefix here? Also, we may want to add a note on `RangeFeedRetryError_Reason` that the reason names should be stable because they are exposed via metric names.
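For illustration, stripping the prefix when building the metric name could look like the sketch below; the enum strings shown are assumptions for the example, not the exact kvpb constants:

```go
// Sketch only: derive a stable metric name from a retry-reason enum name,
// stripping the "REASON_" prefix as suggested above. The enum strings below
// are illustrative, not the exact kvpb definitions.
package main

import (
	"fmt"
	"strings"
)

func metricNameForReason(reason string) string {
	// e.g. "REASON_RANGE_SPLIT" -> "distsender.rangefeed.retry.range_split"
	name := strings.TrimPrefix(strings.ToLower(reason), "reason_")
	return fmt.Sprintf("distsender.rangefeed.retry.%s", name)
}

func main() {
	fmt.Println(metricNameForReason("REASON_RANGE_SPLIT"))
	fmt.Println(metricNameForReason("REASON_SLOW_CONSUMER"))
}
```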
pkg/kv/kvclient/kvcoord/dist_sender.go
line 372 at r1 (raw file):
retryCounters[idx] = metric.NewCounter(metric.Metadata{
    Name: fmt.Sprintf("distsender.rangefeed.retry.%s", strings.ToLower(name)),
    Help: `Number of ranges in retried due to rangefeed retry error`,
What about retries due to e.g. connection failures and other errors? Should those be included here too? We'd want this to add up to distsender.rangefeed.restart_ranges, right?
Also, nit: "number of rangefeed retries". It's not the number of ranges, it's the number of retried errors.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 372 at r1 (raw file):
// catchupRes is the catchup scan quota acquired upon the
// start of rangefeed.
// It is released when this stream receives first non-empty checkpoint
The rangefeed protocol sends an empty checkpoint precisely to signal completion of the catchup scan, so the non-empty bit here seems wrong.
cockroach/pkg/kv/kvserver/rangefeed/processor.go
Lines 393 to 398 in b35afe9
// Immediately publish a checkpoint event to the registry. This will be the first event
// published to this registration after its initial catch-up scan completes. The resolved
// timestamp might be empty but the checkpoint event is still useful to indicate that the
// catch-up scan has completed. This allows clients to rely on stronger ordering semantics
// once they observe the first checkpoint event.
r.publish(ctx, p.newCheckpointEvent(), nil)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @ericharmeling, and @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 214 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Why do we need this when we already have distsender.rangefeed.catchup_ranges? Do we expect this to be any different than distsender.rangefeed.total_ranges - distsender.rangefeed.catchup_ranges? If so, why, and maybe we should fix that instead?
You're right... Removed it... I think I had it as a counter at some point; but switching back to gauge made it pretty useless.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 370 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This will break when we remove a reason. Consider using a map instead, which also removes the need for `NumRangeFeedRetryErrors` and associated logic.
First, I unfortunately cannot use a map -- the metrics registry does not support that. Either each retry reason has to be spelled out as a separate field in the metrics struct... or you have to use an array.
Second, you can't (or I should say, you shouldn't) just delete an enum value -- just like deleting any other field, you should "reserve" that value. But you do bring up a good point: when we reserve a value (because of a deletion), the indexes are no longer contiguous.
I've made some changes here. First, I added another change to modify the metrics registry to support slices. For the life of me, I don't understand why we had this restriction (no slices) in the first place. This made it much easier to generate a slice of counters here (without relying on the `NumRangeFeedRetryErrors` constant).
Second, I introduced a helper struct along with an accessor method to hold these counters.
I added comments to clarify why we need this indirection.
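A rough sketch of that shape, with illustrative type and field names (the actual helper in the PR may differ):

```go
// Sketch only: a helper that holds one counter per retry reason, built as a
// slice so the metrics registry can expand it, plus an accessor that maps an
// enum value (which may be non-contiguous once values are reserved) to its
// counter. Names are illustrative, not the PR's code.
package main

import "fmt"

type Counter struct{ n int64 }

func (c *Counter) Inc(d int64) { c.n += d }

// Illustrative stand-in for kvpb.RangeFeedRetryError_Reason_name.
var reasonNames = map[int32]string{
	0: "REASON_REPLICA_REMOVED",
	1: "REASON_RANGE_SPLIT",
	3: "REASON_SLOW_CONSUMER", // value 2 hypothetically "reserved" after a deletion
}

type retryCounters struct {
	counters []*Counter    // registered with the metrics registry as a slice
	byReason map[int32]int // enum value -> index into counters
}

func newRetryCounters() *retryCounters {
	rc := &retryCounters{byReason: map[int32]int{}}
	for value, name := range reasonNames {
		rc.byReason[value] = len(rc.counters)
		rc.counters = append(rc.counters, &Counter{})
		_ = name // a real implementation would build the metric name from this
	}
	return rc
}

// forReason returns the counter for the given retry reason.
func (rc *retryCounters) forReason(reason int32) *Counter {
	return rc.counters[rc.byReason[reason]]
}

func main() {
	rc := newRetryCounters()
	rc.forReason(3).Inc(1)
	fmt.Println(rc.forReason(3).n) // 1
}
```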
pkg/kv/kvclient/kvcoord/dist_sender.go
line 371 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Should we strip the `reason_` prefix here? Also, we may want to add a note on `RangeFeedRetryError_Reason` that the reason names should be stable because they are exposed via metric names.
Yes; done.
And done.
pkg/kv/kvclient/kvcoord/dist_sender.go
line 372 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
What about retries due to e.g. connection failures and other errors? Should those be included here too? We'd want this to add up to distsender.rangefeed.restart_ranges, right?
Also, nit: "number of rangefeed retries". It's not the number of ranges, it's the number of retried errors.
Rephrased comment -- it was "fat finger" copy pasta.
Initially, I wasn't planning to handle all errors, but okay -- more metrics added.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 372 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
The rangefeed protocol sends an empty checkpoint precisely to signal completion of the catchup scan, so the non-empty bit here seems wrong.
cockroach/pkg/kv/kvserver/rangefeed/processor.go
Lines 393 to 398 in b35afe9
// Immediately publish a checkpoint event to the registry. This will be the first event
// published to this registration after its initial catch-up scan completes. The resolved
// timestamp might be empty but the checkpoint event is still useful to indicate that the
// catch-up scan has completed. This allows clients to rely on stronger ordering semantics
// once they observe the first checkpoint event.
r.publish(ctx, p.newCheckpointEvent(), nil)
Perhaps; but we never released the catchup scan reservation when receiving an empty checkpoint:
if !t.ResolvedTS.IsEmpty() && active.catchupRes != nil {
active.releaseCatchupScan()
}
That code existed since... forever.
Force-pushed from a43b57f to ef0e5ca.
Overall looks workable, with some nits about documenting it.
The part about enabling slices in metrics makes me uneasy, as we are removing an assertion. Maybe split that out into a separate commit in this PR?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ericharmeling, @erikgrinaker, and @miretskiy)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 372 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Perhaps; but we never released the catchup scan reservation when receiving an empty checkpoint:
if !t.ResolvedTS.IsEmpty() && active.catchupRes != nil {
    active.releaseCatchupScan()
}
That code existed since... forever.
I checked the history: we added the catchup scan metric with this condition in 2022 (#77711), but there's no justification for why we filter the empty checkpoint, which is specifically meant to indicate the end of catch-up, as pointed out above.
Adding the limiter was a subsequent commit. Maybe that's the reason no one paid attention?
This check is deferring throttling to the next emitted checkpoint. It was likely negligible with a 200ms interval, but if we want to bump it, then I think we won't see a checkpoint for a bit.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 374 at r2 (raw file):
// It is released when this stream receives first non-empty checkpoint
// (meaning: catchup scan completes).
catchupRes catchupAlloc
I think this struct is becoming hairy, and the comment about thread-safe `partialRangeFeed` is making it worse. In fact, it now has two parts: one is the partial range feed state under the mutex, and the other is state that is maintained explicitly for `singleRangeFeed`.
The unsafe part is used within a switch on the event type, and the other is for `onRangeEvent`, which has its own switch on the type internally.
It looks like an unfortunate side effect of moving the state out of `singleRangeFeed` to be able to do cleanups. And the unsafe part is safe only by convention: it is only ever used by the creator of the struct or by the work loop (also by the mux rangefeed work loop).
If we can't keep them separate, at least we should call it out.
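Something like the following is the kind of call-out being asked for -- a sketch only, with hypothetical field names, showing the two groups of state and the convention that protects the non-mutex part:

```go
// Sketch only: illustrating how the two groups of fields could be documented.
// Field and type names are hypothetical, not the actual PR code.
package main

import "sync"

type activeRangeFeedSketch struct {
	// Fields below are accessed only by the goroutine that created this
	// struct (the per-range work loop), so they need no synchronization.
	// This is a convention, not something the compiler enforces.
	catchupRes func() // released on the first checkpoint

	// Fields below may be read by other goroutines (e.g. when populating the
	// crdb_internal.active_rangefeed virtual table), so they are guarded by mu.
	mu struct {
		sync.Mutex
		resolvedTS int64
		inCatchup  bool
	}
}

func (a *activeRangeFeedSketch) setResolved(ts int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.mu.resolvedTS = ts
}

func main() {
	var a activeRangeFeedSketch
	a.setResolved(42)
}
```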
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 511 at r2 (raw file):
}

// partialRangeFeed establishes a RangeFeed to the range specified by desc. It
nit: The comment seems out of date, as desc hasn't been passed in since 2018. We now use the token as a destination to establish the RangeFeed.
Worth updating since we are changing the signature here.
pkg/util/metric/registry.go
line 256 at r2 (raw file):
switch fieldType.Kind() {
case reflect.Array, reflect.Slice:
Is there any rationale for why we explicitly forbid slices? Did we try to avoid reference types and keep everything by value?
Catchup scan parts LGTM, but I agree with Oleg, let's split off the metrics changes to a separate PR. The slice stuff needs a bit closer attention, and we may want to backport this in which case we want to keep changes separate and focused.
Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @ericharmeling, and @miretskiy)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 386 at r2 (raw file):
RetryErrors: retryCounters,
Stuck:       metric.NewCounter(retryMeta("stuck")),
SendErrors:  metric.NewCounter(retryMeta("send error")),
nit: this will be formatted as "send error error".
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 372 at r1 (raw file):
Previously, aliher1911 (Oleg) wrote…
I checked the history: we added the catchup scan metric with this condition in 2022 (#77711), but there's no justification for why we filter the empty checkpoint, which is specifically meant to indicate the end of catch-up, as pointed out above.
Adding the limiter was a subsequent commit. Maybe that's the reason no one paid attention?
This check is deferring throttling to the next emitted checkpoint. It was likely negligible with a 200ms interval, but if we want to bump it, then I think we won't see a checkpoint for a bit.
On master, this will now be every 3 seconds, which is fairly significant. We should release the catchup scan semaphore as soon as we receive a checkpoint, empty or otherwise. We can either fix that here or in a separate PR, but we should fix it.
Ack re metrics. I actually meant to do that but accidentally committed together. Thanks for bringing this up.
I did some archeological digging to determine why we supported arrays vs both arrays and slices. See newly added commit for explanation.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @ericharmeling, and @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 372 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
On master, this will now be every 3 seconds, which is fairly significant. We should release the catchup scan semaphore as soon as we receive a checkpoint, empty or otherwise. We can either fix that here or in a separate PR, but we should fix it.
Okay; I'm convinced it's safe to drop this restriction; and I know we handle a 0 timestamp on the changefeed side anyway.
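A minimal sketch of the agreed-upon fix, releasing the quota on any checkpoint rather than only a non-empty one; the types here are simplified stand-ins, not the actual kvcoord code:

```go
// Sketch only: release the catchup-scan allocation on the first checkpoint,
// empty resolved timestamp or not, since the first checkpoint is what signals
// that the catch-up scan has completed. Simplified stand-ins for the real code.
package main

import "fmt"

type checkpoint struct{ resolvedTS int64 } // 0 means "empty"

type activeRangeFeed struct {
	catchupRes func() // non-nil while the catchup-scan quota is held
}

func (a *activeRangeFeed) releaseCatchupScan() {
	if a.catchupRes != nil {
		a.catchupRes()
		a.catchupRes = nil
	}
}

func (a *activeRangeFeed) onCheckpoint(t checkpoint) {
	// Before: released only if !t.ResolvedTS.IsEmpty(). After: any checkpoint
	// releases the quota, so throttling is not deferred to the next
	// (potentially seconds-away) non-empty checkpoint.
	if a.catchupRes != nil {
		a.releaseCatchupScan()
	}
}

func main() {
	a := &activeRangeFeed{catchupRes: func() { fmt.Println("catchup quota released") }}
	a.onCheckpoint(checkpoint{resolvedTS: 0}) // empty checkpoint still releases
}
```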
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 374 at r2 (raw file):
Previously, aliher1911 (Oleg) wrote…
I think this struct is becoming hairy, and the comment about thread-safe `partialRangeFeed` is making it worse. In fact, it now has two parts: one is the partial range feed state under the mutex, and the other is state that is maintained explicitly for `singleRangeFeed`.
The unsafe part is used within a switch on the event type, and the other is for `onRangeEvent`, which has its own switch on the type internally.
It looks like an unfortunate side effect of moving the state out of `singleRangeFeed` to be able to do cleanups. And the unsafe part is safe only by convention: it is only ever used by the creator of the struct or by the work loop (also by the mux rangefeed work loop).
If we can't keep them separate, at least we should call it out.
I really can't disagree w/ you.
I tried to do a bit of a cleanup by introducing mutable/immutable structs, etc.
I also tried to use atomics and avoid the mutex altogether. Turns out, the mutex was cheaper.
So, I understand what you mean; I decided to punt on this for now, but I have updated the comments, including additional comments on the mutex and why it's not that scary.
Let me know if you think that's okay, @aliher1911; if not, I'll try to do something else.
name old time/op new time/op delta
ActiveRangeFeedState-10 34.9ns ± 1% 47.8ns ± 4% +37.01% (p=0.000 n=9+10)
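For context, a microbenchmark along the following lines is one plausible way such a mutex-vs-atomic comparison could be set up; the names and the immutable-copy variant are assumptions, not the PR's actual benchmark:

```go
// Sketch only: one plausible shape of the comparison described above --
// updating a couple of fields under a mutex vs. atomically swapping in an
// immutable copy of the state. Names are hypothetical, not the PR's code.
package kvcoord_test

import (
	"sync"
	"sync/atomic"
	"testing"
)

type feedState struct {
	resolvedTS int64
	numEvents  int64
}

type mutexFeed struct {
	mu    sync.Mutex
	state feedState
}

func (f *mutexFeed) onEvent(ts int64) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.state.resolvedTS = ts
	f.state.numEvents++
}

type atomicFeed struct {
	state atomic.Pointer[feedState]
}

func (f *atomicFeed) onEvent(ts int64) {
	old := f.state.Load()
	next := feedState{resolvedTS: ts, numEvents: old.numEvents + 1}
	f.state.Store(&next) // allocates a new copy per update
}

func BenchmarkMutexFeed(b *testing.B) {
	var f mutexFeed
	for i := 0; i < b.N; i++ {
		f.onEvent(int64(i))
	}
}

func BenchmarkAtomicFeed(b *testing.B) {
	var f atomicFeed
	f.state.Store(&feedState{})
	for i := 0; i < b.N; i++ {
		f.onEvent(int64(i))
	}
}
```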
pkg/util/metric/registry.go
line 256 at r2 (raw file):
Previously, aliher1911 (Oleg) wrote…
Is there any rationale for why we explicitly forbid slices? Did we try to avoid reference types and keep everything by value?
I've created a new commit explaining why. I don't think that's the rationale, because regardless of array vs slice, they are not added directly; instead they are expanded to add their values anyway. So, as far as I can tell, this restriction was superfluous.
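A simplified sketch of the behavior being described, treating arrays and slices identically when walking a metrics struct (illustrative only; the real registry handles many more cases):

```go
// Sketch only: expand arrays and slices the same way when walking a metrics
// struct -- register each element, never the container itself. Simplified
// stand-in for the real registry code.
package main

import (
	"fmt"
	"reflect"
)

type Counter struct{ Name string }

func register(c *Counter) { fmt.Println("registered", c.Name) }

func addStruct(metrics interface{}) {
	v := reflect.ValueOf(metrics).Elem()
	for i := 0; i < v.NumField(); i++ {
		f := v.Field(i)
		switch f.Kind() {
		case reflect.Array, reflect.Slice:
			// Arrays and slices are treated identically: expand and add
			// each element; the container itself is not registered.
			for j := 0; j < f.Len(); j++ {
				register(f.Index(j).Interface().(*Counter))
			}
		case reflect.Ptr:
			register(f.Interface().(*Counter))
		}
	}
}

type sampleMetrics struct {
	Total       *Counter
	RetryErrors []*Counter
}

func main() {
	addStruct(&sampleMetrics{
		Total:       &Counter{Name: "total"},
		RetryErrors: []*Counter{{Name: "retry.range_split"}, {Name: "retry.slow_consumer"}},
	})
}
```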
Force-pushed from 41844c2 to 07f93f1.
👍
Ack re metrics.
Well, the metrics are still added in the same commit as the pacing, and the metrics need the slice support, so this still sort of hampers targeted backportability -- I'd prefer to separate them out, but your call.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911 and @ericharmeling)
Acquire catchup scan quota prior to goroutine creation in order to pace the goroutine creation rate. This change results in nice and smooth growth in goroutine count, thus reducing the pressure on the goroutine scheduler, which in turn reduces the impact on SQL latency during changefeed startup.

Improve observability over running changefeeds by adding a column to the `crdb_internal.active_rangefeed` virtual table to indicate if the range is currently in catchup scan mode.

Fixes cockroachdb#98842

Release note (enterprise change): Pace rangefeed goroutine creation rate to improve scheduler latency. Improve observability by adding an additional column in the `crdb_internal.active_rangefeed` table to indicate if the range is currently in catchup scan mode.
Extend the metric registry to allow metric slices to be used. Prior to this change the metric registry only allowed arrays. Slices are treated identically to arrays -- the values are expanded and each metric is added to the registry; the container itself is not added.

From a historical archeological perspective, cockroachdb#46747 added support for embedded metric arrays. My best read on this PR is that it narrowly addressed the existing "TODOs" and avoided doing anything extra (which is good!). But it does not appear that there is a fundamental reason why slices should not be supported the same way.

Epic: None

Release note: None
Improve rangefeed observability by introducing new counters:

* `distsender.rangefeed.retry.<reason>`: counter keeping track of the number of ranges that encountered a retryable error of a particular type (e.g. slow consumer, range split, etc).

Fixes cockroachdb#98842

Release note (ops change): Improve rangefeed observability by adding additional metrics indicating the reason for rangefeed restart.
Sorry, you're right -- I misunderstood what you meant. Took a bit of git surgery, but I've split the commits. I'm still pretty much 100% sure that the whole thing is safe to backport; but let's hedge anyway.
bors r+
Build succeeded.
The catchup scan limit was added in cockroachdb#77725 in order to attempt to restrict a single rangefeed from consuming all catchup scan slots on the KV side. This client-side limit has been largely ineffective. More recently, this semaphore has been co-opted in cockroachdb#109346 in order to pace the goroutine creation rate on the client. This functionality is important, but the implementation is not very good. Namely, we are interested in controlling the rate of new catchup scans being started by the client (and thus controlling the rate of goroutine creation).

This implementation replaces the old implementation with a rate-limit-based approach. The rate limits are configured using two new settings: `kv.rangefeed.max_catchup_scans`, which sets the maximum burst rate of the number of currently running catchup scans (default 180), and a `kv.rangefeed.catchup_scan_duration_estimate` setting which sets the expected duration of a single catchup scan. Taken together, these settings by default will limit the number of catchup scans to ~60/second.

Closes cockroachdb#110439

Epic: CRDB-26372

Release note: None
The catchup scan limit was added in cockroachdb#77725 in order to attempt to restrict a single rangefeed from consuming all catchup scan slots on the KV side. This client-side limit has been largely ineffective. More recently, this semaphore has been co-opted in cockroachdb#109346 in order to pace the goroutine creation rate on the client. This functionality is important, but the implementation is not very good. Namely, we are interested in controlling the rate of new catchup scans being started by the client (and thus controlling the rate of goroutine creation).

This implementation replaces the old implementation with a rate-limit-based approach. The rate limits are configured using two new settings: `kv.rangefeed.client.startup_range_burst`, which sets the maximum burst rate on the number of newly established rangefeed streams (default 180), and a `kv.rangefeed.client.startup_window` setting which sets the window period over which "burst" rangefeed connections may be established.

Closes cockroachdb#110439

Epic: CRDB-26372

Release note: None
The catchup scan limit was added in cockroachdb#77725 in order to attempt to restrict a single rangefeed from consuming all catchup scan slots on the KV side. This client-side limit has been largely ineffective. More recently, this semaphore has been co-opted in cockroachdb#109346 in order to pace the goroutine creation rate on the client. This functionality is important, but the implementation is not very good. Namely, we are interested in controlling the rate of new catchup scans being started by the client (and thus controlling the rate of goroutine creation).

This implementation replaces the old implementation with a rate-limit-based approach. The rate limits are configured using the `kv.rangefeed.client.stream_startup_rate` setting, which sets the rate on the number of newly established rangefeed streams (default 100).

Closes cockroachdb#110439

Epic: CRDB-26372

Release note: None
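A minimal sketch of the rate-limit-based approach, using golang.org/x/time/rate; only the `kv.rangefeed.client.stream_startup_rate` setting name comes from the commit message above, everything else is illustrative:

```go
// Sketch only: pace rangefeed stream startup with a token-bucket rate limiter
// instead of a semaphore. The setting name is from the commit message above;
// everything else is illustrative.
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// streamStartupRate stands in for the kv.rangefeed.client.stream_startup_rate
// cluster setting (default 100 new streams per second).
const streamStartupRate = 100

func main() {
	limiter := rate.NewLimiter(rate.Limit(streamStartupRate), streamStartupRate)
	ctx := context.Background()

	start := time.Now()
	const ranges = 300
	for i := 0; i < ranges; i++ {
		// Each new per-range stream (and its goroutine) must first obtain a
		// token, so startup is spread out instead of spiking.
		if err := limiter.Wait(ctx); err != nil {
			return
		}
		go func(i int) { _ = i /* start the per-range rangefeed here */ }(i)
	}
	fmt.Printf("started %d streams in %s\n", ranges, time.Since(start))
}
```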
110919: kvcoord: Replace rangefeed catchup semaphore with rate limiter r=miretskiy a=miretskiy

The catchup scan limit was added in #77725 in order to attempt to restrict a single rangefeed from consuming all catchup scan slots on the KV side. This client-side limit has been largely ineffective. More recently, this semaphore has been co-opted in #109346 in order to pace the goroutine creation rate on the client. This functionality is important, but the implementation is not very good. Namely, we are interested in controlling the rate of new catchup scans being started by the client (and thus controlling the rate of goroutine creation).

This implementation replaces the old implementation with a rate-limit-based approach. The rate limits are configured using the `kv.rangefeed.client.stream_startup_rate` setting, which sets the rate on the number of newly established rangefeed streams (default 100).

Closes #110439

Epic: CRDB-26372

Release note: None

110929: dev: fix cross builds running in git worktrees r=rickystewart a=liamgillies

Beforehand, running a cross build inside a git worktree would always fail. This PR adds code to also mount the main git repo into docker when in a git worktree, so that running cross builds will succeed.

Fixes: #110735

Release note: None

Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Liam Gillies <[email protected]>
Acquire catchup scan quota prior to goroutine creation in order to pace the goroutine creation rate.

This change results in nice and smooth growth in goroutine count, thus reducing the pressure on the goroutine scheduler, which in turn reduces the impact on SQL latency during changefeed startup.

This change also improves observability in the rangefeed client by introducing new counters:

* `distsender.rangefeed.retry.<reason>`: counter keeping track of the number of ranges that encountered a retryable error of a particular type (e.g. slow consumer, range split, etc).

Observability is also enhanced by adding a column to the `crdb_internal.active_rangefeed` virtual table to indicate if the range is currently in catchup scan mode.

Fixes #98842

Release note (enterprise change): Pace rangefeed goroutine creation rate to improve scheduler latency. Improve observability by adding additional metrics indicating the reason for rangefeed restart, as well as an additional column in the `crdb_internal.active_rangefeed` table to indicate if the range is currently in catchup scan mode.