release-22.1: kvcoord: restart stuck RangeFeeds #87253
Conversation
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
Force-pushed from c135361 to 97264ca.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 70 at r1 (raw file):
"kv.rangefeed.range_stuck_threshold", "restart rangefeeds if they appear to be stuck for the specified threshold; 0 disables", time.Minute,
☝️
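For context on the quoted arguments: registering a duration setting like this with the 22.1-era `pkg/settings` API would look roughly as follows. This is a sketch rather than a quote of the actual file; the variable name and the `settings.TenantWritable` class are assumptions.

```go
package kvcoord

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// rangefeedRangeStuckThreshold mirrors the quoted arguments above.
// The variable name and setting class are illustrative assumptions.
var rangefeedRangeStuckThreshold = settings.RegisterDurationSetting(
	settings.TenantWritable, // assumed class; the actual file may differ
	"kv.rangefeed.range_stuck_threshold",
	"restart rangefeeds if they appear to be stuck for the specified threshold; 0 disables",
	time.Minute,
)
```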
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 513 at r2 (raw file):
if err != nil {
	if stuckWatcher.stuck() {
		return args.Timestamp, ds.handleStuckEvent(&args, catchupRes == nil, stuckWatcher.threshold())
handleStuckEvent inlines the code that is present here on master, without additional changes.
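For reviewers without master open: as described in the PR text, `handleStuckEvent` bumps the `distsender.rangefeed.restart_stuck` metric, records whether the catch-up scan had already completed (telemetry under `rangefeed.stuck.{during,after}-catchup-scan`), and returns an error that triggers a restart of the single-range feed. A rough, self-contained sketch of that shape, using plain-Go stand-ins instead of the actual kvcoord metrics and telemetry plumbing:

```go
package rangefeedsketch

import (
	"fmt"
	"sync/atomic"
	"time"
)

// restartStuckCount stands in for the distsender.rangefeed.restart_stuck metric.
var restartStuckCount int64

// handleStuckEvent (illustrative): count the restart, note which phase the
// feed was in, and return an error so the caller tears down and restarts the
// single-range feed. The real helper also records telemetry and logs.
func handleStuckEvent(afterCatchupScan bool, threshold time.Duration) error {
	atomic.AddInt64(&restartStuckCount, 1)
	phase := "during-catchup-scan"
	if afterCatchupScan {
		phase = "after-catchup-scan"
	}
	return fmt.Errorf("rangefeed restarting due to inactivity (%s, threshold %s)", phase, threshold)
}
```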
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 515 at r2 (raw file):
	return args.Timestamp, ds.handleStuckEvent(&args, catchupRes == nil, stuckWatcher.threshold())
}
return args.Timestamp, errors.Wrap(err, "not-stuck")
Oops, I need to remove this wrapping. No good debug session without accidentally committing something.
Code quote:
return args.Timestamp, errors.Wrap(err, "not-stuck")
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 534 at r2 (raw file):
	ds.metrics.RangefeedErrorCatchup.Inc(1)
}
if stuckWatcher.stuck() {
This is new and I need something like this on master as well. The test discovered this.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go
line 1 at r2 (raw file):
// Copyright 2022 The Cockroach Authors.
This file is a clean backport, so no close review needed.
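For anyone who hasn't read the master version of this file: the canceler is essentially a small watcher that cancels the per-range stream context once no event has been seen for the threshold, which is what makes the blocked `Recv()` return an error that the caller can convert into a restart. A standalone sketch of that idea (illustrative, in the same sketch package as above; the real implementation also re-reads the cluster setting while running):

```go
package rangefeedsketch

import (
	"context"
	"sync/atomic"
	"time"
)

// stuckWatcher (illustrative) cancels the stream's context if ping() has not
// been called for longer than threshold. A zero threshold disables it.
type stuckWatcher struct {
	cancel        context.CancelFunc
	threshold     time.Duration
	lastEventNano int64 // unix nanos of the most recent event; 0 until the first event
	wasStuck      int32 // set to 1 once the watcher gave up and canceled
}

func (w *stuckWatcher) ping() { atomic.StoreInt64(&w.lastEventNano, time.Now().UnixNano()) }

func (w *stuckWatcher) stuck() bool { return atomic.LoadInt32(&w.wasStuck) == 1 }

// run polls until ctx is done; the caller starts it in a goroutine alongside
// the stream and stops it by canceling ctx.
func (w *stuckWatcher) run(ctx context.Context) {
	if w.threshold <= 0 {
		return // protection disabled
	}
	period := w.threshold / 4
	if period <= 0 {
		period = w.threshold
	}
	t := time.NewTicker(period)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			last := atomic.LoadInt64(&w.lastEventNano)
			// Only enforce once at least one event has been seen.
			if last != 0 && time.Since(time.Unix(0, last)) > w.threshold {
				atomic.StoreInt32(&w.wasStuck, 1)
				w.cancel() // the blocked stream now returns a cancellation error
				return
			}
		}
	}
}
```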
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler_test.go
line 1 at r2 (raw file):
// Copyright 2022 The Cockroach Authors.
Clean backport, no close review needed.
pkg/kv/kvclient/kvcoord/dist_sender_stuck_rangefeed_test.go
line 37 at r2 (raw file):
)

func TestRestartsStuckRangeFeeds(t *testing.T) {
I had to completely rewrite this test to avoid pulling in a bunch more code, so review this as new code.
pkg/kv/kvclient/kvcoord/dist_sender_stuck_rangefeed_test.go
line 122 at r2 (raw file):
}

// rangeFeed is a helper to execute rangefeed. We are not using rangefeed library
This was cribbed from master, but just pretend it's new since I had to change it a bunch anyway.
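At the pattern level, such a helper can be sketched without touching the real kvcoord API: start the feed in a goroutine, drain its events into a slice, and hand the test an accessor plus a `closeFeed` function. Everything below (names, signatures) is illustrative, not the test code under review:

```go
package rangefeedsketch

import (
	"context"
	"sync"
)

// runFeed runs feed in the background and collects whatever it sends on sink.
// events returns a snapshot of what has been received so far; closeFeed stops
// the feed by canceling its context. The feed function is expected to exit
// once its context is canceled.
func runFeed[T any](
	ctx context.Context, feed func(ctx context.Context, sink chan<- T) error,
) (events func() []T, closeFeed func()) {
	ctx, cancel := context.WithCancel(ctx)
	sink := make(chan T)

	var mu sync.Mutex
	var collected []T

	go func() {
		_ = feed(ctx, sink) // a real test helper would assert on this error
	}()
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case ev := <-sink:
				mu.Lock()
				collected = append(collected, ev)
				mu.Unlock()
			}
		}
	}()

	events = func() []T {
		mu.Lock()
		defer mu.Unlock()
		return append([]T(nil), collected...)
	}
	return events, cancel
}
```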
It has been observed in the wild that rangefeeds (and changefeeds that use them) would appear to be stuck and not make any progress. It has been determined that, for reasons yet unknown, the RangeFeed RPC stops receiving events from the server, even though the contract indicates that such events should always come in, since events also include range checkpoint records that should always be emitted periodically.

This PR introduces a defense-in-depth mechanism in the client-side rangefeed library so that ranges that appear to be stuck are restarted automatically, based on the value of the `kv.rangefeed.range_stuck_threshold` cluster setting.

Concretely, DistSender waits until it has seen the first event on the stream (since there is a semaphore on the server side that can limit delay when the catch-up phase first emits data) and then activates the protection mechanism.

The metric `distsender.rangefeed.restart_stuck` tracks how often this fires, along with the newly added telemetry `rangefeed.stuck.{during,after}-catchup-scan`.

Note: the backport has some extra code not present on `master`: when a rangefeed gets stuck and the server is local, the server might notice the cancellation before the client and may send a cancellation error back in a rangefeed event. We weren't previously marking this path as a changefeed retry; now we are. `master` will need a proper fix for this as well. `TestRestartStuckRangeFeeds` from this commit reliably shakes this out under flake in only a few iterations.

Touches cockroachlabs/support#1729.

Release justification: stability improvement.

Release note (enterprise change): The new `kv.rangefeed.range_stuck_threshold` cluster setting (default 0, i.e. disabled) instructs RangeFeed clients (used internally by changefeeds) to restart automatically if no checkpoint or other event has been received from the server for some time. This is a defense-in-depth mechanism which, if triggered, will log output as follows: `restarting stuck rangefeed: waiting for r100 (n1,s1):1 [threshold 1m]: rangefeed restarting due to inactivity`.
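Tying the sketches above together: the arming behavior described in the commit message (only enforce the threshold once events are actually flowing, and translate a stuck cancellation into a restart) might look roughly like this in the per-range receive loop. `eventStream`, `receiveLoop`, and `catchupDone` are illustrative stand-ins rather than the DistSender code; `stuckWatcher` and `handleStuckEvent` are the ones from the earlier sketches.

```go
package rangefeedsketch

// eventStream is a minimal stand-in for the per-range RangeFeed stream client.
type eventStream interface {
	Recv() (event interface{}, err error)
}

// receiveLoop sketches where the protection takes effect: the watcher is only
// fed once events arrive (the catch-up scan may be delayed by a server-side
// semaphore), and when the stream errors out while the watcher reports stuck,
// the error is replaced with the restart-due-to-inactivity error so that the
// caller restarts the single-range feed instead of giving up.
func receiveLoop(s eventStream, w *stuckWatcher, catchupDone func() bool) error {
	for {
		ev, err := s.Recv()
		if err != nil {
			if w.stuck() {
				return handleStuckEvent(catchupDone(), w.threshold)
			}
			return err
		}
		w.ping() // keeps the watcher fed; a quiet stream eventually trips it
		_ = ev   // ... deliver the event to the rangefeed consumer ...
	}
}
```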
Force-pushed from 97264ca to 80f5534.
@erikgrinaker please refer to my Reviewable comments for guidance on how to review.
Thanks for the additional effort here!
Reviewed 6 of 7 files at r1, 1 of 1 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
Previously, tbg (Tobias Grieger) wrote…
☝️
Nice find.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 534 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is new and I need something like this on master as well. The test discovered this.
Seems reasonable.
Front-ports parts of cockroachdb#87253.

When a rangefeed gets stuck, and the server is local, the server might notice the cancellation before the client, and may send a cancellation error back in a rangefeed event. We now handle this the same as the other case (where the stream client errors out due to the cancellation).

This also checks in the test from cockroachdb#87253 (which is on release-22.1).

Fixes cockroachdb#87370.

No release note since this will be backported to release-22.2.

Release note: None
87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling

This commit fixes a small bug on the transaction insight details page that was incorrectly mapping the waiting transaction statement fingerprints to the blocking transaction statements. The commit also aligns the summary cards in the details page. The commit also removes the contended key from the details page while we look for a more user-friendly format to display row contention.

Before:
![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png)

After:
![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png)

Fixes #87838.

Release note: None
Release justification: bug fix

87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query due to remote nodes not being able to dial back to the gateway for some reason (investigation of it is tracked in #87634). For now, we're papering over these flakes with a 4 second sleep.

Addresses: #87104.

Release note: None

87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner

The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes.

Informs #86790

Release note: None

87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg

Front-ports parts of #87253. When a rangefeed gets stuck, and the server is local, the server might notice the cancellation before the client, and may send a cancellation error back in a rangefeed event. We now handle this the same as the other case (where the stream client errors out due to the cancellation). This also checks in the test from #87253 (which is on release-22.1).

Fixes #87370.

No release note since this will be backported to release-22.2.

Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Backport 1/1 commits from #86820.
/cc @cockroachdb/release
It has been observed in the wild that rangefeeds (and changefeeds that use them) would appear to be stuck and not make any progress. It has been determined that, for reasons yet unknown, the RangeFeed RPC stops receiving events from the server, even though the contract indicates that such events should always come in, since events also include range checkpoint records that should always be emitted periodically.

This PR introduces a defense-in-depth mechanism in the client-side rangefeed library so that ranges that appear to be stuck are restarted automatically, based on the value of the `kv.rangefeed.range_stuck_threshold` cluster setting.

Concretely, DistSender waits until it has seen the first event on the stream (since there is a semaphore on the server side that can limit delay when the catch-up phase first emits data) and then activates the protection mechanism.

The metric `distsender.rangefeed.restart_stuck` tracks how often this fires, along with the newly added telemetry `rangefeed.stuck.{during,after}-catchup-scan`.

Note: the backport has some extra code not present on `master`: when a rangefeed gets stuck and the server is local, the server might notice the cancellation before the client and may send a cancellation error back in a rangefeed event. We weren't previously marking this path as a changefeed retry; now we are. `master` will need a proper fix for this as well. `TestRestartStuckRangeFeeds` from this commit reliably shakes this out under flake in only a few iterations.

Touches https://github.com/cockroachlabs/support/issues/1729.

Release justification: stability improvement.

Release note (enterprise change): The new `kv.rangefeed.range_stuck_threshold` cluster setting (default 0, i.e. disabled) instructs RangeFeed clients (used internally by changefeeds) to restart automatically if no checkpoint or other event has been received from the server for some time. This is a defense-in-depth mechanism which, if triggered, will log output as follows: `restarting stuck rangefeed: waiting for r100 (n1,s1):1 [threshold 1m]: rangefeed restarting due to inactivity`.