release-22.1: kvcoord: restart stuck RangeFeeds #87253

Merged 1 commit into cockroachdb:release-22.1 from backport22.1-86820 on Sep 5, 2022

Conversation

@tbg (Member) commented Sep 1, 2022

Backport 1/1 commits from #86820.

/cc @cockroachdb/release


It has been observed in the wild that rangefeeds (and the changefeeds
that use them) can appear to be stuck and make no progress. For reasons
yet unknown, the RangeFeed RPC stops receiving events from the server,
even though the contract indicates that events should keep arriving:
they include range checkpoint records, which should be emitted
periodically.

This PR introduces a defense-in-depth mechanism in the client-side
rangefeed library so that ranges that appear to be stuck are restarted
automatically, based on the value of the
`kv.rangefeed.range_stuck_threshold` cluster setting.

Concretely, DistSender waits until it has seen the first event on the
stream (since a server-side semaphore can delay when the catch-up phase
first emits data) and only then activates the protection mechanism.
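
To make the shape of this concrete, here is a minimal, self-contained sketch of such a watchdog. It is not the code from this PR: the actual `stuckWatcher` exposes a `stuck()` method that DistSender checks (see the review quotes further down), whereas this sketch uses a simpler timer-and-cancel shape; names other than `stuckWatcher`, and the details beyond arm-on-first-event, are illustrative assumptions.

```go
package sketch // illustrative only; not code from this PR

import (
	"context"
	"time"
)

// stuckWatcher is a minimal sketch of the watchdog described above. Once
// armed by the first event, it cancels the stream's context if no further
// event arrives within the threshold; the caller then restarts the
// per-range feed.
type stuckWatcher struct {
	threshold time.Duration // driven by kv.rangefeed.range_stuck_threshold
	timer     *time.Timer
	armed     bool
}

// ping is called for every event received on the stream. The first call arms
// the watchdog (so a slow catch-up scan cannot trip it); later calls reset it.
func (w *stuckWatcher) ping(cancel context.CancelFunc) {
	if w.threshold == 0 {
		return // setting disabled
	}
	if !w.armed {
		w.armed = true
		w.timer = time.AfterFunc(w.threshold, func() { cancel() })
		return
	}
	w.timer.Reset(w.threshold)
}

// stop releases the timer when the stream ends for any other reason.
func (w *stuckWatcher) stop() {
	if w.timer != nil {
		w.timer.Stop()
	}
}
```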

The metric `distsender.rangefeed.restart_stuck` tracks how often this
fires, along with newly added telemetry
`rangefeed.stuck.{during,after}-catchup-scan`.

Note: the backport has some extra code not present on `master`:

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event. We weren't previously marking this
path as a changefeed retry. Now we are.

`master` will need a proper fix for this as well.

`TestRestartsStuckRangeFeeds` from this commit reliably shakes this out
within only a few iterations.

Touches https://github.com/cockroachlabs/support/issues/1729.

Release justification: stability improvement.
Release note (enterprise change): The new
`kv.rangefeed.range_stuck_threshold` (default 0, i.e. disabled) cluster
setting instructs RangeFeed clients (used internally by changefeeds) to
restart automatically if no checkpoint or other event has been received
from the server for some time. This is a defense-in-depth mechanism
which will log output as follows if triggered: `restarting stuck
rangefeed: waiting for r100 (n1,s1):1 [threshold 1m]: rangefeed
restarting due to inactivity`.

@blathers-crl (bot) commented Sep 1, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg force-pushed the backport22.1-86820 branch 4 times, most recently from c135361 to 97264ca Compare September 1, 2022 11:55
@tbg (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


-- commits line 26 at r2:
☝️


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 70 at r1 (raw file):

	"kv.rangefeed.range_stuck_threshold",
	"restart rangefeeds if they appear to be stuck for the specified threshold; 0 disables",
	time.Minute,

☝️
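
For readers without the full file in front of them, the quoted lines are the tail of a duration-setting registration. A hedged reconstruction follows; the variable name, the setting class, and the exact `RegisterDurationSetting` argument list are assumptions and may differ from the file under review.

```go
package sketch // illustrative reconstruction only

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// Hypothetical shape of the registration the quoted lines belong to.
var rangefeedRangeStuckThreshold = settings.RegisterDurationSetting(
	settings.TenantWritable, // setting class is an assumption
	"kv.rangefeed.range_stuck_threshold",
	"restart rangefeeds if they appear to be stuck for the specified threshold; 0 disables",
	time.Minute, // default shown at r1; the release note documents a default of 0
)
```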


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 513 at r2 (raw file):

			if err != nil {
				if stuckWatcher.stuck() {
					return args.Timestamp, ds.handleStuckEvent(&args, catchupRes == nil, stuckWatcher.threshold())

`handleStuckEvent` inlines the code that is present here on `master`, without additional changes.
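
Since the backport inlines the `master` code here rather than calling a shared helper, a hedged sketch of what that helper plausibly covers is shown below. It is not the PR's code: only the telemetry counter names and the error wording come from the PR description; the function name, parameters, and the omission of the metric bump are assumptions.

```go
package sketch // illustrative only

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/server/telemetry"
	"github.com/cockroachdb/errors"
)

// handleStuckEventSketch stands in for the helper discussed above. The real
// code also increments the distsender.rangefeed.restart_stuck metric; that is
// omitted here because the internal metric field name is not shown in this PR.
func handleStuckEventSketch(afterCatchupScan bool, threshold time.Duration) error {
	// Telemetry counter names are taken from the PR description.
	if afterCatchupScan {
		telemetry.Count("rangefeed.stuck.after-catchup-scan")
	} else {
		telemetry.Count("rangefeed.stuck.during-catchup-scan")
	}
	// Error wording mirrors the log line quoted in the release note.
	return errors.Errorf("rangefeed restarting due to inactivity (threshold %s)", threshold)
}
```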


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 515 at r2 (raw file):

					return args.Timestamp, ds.handleStuckEvent(&args, catchupRes == nil, stuckWatcher.threshold())
				}
				return args.Timestamp, errors.Wrap(err, "not-stuck")

Oops, I need to remove this wrapping. No good debug session without accidentally committing something.

Code quote:

return args.Timestamp, errors.Wrap(err, "not-stuck")

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 534 at r2 (raw file):

					ds.metrics.RangefeedErrorCatchup.Inc(1)
				}
				if stuckWatcher.stuck() {

This is new, and I need something like this on `master` as well. The test discovered this.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go line 1 at r2 (raw file):

// Copyright 2022 The Cockroach Authors.

This file is a clean backport, so no close review needed.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler_test.go line 1 at r2 (raw file):

// Copyright 2022 The Cockroach Authors.

Clean backport, no close review needed.


pkg/kv/kvclient/kvcoord/dist_sender_stuck_rangefeed_test.go line 37 at r2 (raw file):

)

func TestRestartsStuckRangeFeeds(t *testing.T) {

I had to completely rewrite this test to avoid pulling in a bunch more code, so review this as new code.


pkg/kv/kvclient/kvcoord/dist_sender_stuck_rangefeed_test.go line 122 at r2 (raw file):

}

// rangeFeed is a helper to execute rangefeed.  We are not using rangefeed library

This was cribbed from `master`, but just pretend it's new since I had to change it a bunch anyway.
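
As context for readers outside the review, a minimal sketch of the general shape of such a helper — draining raw rangefeed events from a channel until a checkpoint covers the watched span — follows. It is not the helper in this PR; the function name, signature, and termination condition are assumptions.

```go
package sketch // illustrative only; not the test helper from this PR

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// collectUntilCheckpoint drains rangefeed events from eventCh until a
// checkpoint covers the whole watched span, returning the values seen so far.
func collectUntilCheckpoint(
	ctx context.Context, sp roachpb.Span, eventCh <-chan *roachpb.RangeFeedEvent,
) ([]*roachpb.RangeFeedValue, error) {
	var vals []*roachpb.RangeFeedValue
	for {
		select {
		case <-ctx.Done():
			return vals, ctx.Err()
		case ev := <-eventCh:
			switch {
			case ev.Val != nil:
				vals = append(vals, ev.Val)
			case ev.Error != nil:
				// Server-sent errors (including the cancellation case discussed
				// in this backport) surface here.
				return vals, ev.Error.Error.GoError()
			case ev.Checkpoint != nil && ev.Checkpoint.Span.Contains(sp):
				// A checkpoint covering the full span means the feed has caught up.
				return vals, nil
			}
		}
	}
}
```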

@tbg tbg force-pushed the backport22.1-86820 branch from 97264ca to 80f5534 Compare September 1, 2022 11:59
@tbg tbg marked this pull request as ready for review September 1, 2022 12:00
@tbg tbg requested a review from a team September 1, 2022 12:00
@tbg tbg requested a review from a team as a code owner September 1, 2022 12:00
@tbg tbg requested a review from erikgrinaker September 1, 2022 12:00
@tbg (Member, Author) commented Sep 1, 2022

@erikgrinaker please refer to my Reviewable comments for guidance on how to review.

@erikgrinaker (Contributor) left a comment

:lgtm: Thanks for the additional effort here!

Reviewed 6 of 7 files at r1, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


-- commits line 26 at r2:

Previously, tbg (Tobias Grieger) wrote…

☝️

Nice find.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 534 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is new and I need something like this on master as well. The test discovered this.

Seems reasonable.

@tbg tbg merged commit a70cc0e into cockroachdb:release-22.1 Sep 5, 2022
@tbg tbg deleted the backport22.1-86820 branch September 5, 2022 09:16
tbg added a commit to tbg/cockroach that referenced this pull request Sep 12, 2022
Front-ports parts of cockroachdb#87253.

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event.

We now handle this the same as the other case (where the stream client
errors out due to the cancellation).

This also checks in the test from
cockroachdb#87253 (which is on
release-22.1).

Fixes cockroachdb#87370.

No release note since this will be backported to release-22.2
Release note: None
craig bot pushed a commit that referenced this pull request Sep 12, 2022
87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling

This commit fixes a small bug on the transaction insight details page
that incorrectly mapped the waiting transaction's statement fingerprints
to the blocking transaction's statements. It also aligns the summary
cards on the details page and removes the contended key while we look
for a more user-friendly format to display row contention.

Before:

![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png)


After:

![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png)

Fixes #87838.

Release note: None
Release justification: bug fix

87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query because remote nodes cannot dial back to the gateway for some reason (the investigation is tracked in #87634). For now, we're papering over these flakes with a 4-second sleep.

Addresses: #87104.

Release note: None

87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner

The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes.

Informs #86790

Release note: None

87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg

Front-ports parts of #87253.

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event.

We now handle this the same as the other case (where the stream client
errors out due to the cancellation).

This also checks in the test from
#87253 (which is on
release-22.1).

Fixes #87370.

No release note since this will be backported to release-22.2
Release note: None


Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Sep 12, 2022