release-22.1: kvcoord: restart stuck RangeFeeds #87253

Merged 1 commit into cockroachdb:release-22.1 from backport22.1-86820 on Sep 5, 2022

Conversation

@tbg (Member) commented Sep 1, 2022

Backport 1/1 commits from #86820.

/cc @cockroachdb/release


It has been observed in the wild that rangefeeds (and the changefeeds
that use them) can appear to be stuck and make no progress. For reasons
yet unknown, the RangeFeed RPC stops receiving events from the server,
even though the contract indicates that events should keep arriving:
they include range checkpoint records, which should be emitted
periodically.

This PR introduces a defense-in-depth mechanism in the client-side
rangefeed library so that ranges that appear to be stuck are restarted
automatically, based on the value of the
`kv.rangefeed.range_stuck_threshold` cluster setting.

Concretely, DistSender waits until it has seen the first event on the
stream (since a server-side semaphore can delay when the catch-up phase
first emits data) and only then activates the protection mechanism.
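
To make the shape of this concrete, here is a minimal, self-contained sketch of such a watchdog. It is not the code from this PR: the actual `stuckWatcher` exposes a `stuck()` method that DistSender checks (see the review quotes further down), whereas this sketch uses a simpler timer-and-cancel shape; names other than `stuckWatcher`, and the details beyond arm-on-first-event, are illustrative assumptions.

```go
package sketch // illustrative only; not code from this PR

import (
	"context"
	"time"
)

// stuckWatcher is a minimal sketch of the watchdog described above. Once
// armed by the first event, it cancels the stream's context if no further
// event arrives within the threshold; the caller then restarts the
// per-range feed.
type stuckWatcher struct {
	threshold time.Duration // driven by kv.rangefeed.range_stuck_threshold
	timer     *time.Timer
	armed     bool
}

// ping is called for every event received on the stream. The first call arms
// the watchdog (so a slow catch-up scan cannot trip it); later calls reset it.
func (w *stuckWatcher) ping(cancel context.CancelFunc) {
	if w.threshold == 0 {
		return // setting disabled
	}
	if !w.armed {
		w.armed = true
		w.timer = time.AfterFunc(w.threshold, func() { cancel() })
		return
	}
	w.timer.Reset(w.threshold)
}

// stop releases the timer when the stream ends for any other reason.
func (w *stuckWatcher) stop() {
	if w.timer != nil {
		w.timer.Stop()
	}
}
```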

The metric `distsender.rangefeed.restart_stuck` tracks how often this
fires, along with newly added telemetry
`rangefeed.stuck.{during,after}-catchup-scan`.

Note: the backport has some extra code not present on `master`:

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event. We weren't previously marking this
path as a changefeed retry. Now we are.

`master` will need a proper fix for this as well.

`TestRestartsStuckRangeFeeds` from this commit reliably shakes this out
within only a few iterations.

Touches https://github.com/cockroachlabs/support/issues/1729.

Release justification: stability improvement.
Release note (enterprise change): The new
`kv.rangefeed.range_stuck_threshold` (default 0, i.e. disabled) cluster
setting instructs RangeFeed clients (used internally by changefeeds) to
restart automatically if no checkpoint or other event has been received
from the server for some time. This is a defense-in-depth mechanism
which will log output as follows if triggered: `restarting stuck
rangefeed: waiting for r100 (n1,s1):1 [threshold 1m]: rangefeed
restarting due to inactivity`.

@blathers-crl (bot) commented Sep 1, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg force-pushed the backport22.1-86820 branch 4 times, most recently from c135361 to 97264ca Compare September 1, 2022 11:55
@tbg (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


-- commits line 26 at r2:
☝️


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 70 at r1 (raw file):

	"kv.rangefeed.range_stuck_threshold",
	"restart rangefeeds if they appear to be stuck for the specified threshold; 0 disables",
	time.Minute,

☝️
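
For readers without the full file in front of them, the quoted lines are the tail of a duration-setting registration. A hedged reconstruction follows; the variable name, the setting class, and the exact `RegisterDurationSetting` argument list are assumptions and may differ from the file under review.

```go
package sketch // illustrative reconstruction only

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/settings"
)

// Hypothetical shape of the registration the quoted lines belong to.
var rangefeedRangeStuckThreshold = settings.RegisterDurationSetting(
	settings.TenantWritable, // setting class is an assumption
	"kv.rangefeed.range_stuck_threshold",
	"restart rangefeeds if they appear to be stuck for the specified threshold; 0 disables",
	time.Minute, // default shown at r1; the release note documents a default of 0
)
```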


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 513 at r2 (raw file):

			if err != nil {
				if stuckWatcher.stuck() {
					return args.Timestamp, ds.handleStuckEvent(&args, catchupRes == nil, stuckWatcher.threshold())

`handleStuckEvent` inlines the code that is present here on `master`, without additional changes.
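
Since the backport inlines the `master` code here rather than calling a shared helper, a hedged sketch of what that helper plausibly covers is shown below. It is not the PR's code: only the telemetry counter names and the error wording come from the PR description; the function name, parameters, and the omission of the metric bump are assumptions.

```go
package sketch // illustrative only

import (
	"time"

	"github.com/cockroachdb/cockroach/pkg/server/telemetry"
	"github.com/cockroachdb/errors"
)

// handleStuckEventSketch stands in for the helper discussed above. The real
// code also increments the distsender.rangefeed.restart_stuck metric; that is
// omitted here because the internal metric field name is not shown in this PR.
func handleStuckEventSketch(afterCatchupScan bool, threshold time.Duration) error {
	// Telemetry counter names are taken from the PR description.
	if afterCatchupScan {
		telemetry.Count("rangefeed.stuck.after-catchup-scan")
	} else {
		telemetry.Count("rangefeed.stuck.during-catchup-scan")
	}
	// Error wording mirrors the log line quoted in the release note.
	return errors.Errorf("rangefeed restarting due to inactivity (threshold %s)", threshold)
}
```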


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 515 at r2 (raw file):

					return args.Timestamp, ds.handleStuckEvent(&args, catchupRes == nil, stuckWatcher.threshold())
				}
				return args.Timestamp, errors.Wrap(err, "not-stuck")

Oops, I need to remove this wrapping. No good debug session without accidentally committing something.

Code quote:

return args.Timestamp, errors.Wrap(err, "not-stuck")

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 534 at r2 (raw file):

					ds.metrics.RangefeedErrorCatchup.Inc(1)
				}
				if stuckWatcher.stuck() {

This is new, and I need something like this on `master` as well. The test discovered this.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go line 1 at r2 (raw file):

// Copyright 2022 The Cockroach Authors.

This file is a clean backport, so no close review needed.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler_test.go line 1 at r2 (raw file):

// Copyright 2022 The Cockroach Authors.

Clean backport, no close review needed.


pkg/kv/kvclient/kvcoord/dist_sender_stuck_rangefeed_test.go line 37 at r2 (raw file):

)

func TestRestartsStuckRangeFeeds(t *testing.T) {

I had to completely rewrite this test to avoid pulling in a bunch more code, so review this as new code.


pkg/kv/kvclient/kvcoord/dist_sender_stuck_rangefeed_test.go line 122 at r2 (raw file):

}

// rangeFeed is a helper to execute rangefeed.  We are not using rangefeed library

This was cribbed from `master`, but just pretend it's new since I had to change it a bunch anyway.
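
As context for readers outside the review, a minimal sketch of the general shape of such a helper — draining raw rangefeed events from a channel until a checkpoint covers the watched span — follows. It is not the helper in this PR; the function name, signature, and termination condition are assumptions.

```go
package sketch // illustrative only; not the test helper from this PR

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// collectUntilCheckpoint drains rangefeed events from eventCh until a
// checkpoint covers the whole watched span, returning the values seen so far.
func collectUntilCheckpoint(
	ctx context.Context, sp roachpb.Span, eventCh <-chan *roachpb.RangeFeedEvent,
) ([]*roachpb.RangeFeedValue, error) {
	var vals []*roachpb.RangeFeedValue
	for {
		select {
		case <-ctx.Done():
			return vals, ctx.Err()
		case ev := <-eventCh:
			switch {
			case ev.Val != nil:
				vals = append(vals, ev.Val)
			case ev.Error != nil:
				// Server-sent errors (including the cancellation case discussed
				// in this backport) surface here.
				return vals, ev.Error.Error.GoError()
			case ev.Checkpoint != nil && ev.Checkpoint.Span.Contains(sp):
				// A checkpoint covering the full span means the feed has caught up.
				return vals, nil
			}
		}
	}
}
```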

@tbg tbg force-pushed the backport22.1-86820 branch from 97264ca to 80f5534 Compare September 1, 2022 11:59
@tbg tbg marked this pull request as ready for review September 1, 2022 12:00
@tbg tbg requested a review from a team September 1, 2022 12:00
@tbg tbg requested a review from a team as a code owner September 1, 2022 12:00
@tbg tbg requested a review from erikgrinaker September 1, 2022 12:00
@tbg (Member, Author) commented Sep 1, 2022

@erikgrinaker please refer to my Reviewable comments for guidance on how to review.

@erikgrinaker (Contributor) left a comment

:lgtm: Thanks for the additional effort here!

Reviewed 6 of 7 files at r1, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


-- commits line 26 at r2:

Previously, tbg (Tobias Grieger) wrote…

☝️

Nice find.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 534 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is new and I need something like this on master as well. The test discovered this.

Seems reasonable.

@tbg tbg merged commit a70cc0e into cockroachdb:release-22.1 Sep 5, 2022
@tbg tbg deleted the backport22.1-86820 branch September 5, 2022 09:16
tbg added a commit to tbg/cockroach that referenced this pull request Sep 12, 2022
Front-ports parts of cockroachdb#87253.

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event.

We now handle this the same as the other case (where the stream client
errors out due to the cancellation).

This also checks in the test from
cockroachdb#87253 (which is on
release-22.1).

Fixes cockroachdb#87370.

No release note since this will be backported to release-22.2
Release note: None
craig bot pushed a commit that referenced this pull request Sep 12, 2022
87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling

This commit fixes a small bug on the transaction insight details page
that incorrectly mapped the waiting transaction's statement fingerprints
to the blocking transaction's statements. It also aligns the summary
cards on the details page and removes the contended key while we look
for a more user-friendly format to display row contention.

Before:

![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png)


After:

![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png)

Fixes #87838.

Release note: None
Release justification: bug fix

87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query because remote nodes cannot dial back to the gateway for some reason (the investigation is tracked in #87634). For now, we're papering over these flakes with a 4-second sleep.

Addresses: #87104.

Release note: None

87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner

The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes.

Informs #86790

Release note: None

87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg

Front-ports parts of #87253.

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event.

We now handle this the same as the other case (where the stream client
errors out due to the cancellation).

This also checks in the test from
#87253 (which is on
release-22.1).

Fixes #87370.

No release note since this will be backported to release-22.2
Release note: None


Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Sep 12, 2022