
kvcoord: restart stuck RangeFeeds #86820

Merged

merged 1 commit into cockroachdb:master on Aug 31, 2022

Conversation

miretskiy
Contributor

@miretskiy miretskiy commented Aug 24, 2022

It has been observed in the wild that rangefeeds (and the changefeeds
that use them) can appear to be stuck and stop making progress.
For reasons yet unknown, the RangeFeed RPC stops receiving events
from the server, even though the contract indicates that such events
should always arrive, since they include range checkpoint records
that are emitted periodically.

This PR introduces a defense-in-depth mechanism in the client-side
rangefeed library: ranges that appear to be stuck are restarted
automatically, based on the value of the
kv.rangefeed.range_stuck_threshold cluster setting.

Touches https://github.com/cockroachlabs/support/issues/1729.

Release justification: stability improvement.
Release note (enterprise change): The new
kv.rangefeed.range_stuck_threshold (default 60s) cluster setting
instructs RangeFeed clients (used internally by changefeeds) to restart
automatically if no checkpoint or other event has been received from
the server for some time. This is a defense-in-depth mechanism.

@miretskiy miretskiy requested a review from a team as a code owner August 24, 2022 21:45
@miretskiy miretskiy requested a review from a team August 24, 2022 21:45
@cockroach-teamcity
Member

This change is Reviewable

@miretskiy
Contributor Author

miretskiy commented Aug 24, 2022

Informs #86820 #86818

@miretskiy miretskiy marked this pull request as draft August 24, 2022 21:46
@miretskiy
Contributor Author

This is still a draft PR since I don't have any tests yet; but this shows the direction I'm thinking of...

@miretskiy miretskiy requested review from erikgrinaker and tbg August 24, 2022 21:52
@miretskiy
Contributor Author

I'm open to suggestions, including closing this PR. Keep in mind that the original escalation was not root-caused
for a long time, and we have customers experiencing multiple stuckness events.

@miretskiy miretskiy force-pushed the unstuck branch 3 times, most recently from 496b7f7 to 570b76e on August 24, 2022 22:00
@erikgrinaker
Contributor

It's unfortunate that we have to resort to this, but it seems pragmatic.

Member

@tbg tbg left a comment


I'm in favor of this too. Left some comments, but they're not substantial. Let's go ahead with this; it just needs a little bit of cleaning up and testing, as you mentioned. It's a shame we can't cheaply put a timeout on each individual Recv call; I'd love to have a few lines of code to backport.



pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 317 at r1 (raw file):

	}
	a.LastErr = errors.Wrapf(err, "disconnect with last checkpoint at %s (%s ago)",
		a.Resolved, sinceCheckpoint)

nit: sinceCheckpoint won't redact properly. Consider making it of type time.Time and accepting that it'll print a large number if a.Resolved is unset.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 328 at r1 (raw file):

	mu     struct {
		syncutil.Mutex
		heartbeats heartBeatHeap

The words "heartbeat" and "liveness" in the present context seem not ideal to me, since the mechanism here is passive -- it's really checking that there is regular incoming activity on the RangeFeed. Consider renaming to something like lastEvents or lastEventsMap, to scope it a little tighter.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 431 at r1 (raw file):

			maxTS, err = ds.singleRangeFeed(
				ctx, span, startAfter, withDiff, token.Desc(),
				catchupSem, eventCh, streamProducerFactory, cb)

will cb be invoked on every event? Are we worried about lock contention on the activeRangeFeed?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 483 at r1 (raw file):

			default:
				if reason := contextutil.GetCancelReason(ctx); reason != nil && errors.Is(reason, restartLivenessErr) {
					// Evict the descriptor from the cache and reload on next attempt.

What the comment says now is clear from the next two lines, but explain why we think this is a good idea now.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 779 at r1 (raw file):

	n := len(old)
	x := old[n-1]
	*h = old[0 : n-1]

gofmt?

@miretskiy
Contributor Author

It's a shame we can't cheaply put a timeout on each individual Recv call; I'd love to have a few lines of code to backport.

I hear you. This approach, though arguably a lot more than a few lines of code, is backportable since a) it's disabled by default and b) existing code is modified in very few places.

Thanks for taking a look -- I'll address your comments and add a test too.

@tbg
Member

tbg commented Aug 25, 2022

Come to think of it, can we do this? (I'm using AfterFunc since I assume it's internally optimized to avoid a goroutine per timer):

diff --git a/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go b/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
index b7b8fad404..9be90322e1 100644
--- a/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
@@ -461,8 +461,32 @@ func (ds *DistSender) singleRangeFeed(
 			continue
 		}
 
+		const eventCheckInterval = time.Minute // from cluster setting
+
+		// We want to cancel the context if we don't receive an event
+		// in a while. We try to do this without doing too much work
+		// (allocations, etc) for each message received to bound the
+		// overhead in case everything is working fine and messages
+		// are potentially rushing in at high frequency. To do this,
+		// we set up a timer that would cancel the context, and
+		// whenever it is ~half expired, stop it (after the next
+		// message is there) and re-start it. That way, we allocate
+		// only ~twice per eventCheckInterval, which is acceptable.
+		clientCtx, cancel := context.WithCancel(clientCtx)
+		mkCancelTimer := func() *time.Timer {
+			return time.AfterFunc(eventCheckInterval, cancel)
+		}
+		timer := mkCancelTimer()
+
+		baseEventTS := timeutil.Now()
 		for {
 			event, err := stream.Recv()
+			now := timeutil.Now()
+			if now.Sub(baseEventTS) > eventCheckInterval/2 {
+				baseEventTS = now
+				timer.Stop()
+				timer = mkCancelTimer()
+			}
 			if err == io.EOF {
 				return args.Timestamp, nil
 			}

@miretskiy
Contributor Author

miretskiy commented Aug 25, 2022 via email

@miretskiy
Contributor Author

clientCtx, cancel := context.WithCancel(clientCtx)
mkCancelTimer := func() *time.Timer {
	return time.AfterFunc(eventCheckInterval, cancel)
}
timer := mkCancelTimer()

I am concerned about time.AfterFunc -- isn't it creating goroutines? The solution is definitely nice, but it has the cost of goroutines... wdyt @tbg?

@miretskiy
Contributor Author

I am concerned about time.AfterFunc -- isn't it creating goroutines? The solution is definitely nice, but it has the cost of goroutines... wdyt @tbg?

WithCancel can also end up creating goroutines (when the parent is a non-standard Context implementation) -- so care must be taken to always call cancel() to free up resources...
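
For reference, a minimal runnable sketch (not from this PR) of the AfterFunc-plus-WithCancel pattern under discussion; it shows the lifecycle concerns raised here: stop the timer and always call cancel() to release resources.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	// Always release the context's resources, even if the timer never fires.
	defer cancel()

	// Cancel the context if no "event" arrives within the interval.
	// time.AfterFunc does not park a goroutine while waiting; it runs the
	// callback in a new goroutine only when the timer actually fires.
	timer := time.AfterFunc(50*time.Millisecond, cancel)
	// Stop the timer once the watchdog is no longer needed.
	defer timer.Stop()

	select {
	case <-time.After(10 * time.Millisecond):
		fmt.Println("event arrived in time")
	case <-ctx.Done():
		fmt.Println("canceled due to inactivity:", ctx.Err())
	}
}
```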

@miretskiy
Contributor Author

What the comment says now is clear from the next two lines, but explain why we think this is a good idea now.

Done.

@miretskiy
Contributor Author

@tbg I took your suggestion; I still haven't tested/checked it though.
I decided to wrap it in a struct; it's a bit more code, but I want to react to settings changes without having to restart, so that required a bit more typing. Overall, it is still much simpler. It does appear that AfterFunc is efficient, as it uses a single timer queue in the runtime to trigger the callbacks. Of course, it might be possible to create too many of those calls, but
I think that's fine. Certainly, a more complex solution can be revisited if needed.
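
For readers following along, a minimal sketch of what such a struct-based watchdog could look like. The names (stuckWatcher, ping) and the overall shape are assumptions for illustration; it omits the cluster-setting plumbing and the 0-means-disabled handling, and it is not the code in this PR.

```go
package kvcoordsketch

import (
	"sync"
	"time"
)

// stuckWatcher is a hypothetical per-stream watchdog: ping() is called for
// every event received from the stream, and if no ping arrives for the
// configured threshold, cancel() is invoked so the rangefeed's context is
// torn down and the range is restarted.
type stuckWatcher struct {
	cancel       func()               // cancels the stream's context
	getThreshold func() time.Duration // re-read on each re-arm, e.g. from a cluster setting

	mu struct {
		sync.Mutex
		lastRearm time.Time
		timer     *time.Timer
	}
}

func newStuckWatcher(cancel func(), getThreshold func() time.Duration) *stuckWatcher {
	w := &stuckWatcher{cancel: cancel, getThreshold: getThreshold}
	w.mu.lastRearm = time.Now()
	w.mu.timer = time.AfterFunc(getThreshold(), cancel)
	return w
}

// ping re-arms the timer, but only once it is at least half expired, so the
// per-event overhead stays at roughly two timer allocations per threshold
// interval even when events rush in at high frequency.
func (w *stuckWatcher) ping() {
	w.mu.Lock()
	defer w.mu.Unlock()
	threshold := w.getThreshold()
	if now := time.Now(); now.Sub(w.mu.lastRearm) > threshold/2 {
		w.mu.lastRearm = now
		w.mu.timer.Stop()
		w.mu.timer = time.AfterFunc(threshold, w.cancel)
	}
}

// stop releases the timer when the stream ends.
func (w *stuckWatcher) stop() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.mu.timer.Stop()
}
```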

@miretskiy miretskiy marked this pull request as ready for review August 25, 2022 18:29
@miretskiy miretskiy force-pushed the unstuck branch 2 times, most recently from 32707a1 to c9af40e on August 25, 2022 20:12
@miretskiy
Contributor Author

@tbg : tests added; ready for review.

Member

@tbg tbg left a comment


Thanks, looks great! I took the liberty of refactoring it a bit more and adding some additional testing; PTAL at the additional commit (which I invite you to squash after having reviewed and edited it, if necessary).

Could you confirm that there is logging when we cancel a stuck rangefeed? We need this to be obvious from the logs when it occurs, and the message should contain as much information as possible, i.e. the duration of the stall and which node:store:range the feed had been pointing at before it got torn down.



pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 70 at r6 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Should this default to 0?

A one-minute default is good. If a rangefeed is already stuck, the cluster setting doesn't work so well, and we are very confident (right?) that if a rangefeed doesn't checkpoint for a minute then it's basically toast.

One risk I can think of is that, if the cluster is in a bad state, this might trigger additional catch-up scans.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 427 at r7 (raw file):

				// defensive and attempt to restart this rangefeed. Usually, any
				// stuck-ness is cleared out if we just attempt to re-resolve range
				// descriptor and retry.

But wouldn't you want to log here? Or is it logged somewhere already? We need to know the (NodeID,StoreID,RangeID) triplet causing the problem


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 643 at r7 (raw file):

// sentinel error returned when cancelling rangefeed when it is stuck.
var errRestartStuckRange = errors.New("rangefeed restarting due to liveness")

Let's not use the word liveness here, which could be confused for node liveness. How about `restarting stuck rangefeed after %s of inactivity`?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 389 at r7 (raw file):

	if !b.wasStuck {
		ctx := b.Internal_RangeFeedClient.Context()
		<-ctx.Done()

Isn't this just going to deadlock if the canceler ever doesn't work? That's not a great failure mode, could you add a (say 10s) timeout here and if we hit that we don't set wasStuck but return an error anyway?
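
Something along these lines, sketched as a fragment against the test code quoted above (b, ctx, and the enclosing function's error return are assumed from that context; the exact shape in the PR may differ):

```go
select {
case <-ctx.Done():
	// The canceler tore down the stream, which is what the test expects.
	b.wasStuck = true
case <-time.After(10 * time.Second):
	// Avoid deadlocking the whole test if the canceler never fires;
	// leave wasStuck unset and surface an error instead.
	return nil, errors.New("timed out waiting for the stuck rangefeed to be canceled")
}
```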


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 414 at r7 (raw file):

			Knobs: base.TestingKnobs{
				KVClient: &kvcoord.ClientTestingKnobs{
					TransportFactory: makeTransportFactory(false, nil, wrapRfClient),

Isn't it unacceptable to block all rangefeeds here? For serverless parity, we need at least the rangefeeds that tenants use to pull from the server to work.

@miretskiy
Contributor Author

Isn't it unacceptable to block all rangefeeds here? For serverless parity, we need at least the rangefeeds that tenants use to pull from the server to work.

Only feeds started by this test are blocked. The factory delegates to the default implementation for all other feeds...

@tbg
Member

tbg commented Aug 29, 2022

I'm going to push this over the line since Yevgeniy is on PTO now.

Member

@tbg tbg left a comment




pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 427 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

But wouldn't you want to log here? Or is it logged somewhere already? We need to know the (NodeID,StoreID,RangeID) triplet causing the problem

Added logging.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 643 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Let's not use the word liveness here, which could be confused for node liveness. How about `restarting stuck rangefeed after %s of inactivity`?

Made the change.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 389 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't this just going to deadlock if the canceler ever doesn't work? That's not a great failure mode, could you add a (say 10s) timeout here and if we hit that we don't set wasStuck but return an error anyway?

Added a timeout here.


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go line 414 at r7 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Isn't it unacceptable to block all rangefeeds here? For serverless parity, we need at least the rangefeeds that tenants use to pull from the server to work.

Yevgeniy explained on the main thread that this does not affect the built-in rangefeeds.

@tbg tbg removed the request for review from a team August 29, 2022 12:06
@tbg tbg changed the title kv: Restart stuck RangeFeed. kvcoord: restart stuck RangeFeeds Aug 29, 2022
Contributor

@erikgrinaker erikgrinaker left a comment


Nothing but nits, thanks!



pkg/kv/kvclient/kvcoord/dist_sender.go line 169 at r9 (raw file):

	metaDistSenderRangefeedRestartStuck = metric.Metadata{
		Name: "distsender.rangefeed.restart_stuck",
		Help: `Number of times a RangeFeed was restarted due to not receiving ` +

nit: RangeFeed is not a proper noun, so "rangefeed".


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 72 at r9 (raw file):

	"kv.rangefeed.range_stuck_threshold",
	"restart rangefeeds if they appear to be stuck for the specified threshold; 0 disables",
	time.Minute,

Will the backport be default-off? Also, should we make this public?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go line 25 at r9 (raw file):

// not received events from the KV layer in some time. Rangefeeds are supposed to receive
// regular updates, as at the very least they ought to be receiving closed timestamps.
// However, issues[^1] at the KV layer could prevent this.

nit: should there be a link here?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go line 35 at r9 (raw file):

// be large enough (i.e. at least a couple of seconds) to make this negligible.
// Concretely, a timer is set that would invoke the cancellation, and the timer
// is reset on the first call to ping() after the timer is at least half

So on average, this would fire when 3/4 of the interval has elapsed without an event? Might be worth mentioning this in the setting description somehow. I'm also not sure if we should have the setting be a lower bound rather than an upper bound (which it currently is).

@tbg tbg requested a review from a team August 30, 2022 15:01
Member

@tbg tbg left a comment


👍 fixed up CI (missing chart catalog update) and will merge it tomorrow assuming it turns green.



pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go line 35 at r9 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

So on average, this would fire when 3/4 of the interval has elapsed without an event? Might be worth mentioning this in the setting description somehow. I'm also not sure if we should have the setting be a lower bound rather than an upper bound (which it currently is).

No, the timer fires when the duration has elapsed. We refresh the timer if we see an event while it is at least half expired. It is a lower bound - if you set the setting to 60s, then you will never get canceled unless a Recv call took >=60s.

Contributor

@erikgrinaker erikgrinaker left a comment




pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go line 35 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

No, the timer fires when the duration has elapsed. We refresh the timer if we see an event while it is at least half expired. It is a lower bound - if you set the setting to 60s, then you will never get canceled unless a Recv call took >=60s.

No? Let's say we set it to 60s:

T=0:   ping
T=29: ping (does not reset timer, because 29 < 30)
T=60: timer fires, rangefeed canceled

In this case, the rangefeed was disconnected after 31 seconds of inactivity, even though the setting says 60s.

@tbg
Member

tbg commented Aug 30, 2022

Duh, thanks for spelling it out. I'll fix this up tomorrow with an internal multiplier of 21.5.

Member

@tbg tbg left a comment



It has been observed in the wild that rangefeeds (and the changefeeds
that use them) can appear to be stuck and stop making progress.
For reasons yet unknown, the RangeFeed RPC stops receiving events
from the server, even though the contract indicates that such events
should always arrive, since they include range checkpoint records
that are emitted periodically.

This PR introduces a defense-in-depth mechanism in the client-side
rangefeed library: ranges that appear to be stuck are restarted
automatically, based on the value of the
`kv.rangefeed.range_stuck_threshold` cluster setting.

Concretely, DistSender waits until it has seen the first event on the
stream (since there is a server-side semaphore that can delay when the
catch-up phase first emits data) and only then activates the
protection mechanism.

The metric `distsender.rangefeed.restart_stuck` tracks how often this
fires, along with newly added telemetry
`rangefeed.stuck.{during,after}-catchup-scan`.

Touches cockroachlabs/support#1729.

Release justification: stability improvement.
Release note (enterprise change): The new
`kv.rangefeed.range_stuck_threshold` (default 60s) cluster setting
instructs RangeFeed clients (used internally by changefeeds) to restart
automatically if no checkpoint or other event has been received from
the server for some time. This is a defense-in-depth mechanism which
will log output as follows if triggered: `restarting stuck rangefeed:
waiting for r100 (n1,s1):1 [threshold 1m]: rangefeed restarting due to
inactivity`.
@tbg
Member

tbg commented Aug 31, 2022

bors r=erikgrinaker

@craig
Contributor

craig bot commented Aug 31, 2022

Build succeeded.

@craig craig bot merged commit 7bfd9f8 into cockroachdb:master Aug 31, 2022
@blathers-crl

blathers-crl bot commented Aug 31, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 5a012d8 to blathers/backport-release-21.2-86820: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


error creating merge commit from 5a012d8 to blathers/backport-deprecate-release-22.1.0-86820: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.0 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
