kvcoord: restart stuck RangeFeeds #86820
Conversation
This is still a draft PR since I don't have any tests yet, but it shows the direction I'm thinking of.
I'm open to suggestions, including closing this PR, keeping in mind that the original escalation was not root-caused.
Force-pushed from 496b7f7 to 570b76e
It's unfortunate that we have to resort to this, but it seems pragmatic.
I'm in favor of this too. Left some comments, but they're not substantial. Let's go ahead with this; it just needs a little bit of cleaning up and testing as you mentioned. It's a shame we can't cheaply put a timeout on each individual Recv call; I'd love to have a few lines of code to backport.
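For context, the straightforward way to bound each individual Recv would be something like the stand-alone sketch below (made-up names, not proposed code); it spawns a goroutine and allocates a timer for every event, and the goroutine can linger past the timeout, which is one reason it isn't cheap:

package main

import (
	"errors"
	"fmt"
	"time"
)

type recvResult struct {
	ev  string
	err error
}

// recvWithTimeout wraps a blocking receive with a per-call deadline. It pays
// for a goroutine and a timer on every single event, and if the timeout fires
// first, the goroutine lingers until recv eventually returns.
func recvWithTimeout(recv func() (string, error), timeout time.Duration) (string, error) {
	ch := make(chan recvResult, 1)
	go func() {
		ev, err := recv()
		ch <- recvResult{ev, err}
	}()
	select {
	case r := <-ch:
		return r.ev, r.err
	case <-time.After(timeout):
		return "", errors.New("no event received within timeout")
	}
}

func main() {
	ev, err := recvWithTimeout(func() (string, error) {
		time.Sleep(50 * time.Millisecond) // stand-in for stream.Recv()
		return "checkpoint", nil
	}, time.Second)
	fmt.Println(ev, err)
}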
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @miretskiy)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 317 at r1 (raw file):
a.LastErr = errors.Wrapf(err, "disconnect with last checkpoint at %s (%s ago)",
    a.Resolved, sinceCheckpoint)
nit: sinceCheckpoint won't redact properly. Consider making it of type time.Time
and accepting that it'll print a large number if a.Resolved is unset.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 328 at r1 (raw file):
mu struct {
    syncutil.Mutex
    heartbeats heartBeatHeap
The words "heartbeat" and "liveness" in the present context seem not ideal to me, since the mechanism here is passive: it's really checking that there is regular incoming activity on the RangeFeed. Consider renaming a la lastEvents / lastEventsMap or the like, to scope it a little tighter.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 431 at r1 (raw file):
maxTS, err = ds.singleRangeFeed(
    ctx, span, startAfter, withDiff, token.Desc(), catchupSem, eventCh, streamProducerFactory, cb)
Will cb be invoked on every event? Are we worried about lock contention on the activeRangeFeed?
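For reference, the shape of the concern is roughly the following stand-alone sketch (assumed names, sync.Mutex standing in for syncutil.Mutex): the callback runs once per received event and takes the active rangefeed's lock, so the per-event cost is one lock acquisition plus a small amount of bookkeeping.

package main

import (
	"sync"
	"time"
)

// activeRangeFeed is a sketch of per-stream bookkeeping updated on every event.
type activeRangeFeed struct {
	mu        sync.Mutex
	lastEvent time.Time
	resolved  time.Time
}

// onRangeEvent is the kind of callback invoked for each received event.
func (a *activeRangeFeed) onRangeEvent(checkpoint time.Time) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.lastEvent = time.Now()
	if !checkpoint.IsZero() {
		a.resolved = checkpoint
	}
}

func main() {
	a := &activeRangeFeed{}
	a.onRangeEvent(time.Time{}) // a regular event
	a.onRangeEvent(time.Now())  // a checkpoint event
}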
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 483 at r1 (raw file):
default:
    if reason := contextutil.GetCancelReason(ctx); reason != nil && errors.Is(reason, restartLivenessErr) {
        // Evict the descriptor from the cache and reload on next attempt.
What the comment says now is clear from the next two lines, but explain why we think this is a good idea now.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 779 at r1 (raw file):
n := len(old)
x := old[n-1]
*h = old[0 : n-1]
gofmt?
I hear you. This approach, though arguably a lot more than a few lines of code, is backportable since a) it's disabled by default and b) the existing code is modified in very few places. Thanks for taking a look -- I'll address your comments and add a test too.
Come to think of it, can we do this? (I'm using AfterFunc since I assume it's internally optimized to avoid a goroutine per timer):
diff --git a/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go b/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
index b7b8fad404..9be90322e1 100644
--- a/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
@@ -461,8 +461,32 @@ func (ds *DistSender) singleRangeFeed(
continue
}
+ const eventCheckInterval = time.Minute // from cluster setting
+
+ // We want to cancel the context if we don't receive an event
+ // in a while. We try to do this without doing too much work
+ // (allocations, etc) for each message received to bound the
+ // overhead in case everything is working fine and messages
+ // are potentially rushing in at high frequency. To do this,
+ // we set up a timer that would cancel the context, and
+ // whenever it is ~half expired, stop it (after the next
+ // message is there) and re-start it. That way, we allocate
+ // only ~twice per eventCheckInterval, which is acceptable.
+ clientCtx, cancel := context.WithCancel(clientCtx)
+ mkCancelTimer := func() *time.Timer {
+ return time.AfterFunc(eventCheckInterval, cancel)
+ }
+ timer := mkCancelTimer()
+
+ baseEventTS := timeutil.Now()
for {
event, err := stream.Recv()
+ now := timeutil.Now()
+ if now.Sub(baseEventTS) > eventCheckInterval/2 {
+ baseEventTS = now
+ timer.Stop()
+ timer = mkCancelTimer()
+ }
if err == io.EOF {
return args.Timestamp, nil
}
That's very interesting, @tbg
I am concerned about time.AfterFunc -- isn't it creating goroutines? The solution is definitely nice, but it has the cost of goroutines... wdyt @tbg?
WithCancel also creates goroutines -- so care must be taken to always guarantee to call cancel() to free up resources...
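For what it's worth, per the Go docs time.AfterFunc runs its callback in its own goroutine only once the timer fires; while the timer is pending there is no extra goroutine. And the cancel function returned by context.WithCancel should always be called to release the context's resources. Here is a minimal, self-contained sketch of the watchdog pattern being discussed (names and structure are illustrative, not the PR's actual code):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// watchRecv cancels ctx if more than threshold elapses between events.
// The timer is only re-armed once it is at least half expired, so the
// per-event overhead stays low on busy streams.
func watchRecv(ctx context.Context, recv func(context.Context) error, threshold time.Duration) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // always release the context's resources

	timer := time.AfterFunc(threshold, cancel)
	defer func() { timer.Stop() }() // stop whichever timer is current on exit

	lastReset := time.Now()
	for {
		if err := recv(ctx); err != nil {
			if ctx.Err() != nil {
				return errors.New("stream canceled: no events within threshold")
			}
			return err
		}
		if now := time.Now(); now.Sub(lastReset) > threshold/2 {
			lastReset = now
			timer.Stop()
			timer = time.AfterFunc(threshold, cancel)
		}
	}
}

func main() {
	// Toy usage: a "stream" that stops producing events after a few messages.
	events := 0
	err := watchRecv(context.Background(), func(ctx context.Context) error {
		events++
		if events > 3 {
			<-ctx.Done() // simulate a stuck Recv that only returns on cancellation
			return ctx.Err()
		}
		return nil
	}, 100*time.Millisecond)
	fmt.Println(err)
}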
Done.
@tbg I took your suggestion; still haven't tested/checked it, though.
Force-pushed from 32707a1 to c9af40e
@tbg: tests added; ready for review.
Thanks, looks great! I took the liberty to refactor it a bit more and to add some additional testing, PTAL at the additional commit (which I invite you to squash after having reviewed+edited it if necessary).
Could you confirm that there is logging when we cancel a stuck rangefeed? We need this to be obvious from the logs when it occurs, and the message should contain as much information as possible, i.e. the duration of the stall and which node:store:range the feed had been pointing at before it got torn down.
Reviewed 1 of 3 files at r2, 1 of 4 files at r4, 1 of 3 files at r5, 1 of 1 files at r6, 1 of 1 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @miretskiy, and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 70 at r6 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Should this default to 0?
A one minute default is good. If a rangefeed is already stuck, the cluster setting doesn't work so well, and we are very confident (right?) that if a rangefeed doesn't checkpoint for a minute then it's basically toast.
One risk I can think of: if the cluster is in a bad state, this might trigger additional catch-up scans.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 427 at r7 (raw file):
// defensive and attempt to restart this rangefeed. Usually, any
// stuck-ness is cleared out if we just attempt to re-resolve range
// descriptor and retry.
But wouldn't you want to log here? Or is it logged somewhere already? We need to know the (NodeID, StoreID, RangeID) triplet causing the problem.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 643 at r7 (raw file):
// sentinel error returned when cancelling rangefeed when it is stuck.
var errRestartStuckRange = errors.New("rangefeed restarting due to liveness")
Let's not use the word liveness here, which could be confused with node liveness. How about "restarting stuck rangefeed after %s of inactivity".
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go
line 389 at r7 (raw file):
if !b.wasStuck {
    ctx := b.Internal_RangeFeedClient.Context()
    <-ctx.Done()
Isn't this just going to deadlock if the canceler ever doesn't work? That's not a great failure mode. Could you add a (say 10s) timeout here, and if we hit that, don't set wasStuck but return an error anyway?
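Roughly the shape of that guard (a sketch only; ctx and b.wasStuck come from the quoted test code above, the 10s figure and the error text are placeholders, and the return values depend on the surrounding function's signature):

select {
case <-ctx.Done():
	// The canceler tore the stream down as expected.
	b.wasStuck = true
case <-time.After(10 * time.Second):
	// The canceler never fired; fail the test instead of hanging forever.
	return nil, errors.New("rangefeed was not canceled within 10s")
}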
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go
line 414 at r7 (raw file):
Knobs: base.TestingKnobs{
    KVClient: &kvcoord.ClientTestingKnobs{
        TransportFactory: makeTransportFactory(false, nil, wrapRfClient),
Isn't it unacceptable to block all rangefeeds here? For serverless parity, we need at least the rangefeeds that tenants use to pull from the server to work.
Only feeds started by this test are blocked. The factory delegates to the default implementation for all other feeds.
I'm going to push this over the line since Yevgeniy is on PTO now.
Reviewed 3 of 3 files at r8, all commit messages.
Dismissed @miretskiy from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 427 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
But wouldn't you want to log here? Or is it logged somewhere already? We need to know the (NodeID,StoreID,RangeID) triplet causing the problem
Added logging.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 643 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Let's not use the word liveness here, which could be confused with node liveness. How about "restarting stuck rangefeed after %s of inactivity".
Made the change.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go
line 389 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this just going to deadlock if the canceler ever doesn't work? That's not a great failure mode. Could you add a (say 10s) timeout here, and if we hit that, don't set wasStuck but return an error anyway?
Added a timeout here.
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_test.go
line 414 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't it unacceptable to block all rangefeeds here? For serverless parity, we need at least the rangefeeds that tenants use to pull from the server to work.
Yevgeniy explained on main thread that this does not affect the built-in rangefeeds.
Nothing but nits, thanks!
Reviewed 8 of 8 files at r9, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/kv/kvclient/kvcoord/dist_sender.go
line 169 at r9 (raw file):
metaDistSenderRangefeedRestartStuck = metric.Metadata{
    Name: "distsender.rangefeed.restart_stuck",
    Help: `Number of times a RangeFeed was restarted due to not receiving ` +
nit: RangeFeed is not a proper noun, so "rangefeed".
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go
line 72 at r9 (raw file):
"kv.rangefeed.range_stuck_threshold", "restart rangefeeds if they appear to be stuck for the specified threshold; 0 disables", time.Minute,
Will the backport be default-off? Also, should we make this public?
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go
line 25 at r9 (raw file):
// not received events from the KV layer in some time. Rangefeeds are supposed to receive
// regular updates, as at the very least they ought to be receiving closed timestamps.
// However, issues[^1] at the KV layer could prevent this.
nit: should there be a link here?
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go
line 35 at r9 (raw file):
// be large enough (i.e. at least a couple of seconds) to make this negligible.
// Concretely, a timer is set that would invoke the cancellation, and the timer
// is reset on the first call to ping() after the timer is at least half
So on average, this would fire when 3/4 of the interval has elapsed without an event? Might be worth mentioning this in the setting description somehow. I'm also not sure if we should have the setting be a lower bound rather than an upper bound (which it currently is).
👍 fixed up CI (missing chart catalog update) and will merge it tomorrow assuming it turns green.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @miretskiy, and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go
line 35 at r9 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
So on average, this would fire when 3/4 of the interval has elapsed without an event? Might be worth mentioning this in the setting description somehow. I'm also not sure if we should have the setting be a lower bound rather than an upper bound (which it currently is).
No, the timer fires when the duration has elapsed. We refresh the timer if we see an event while it is at least half expired. It is a lower bound - if you set the setting to 60s, then you will never get canceled unless a Recv call took >=60s.
Reviewed 5 of 5 files at r10, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @tbg)
pkg/kv/kvclient/kvcoord/dist_sender_rangefeed_canceler.go
line 35 at r9 (raw file):
Previously, tbg (Tobias Grieger) wrote…
No, the timer fires when the duration has elapsed. We refresh the timer if we see an event while it is at least half expired. It is a lower bound - if you set the setting to 60s, then you will never get canceled unless a Recv call took >=60s.
No? Let's say we set it to 60s:
T=0: ping
T=29: ping (does not reset timer, because 29 < 30)
T=60: timer fires, rangefeed canceled
In this case, the rangefeed was disconnected after 31 seconds of inactivity, even though the setting says 60s.
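More generally, with threshold T and the last ping landing at offset s into the timer window (s < T/2, so it does not re-arm the timer):

observed inactivity at cancellation = T - s
0 <= s < T/2  =>  T/2 < T - s <= T

so a feed can be torn down after just over half the configured threshold of inactivity.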
Duh, thanks for spelling it out. I'll fix this up tomorrow with an internal multiplier.
Reviewed 4 of 8 files at r9, 5 of 5 files at r10, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
It has been observed in the wild that rangefeeds (and changefeeds that use them) would appear to be stuck and not make any progress. It has been determined that, for reasons yet unknown, the RangeFeed RPC stops receiving events from the server, even though the contract indicates that such events should always come in, since events also include range checkpoint records that should always be emitted periodically.

This PR introduces a defense-in-depth mechanism to the client-side range feed library so that ranges that appear to be stuck are restarted automatically, based on the value of the `kv.rangefeed.range_stuck_threshold` cluster setting. Concretely, DistSender waits until it has seen the first event on the stream (since there is a semaphore on the server side that can delay when the catch-up phase first emits data) and then activates the protection mechanism. The metric `distsender.rangefeed.restart_stuck` tracks how often this fires, along with the newly added telemetry `rangefeed.stuck.{during,after}-catchup-scan`.

Touches cockroachlabs/support#1729.

Release justification: stability improvement.

Release note (enterprise change): The new `kv.rangefeed.range_stuck_threshold` (default 60s) cluster setting instructs RangeFeed clients (used internally by changefeeds) to restart automatically if no checkpoint or other event has been received from the server for some time. This is a defense-in-depth mechanism which will log output as follows if triggered: restarting stuck rangefeed: waiting for r100 (n1,s1):1 [threshold 1m]: rangefeed restarting due to inactivity.
bors r=erikgrinaker
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong: you might need to create your backport manually using the backport tool.

error creating merge commit from 5a012d8 to blathers/backport-release-21.2-86820: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 21.2.x failed. See errors above.

error creating merge commit from 5a012d8 to blathers/backport-deprecate-release-22.1.0-86820: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []; you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 22.1.0 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
It has been observed in the wild that rangefeeds (and changefeeds
that use them) would appear to be stuck and not make any progress.
It has been determined that, for reasons yet unknown, the
RangeFeed RPC stops receiving events from the server, even though
the contract indicates that such events should always come in
since events also include range checkpoint records that should always
be emitted periodically.
This PR introduces a defense in depth mechanism to the client side range
feed library so that ranges that appear to be stuck are restarted
automatically, based on the value of the kv.rangefeed.range_stuck_threshold cluster setting.
Touches https://github.com/cockroachlabs/support/issues/1729.
Release justification: stability improvement.
Release note (enterprise change): The new kv.rangefeed.range_stuck_threshold (default 60s) cluster setting instructs RangeFeed clients (used internally by changefeeds) to restart automatically if no checkpoint or other event has been received from the server for some time. This is a defense-in-depth mechanism.