rpc/nodedialer: avoid tripping breaker on context errors #35433
Conversation
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/rpc/nodedialer/nodedialer.go, line 153 at r1 (raw file):
```go
// that a context cancellation might imply something is amiss with the
// connection it is not a clear enough signal. We will however let
// ErrDeadlineExceeded trip the breaker.
```
Why? context.DeadlineExceeded just means that the incoming context has a deadline or timeout. Why would that inform on the node's status? We don't want this code to trip the breaker:
```go
ctx, cancel := context.WithTimeout(ctx, time.Second)
time.Sleep(990 * time.Millisecond)
dialer.Dial(ctx, ...)
```
Maybe the right thing to do here is to only trip the breaker on gRPC errors (status.FromError(x)), probably in conjunction with
cockroach/pkg/storage/queue.go, lines 772 to 777 at 447fe7d:
```go
func isBenign(err error) bool {
	return causer.Visit(err, func(err error) bool {
		_, ok := err.(*benignError)
		return ok
	})
}
```
Then the code here would read
```go
if isGRPCError(err) && ctx.Err() == nil {
	...
}
```
We should make sure that our gRPC calls internally use a sane connection timeout (I don't think the incoming ctx generally has one, so that's not a new want).
Taking a step back, I'm less confident that we need to grep out the grpc-originating errors because what other errors could there be, and wouldn't we want to circuit break on them as well?
Force-pushed 1adec05 to 9aef6cc.
Thanks for taking a look. I'm going to write some tests now.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/nodedialer/nodedialer.go, line 153 at r1 (raw file):
Taking a step back, I'm less confident that we need to grep out the grpc-originating errors because what other errors could there be, and wouldn't we want to circuit break on them as well?
I agree with this idea. I've updated the diff to just record into the breaker if ctx.Err() == nil
There are a few other errors that could arise but roughly all of them relate to the node shutting down. We could get context errors from the underlying rpc.Context's masterCtx which gets cancelled upon shutdown. Furthermore we could just generally get errors from the stopper itself. None of those are really very interesting though because the node is already shutting down.
Sounds good, ping me when this is RFAL
Force-pushed 9aef6cc to f93818b.
Testing added, RFAL
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
Force-pushed f93818b to 3f47307.
Force-pushed cb5e70b to 2e722cf.
Force-pushed 9eb6986 to d70c21e.
Force-pushed 4338987 to 013d23c.
One bummer about this testing is that it takes 3 seconds due to the hard-coded heartbeat interval in the rpc context. I've typed out code to plumb a setting through that makes the test much shorter. Is that worth it?
Force-pushed 013d23c to 72aeb7f.
I've typed out code to plumb a setting through which enables making the test much shorter. Is that worth it?
Yeah, I think so. (For example, we'll get many more iterations during nightly stress that way).
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
pkg/rpc/nodedialer/nodedialer.go, line 142 at r2 (raw file):
```go
defer func() {
	// Enforce a minimum interval between warnings for failed connections.
	if err != nil && err != ctx.Err() && breaker.ShouldLog() {
```
is this the right condition? If the ctx times out while trying to connect, we'll get some gRPC error back (wrapping a context error) and this will trivially be true. I'd go with
```go
if err != nil && ctx.Err() == nil && breaker.ShouldLog() { ... }
```
pkg/rpc/nodedialer/nodedialer_test.go, line 53 at r2 (raw file):
```go
_, err := nd.Dial(ctx, 1)
assert.Nil(t, err, "failed to dial")
assert.True(t, breaker.Ready())
```
Would the breaker trip right away after one attempt? Want to also assert breaker.Failures == 0?
pkg/rpc/nodedialer/nodedialer_test.go, line 172 at r2 (raw file):
becomes
Reviewed 3 of 3 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)
Force-pushed 72aeb7f to 4c9ad66.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/nodedialer/nodedialer.go, line 142 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
is this the right condition? If the ctx times out while trying to connect, we'll get some gRPC error back (wrapping a context error) and this will trivially be true. I'd go with
```go
if err != nil && ctx.Err() == nil && breaker.ShouldLog() { ... }
```
Done.
pkg/rpc/nodedialer/nodedialer_test.go, line 53 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Would the breaker trip right away after one attempt? Want to also assert breaker.Failures == 0?
Done. Our threshold is 1 failure for tripping.
Honestly this test was more to make sure that the testing infrastructure I set up wasn't busted, it doesn't really exercise much beyond that.
pkg/rpc/nodedialer/nodedialer_test.go, line 172 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
becomes
Done.
I still need to plumb the config.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)
Force-pushed 4c9ad66 to ba67b06.
Okay, added a second commit which shortens the test.
In cockroachdb#34026 we added logic to ensure that the context was not cancelled before calling in to GRPCDial. This change extends that to also avoid calling breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context cancellations which race with calls to Dial (from, say, DistSQL). This behavior was observed after a node died due to an unrelated corruption bug. It appears that this node failure triggered a context cancellation, which then tripped a breaker, which then led a different flow to fail, which then led to another cancellation, which seems to have then tripped another breaker. The evidence for this exact series of events is somewhat scant, but we do know for certain that we saw breakers tripped due to context cancellation, which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share some copy-pasted code which was becoming burdensome. Lastly, this change adds a bunch of unit testing to the nodedialer package.

Release note: None
This enables dramatically shortening the runtime of the nodedialer tests from over 3s to less than 50ms.

Release note: None
Force-pushed ba67b06 to 2d0c3bf.
Reviewed 2 of 2 files at r3, 3 of 3 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained
I ran it in stress and stressrace for a bit and it seems happy.

bors r+
Build failed
bors r+
35350: README: remove links to gitter r=bdarnell a=bdarnell

We're no longer providing support via this channel and want to direct users to the forum instead.

35433: rpc/nodedialer: avoid tripping breaker on context errors r=ajwerner a=ajwerner

In #34026 we added logic to ensure that the context was not cancelled before calling in to GRPCDial. This change extends that to also avoid calling breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context cancellations which race with calls to Dial (from, say, DistSQL). This behavior was observed after a node died due to an unrelated corruption bug. It appears that this node failure triggered a context cancellation, which then tripped a breaker, which then led a different flow to fail, which then led to another cancellation, which seems to have then tripped another breaker. The evidence for this exact series of events is somewhat scant, but we do know for certain that we saw breakers tripped due to context cancellation, which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share some copy-pasted code which was becoming burdensome.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Build succeeded
This PR backports the idea behind cockroachdb#35433 to release-2.0. Release note: None
This commit backports the intention behind cockroachdb#35433 to release v2.1.
In #34026 we added logic to ensure that the context was not cancelled before
calling in to GRPCDial. This change extends that to also avoid calling
breaker.Fail if the returned error was context.Canceled.
The motivation for this is an observation that breakers can trip due to context
cancellations which race with calls to Dial (from, say, DistSQL). This behavior
was observed after a node died due to an unrelated corruption bug. It appears
that this node failure triggered a context cancellation, which then tripped a
breaker, which then led a different flow to fail, which then led to another
cancellation, which seems to have then tripped another breaker. The evidence for
this exact series of events is somewhat scant, but we do know for certain that
we saw breakers tripped due to context cancellation, which seems wrong.
This change also cosmetically refactors DialInternalClient and Dial to share
some copy-pasted code which was becoming burdensome.
Release note: None