-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc/nodedialer: check context for cancellation before dialing #34026
rpc/nodedialer: check context for cancellation before dialing #34026
Conversation
Before this PR we would sometimes call in to the nodedialer with a canceled context and inadvertently end up recording a failure into the circuit breaker. This behavior led to some amount of noise when debugging cockroachdb#31361. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Reviewable status: complete! 0 of 0 LGTMs obtained
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That explains a thing or two.
bors r+ |
34026: rpc/nodedialer: check context for cancellation before dialing r=ajwerner a=ajwerner Before this PR we would sometimes call in to the nodedialer with a canceled context and inadvertently end up recording a failure into the circuit breaker. This behavior led to some amount of noise when debugging #31361. Release note: None Co-authored-by: Andrew Werner <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained
Build succeeded |
Marked DNM because Testing is coming, nodedialer currently has no testing and I'm going to change that. Just want to get this change out there to start the conversation and make myself more visibly accountable. In cockroachdb#34026 we added logic to ensure that the context was not canclled before calling in to GRPCDial. This change extends that to also avoid calling breaker.Fail if the returned error was context.Canceled. The motivation for this is an observation that breakers can trip due to context cancellation which race with calls to Dial (from say DistSQL). This behavior was observed after a node died due to an unrelated corruption bug. It appears that this node failure triggered a context cancellation which then tripped a breaker which then lead to a different flow to fail which then lead to another cancellation which seems to have then tripped another breaker. The evidence for this exact serioes of events is somewhat scant but we do know for certain that we saw breakers tripped due to context cancelled which seems wrong. ``` ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled ``` This change also cosmetically refactors DialInternalClient and Dial to share some copy-pasted code which was becoming burdensome. Release note: None
In cockroachdb#34026 we added logic to ensure that the context was not canclled before calling in to GRPCDial. This change extends that to also avoid calling breaker.Fail if the returned error was context.Canceled. The motivation for this is an observation that breakers can trip due to context cancellation which race with calls to Dial (from say DistSQL). This behavior was observed after a node died due to an unrelated corruption bug. It appears that this node failure triggered a context cancellation which then tripped a breaker which then lead to a different flow to fail which then lead to another cancellation which seems to have then tripped another breaker. The evidence for this exact serioes of events is somewhat scant but we do know for certain that we saw breakers tripped due to context cancelled which seems wrong. ``` ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled ``` This change also cosmetically refactors DialInternalClient and Dial to share some copy-pasted code which was becoming burdensome. Lastly this change adds a bunch of unit testing to the nodedialer package. Release note: None
In cockroachdb#34026 we added logic to ensure that the context was not canclled before calling in to GRPCDial. This change extends that to also avoid calling breaker.Fail if the returned error was context.Canceled. The motivation for this is an observation that breakers can trip due to context cancellation which race with calls to Dial (from say DistSQL). This behavior was observed after a node died due to an unrelated corruption bug. It appears that this node failure triggered a context cancellation which then tripped a breaker which then lead to a different flow to fail which then lead to another cancellation which seems to have then tripped another breaker. The evidence for this exact serioes of events is somewhat scant but we do know for certain that we saw breakers tripped due to context cancelled which seems wrong. ``` ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled ``` This change also cosmetically refactors DialInternalClient and Dial to share some copy-pasted code which was becoming burdensome. Lastly this change adds a bunch of unit testing to the nodedialer package. Release note: None
35350: README: remove links to gitter r=bdarnell a=bdarnell We're no longer providing support via this channel and want to direct users to the forum instead. 35433: rpc/nodedialer: avoid tripping breaker on context errors r=ajwerner a=ajwerner In #34026 we added logic to ensure that the context was not canclled before calling in to GRPCDial. This change extends that to also avoid calling breaker.Fail if the returned error was context.Canceled. The motivation for this is an observation that breakers can trip due to context cancellation which race with calls to Dial (from say DistSQL). This behavior was observed after a node died due to an unrelated corruption bug. It appears that this node failure triggered a context cancellation which then tripped a breaker which then lead to a different flow to fail which then lead to another cancellation which seems to have then tripped another breaker. The evidence for this exact serioes of events is somewhat scant but we do know for certain that we saw breakers tripped due to context cancelled which seems wrong. ``` ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled ``` This change also cosmetically refactors DialInternalClient and Dial to share some copy-pasted code which was becoming burdensome. Release note: None Co-authored-by: Ben Darnell <[email protected]> Co-authored-by: Andrew Werner <[email protected]>
Before this PR we would sometimes call in to the nodedialer with a canceled
context and inadvertently end up recording a failure into the circuit breaker.
This behavior led to some amount of noise when debugging #31361.
Release note: None