Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc/nodedialer: check context for cancellation before dialing #34026

Conversation

ajwerner
Copy link
Contributor

Before this PR we would sometimes call in to the nodedialer with a canceled
context and inadvertently end up recording a failure into the circuit breaker.
This behavior led to some amount of noise when debugging #31361.

Release note: None

Before this PR we would sometimes call in to the nodedialer with a canceled
context and inadvertently end up recording a failure into the circuit breaker.
This behavior led to some amount of noise when debugging cockroachdb#31361.

Release note: None
@ajwerner ajwerner requested review from bdarnell, andreimatei and a team January 15, 2019 18:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That explains a thing or two.

@ajwerner
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Jan 15, 2019
34026: rpc/nodedialer: check context for cancellation before dialing r=ajwerner a=ajwerner

Before this PR we would sometimes call in to the nodedialer with a canceled
context and inadvertently end up recording a failure into the circuit breaker.
This behavior led to some amount of noise when debugging #31361.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: nice find!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@craig
Copy link
Contributor

craig bot commented Jan 15, 2019

Build succeeded

@craig craig bot merged commit 8636a0c into cockroachdb:master Jan 15, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 5, 2019
Marked DNM because Testing is coming, nodedialer currently has no testing and
I'm going to change that. Just want to get this change out there to start the
conversation and make myself more visibly accountable.

In cockroachdb#34026 we added logic to ensure that the context was not canclled before
calling in to GRPCDial. This change extends that to also avoid calling
breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context
cancellation which race with calls to Dial (from say DistSQL). This behavior
was observed after a node died due to an unrelated corruption bug. It appears
that this node failure triggered a context cancellation which then tripped a
breaker which then lead to a different flow to fail which then lead to another
cancellation which seems to have then tripped another breaker. The evidence for
this exact serioes of events is somewhat scant but we do know for certain that
we saw breakers tripped due to context cancelled which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share
some copy-pasted code which was becoming burdensome.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 7, 2019
In cockroachdb#34026 we added logic to ensure that the context was not canclled before
calling in to GRPCDial. This change extends that to also avoid calling
breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context
cancellation which race with calls to Dial (from say DistSQL). This behavior
was observed after a node died due to an unrelated corruption bug. It appears
that this node failure triggered a context cancellation which then tripped a
breaker which then lead to a different flow to fail which then lead to another
cancellation which seems to have then tripped another breaker. The evidence for
this exact serioes of events is somewhat scant but we do know for certain that
we saw breakers tripped due to context cancelled which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share
some copy-pasted code which was becoming burdensome.

Lastly this change adds a bunch of unit testing to the nodedialer package.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 8, 2019
In cockroachdb#34026 we added logic to ensure that the context was not canclled before
calling in to GRPCDial. This change extends that to also avoid calling
breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context
cancellation which race with calls to Dial (from say DistSQL). This behavior
was observed after a node died due to an unrelated corruption bug. It appears
that this node failure triggered a context cancellation which then tripped a
breaker which then lead to a different flow to fail which then lead to another
cancellation which seems to have then tripped another breaker. The evidence for
this exact serioes of events is somewhat scant but we do know for certain that
we saw breakers tripped due to context cancelled which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share
some copy-pasted code which was becoming burdensome.

Lastly this change adds a bunch of unit testing to the nodedialer package.

Release note: None
craig bot pushed a commit that referenced this pull request Mar 8, 2019
35350: README: remove links to gitter r=bdarnell a=bdarnell

We're no longer providing support via this channel and want to direct users to the forum instead.

35433: rpc/nodedialer: avoid tripping breaker on context errors r=ajwerner a=ajwerner

In #34026 we added logic to ensure that the context was not canclled before
calling in to GRPCDial. This change extends that to also avoid calling
breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context
cancellation which race with calls to Dial (from say DistSQL). This behavior
was observed after a node died due to an unrelated corruption bug. It appears
that this node failure triggered a context cancellation which then tripped a
breaker which then lead to a different flow to fail which then lead to another
cancellation which seems to have then tripped another breaker. The evidence for
this exact serioes of events is somewhat scant but we do know for certain that
we saw breakers tripped due to context cancelled which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share
some copy-pasted code which was becoming burdensome.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants