Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc/nodedialer: avoid tripping breaker on context errors #35433

Merged

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Mar 5, 2019

In #34026 we added logic to ensure that the context was not canclled before
calling in to GRPCDial. This change extends that to also avoid calling
breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context
cancellation which race with calls to Dial (from say DistSQL). This behavior
was observed after a node died due to an unrelated corruption bug. It appears
that this node failure triggered a context cancellation which then tripped a
breaker which then lead to a different flow to fail which then lead to another
cancellation which seems to have then tripped another breaker. The evidence for
this exact serioes of events is somewhat scant but we do know for certain that
we saw breakers tripped due to context cancelled which seems wrong.

ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled

This change also cosmetically refactors DialInternalClient and Dial to share
some copy-pasted code which was becoming burdensome.

Release note: None

@ajwerner ajwerner requested a review from a team March 5, 2019 21:57
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/rpc/nodedialer/nodedialer.go, line 153 at r1 (raw file):

		// that a context cancellation might imply something is amiss with the
		// connection it is not a clear enough signal. We will however let
		// ErrDeadlineExceeded trip the breaker.

Why? context.DeadlineExceeded just means that the incoming context has a deadline or timeout. Why would that inform on the node's status? We don't want this code to trip the breaker:

ctx, cancel := context.WithTimeout(ctx, time.Second)
time.Sleep(990*time.Millisecond)
dialer.Dial(ctx, ...)

Maybe the right thing to do here is to only trip the breaker on gRPC errors (status.FromError(x), probably in conjunction with

func isBenign(err error) bool {
return causer.Visit(err, func(err error) bool {
_, ok := err.(*benignError)
return ok
})
}
to unwind causers should they be used somewhere between grpc and this callsite).

Then the code here would read

if isGRPCError(err) && ctx.Err() == nil {
...
}

We should make sure that our gRPC calls internally uses a sane connection timeout (I don't think the incoming ctx generally has one, so that's not a new want).

Taking a step back, I'm less confident that we need to grep out the grpc-originating errors because what other errors could there be, and wouldn't we want to circuit break on them as well?

@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch 2 times, most recently from 1adec05 to 9aef6cc Compare March 6, 2019 15:52
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look. I'm going to write some tests now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 153 at r1 (raw file):

Taking a step back, I'm less confident that we need to grep out the grpc-originating errors because what other errors could there be, and wouldn't we want to circuit break on them as well?

I agree with this idea. I've updated the diff to just record into the breaker if ctx.Err() == nil

There are a few other errors that could arise but roughly all of them relate to the node shutting down. We could get context errors from the underlying rpc.Context's masterCtx which gets cancelled upon shutdown. Furthermore we could just generally get errors from the stopper itself. None of those are really very interesting though because the node is already shutting down.

@tbg
Copy link
Member

tbg commented Mar 6, 2019 via email

@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch from 9aef6cc to f93818b Compare March 6, 2019 21:53
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing added, RFAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

@ajwerner ajwerner changed the title [DNM] rpc/nodedialer: avoid tripping context on context.Canceled rpc/nodedialer: avoid tripping context on context.Canceled Mar 6, 2019
@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch from f93818b to 3f47307 Compare March 6, 2019 21:57
@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch 2 times, most recently from cb5e70b to 2e722cf Compare March 6, 2019 23:04
@ajwerner ajwerner changed the title rpc/nodedialer: avoid tripping context on context.Canceled rpc/nodedialer: avoid tripping breaker on context cancellation Mar 7, 2019
@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch 2 times, most recently from 9eb6986 to d70c21e Compare March 7, 2019 03:05
@ajwerner ajwerner changed the title rpc/nodedialer: avoid tripping breaker on context cancellation rpc/nodedialer: avoid tripping breaker on context errors Mar 7, 2019
@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch 2 times, most recently from 4338987 to 013d23c Compare March 7, 2019 15:13
@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 7, 2019

One bummer about this testing is that it takes 3 seconds due to the hard-coded heartbeat interval in the rpc context. I've typed out code to plumb a setting through which enables making the test much shorter. Is that worth it?

@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch from 013d23c to 72aeb7f Compare March 7, 2019 15:56
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've typed out code to plumb a setting through which enables making the test much shorter. Is that worth it?

Yeah, I think so. (For example, we'll get many more iterations during nightly stress that way).

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 142 at r2 (raw file):

	defer func() {
		// Enforce a minimum interval between warnings for failed connections.
		if err != nil && err != ctx.Err() && breaker.ShouldLog() {

is this the right condition? If the ctx times out while trying to connect, we'll get some gRPC error back (wrapping a context error) and this will trivially be true. I'd go with if err != nil && ctx.Err() == nil && breaker.ShouldLog() { ... }


pkg/rpc/nodedialer/nodedialer_test.go, line 53 at r2 (raw file):

	_, err := nd.Dial(ctx, 1)
	assert.Nil(t, err, "failed to dial")
	assert.True(t, breaker.Ready())

Would the breaker trip right away after one attempt? Want to also assert breaker.Failures == 0?


pkg/rpc/nodedialer/nodedialer_test.go, line 172 at r2 (raw file):

becomes

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner)

@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch from 72aeb7f to 4c9ad66 Compare March 7, 2019 16:33
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 142 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

is this the right condition? If the ctx times out while trying to connect, we'll get some gRPC error back (wrapping a context error) and this will trivially be true. I'd go with if err != nil && ctx.Err() == nil && breaker.ShouldLog() { ... }

Done.


pkg/rpc/nodedialer/nodedialer_test.go, line 53 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Would the breaker trip right away after one attempt? Want to also assert breaker.Failures == 0?

Done. Our threshold is 1 failure for tripping.

Honestly this test was more to make sure that the testing infrastructure I set up wasn't busted, it doesn't really exercise much beyond that.


pkg/rpc/nodedialer/nodedialer_test.go, line 172 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

becomes

Done.

Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to plumb the config.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)

@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch from 4c9ad66 to ba67b06 Compare March 7, 2019 22:00
@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 7, 2019

Okay, added a second commit which shortens the test.

In cockroachdb#34026 we added logic to ensure that the context was not canclled before
calling in to GRPCDial. This change extends that to also avoid calling
breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context
cancellation which race with calls to Dial (from say DistSQL). This behavior
was observed after a node died due to an unrelated corruption bug. It appears
that this node failure triggered a context cancellation which then tripped a
breaker which then lead to a different flow to fail which then lead to another
cancellation which seems to have then tripped another breaker. The evidence for
this exact serioes of events is somewhat scant but we do know for certain that
we saw breakers tripped due to context cancelled which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share
some copy-pasted code which was becoming burdensome.

Lastly this change adds a bunch of unit testing to the nodedialer package.

Release note: None
This enables dramatically shortening the runtime of the nodedialer tests from
over 3s to less than 50ms.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/dont-break-circuits-if-cancelled branch from ba67b06 to 2d0c3bf Compare March 8, 2019 02:30
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

I hope the short heartbeat interval doesn't interfere in the tests in which you don't expect things to turn unhealthy.

Reviewed 2 of 2 files at r3, 3 of 3 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 8, 2019

I hope the short heartbeat interval doesn't interfere in the tests in which you don't expect things to turn unhealthy.

I ran it in stress and stressrace for a bit and it seems happy.

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 8, 2019

Build failed

@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 8, 2019

bors r+

craig bot pushed a commit that referenced this pull request Mar 8, 2019
35350: README: remove links to gitter r=bdarnell a=bdarnell

We're no longer providing support via this channel and want to direct users to the forum instead.

35433: rpc/nodedialer: avoid tripping breaker on context errors r=ajwerner a=ajwerner

In #34026 we added logic to ensure that the context was not canclled before
calling in to GRPCDial. This change extends that to also avoid calling
breaker.Fail if the returned error was context.Canceled.

The motivation for this is an observation that breakers can trip due to context
cancellation which race with calls to Dial (from say DistSQL). This behavior
was observed after a node died due to an unrelated corruption bug. It appears
that this node failure triggered a context cancellation which then tripped a
breaker which then lead to a different flow to fail which then lead to another
cancellation which seems to have then tripped another breaker. The evidence for
this exact serioes of events is somewhat scant but we do know for certain that
we saw breakers tripped due to context cancelled which seems wrong.

```
ip-172-31-44-174> I190305 14:53:47.387863 150672 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n6] circuitbreaker: rpc [::]:26257->2 tripped: failed to grpc dial n2 at ip-172-31-34-81:26257: context canceled
```

This change also cosmetically refactors DialInternalClient and Dial to share
some copy-pasted code which was becoming burdensome.

Release note: None

Co-authored-by: Ben Darnell <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@craig
Copy link
Contributor

craig bot commented Mar 8, 2019

Build succeeded

@craig craig bot merged commit 2d0c3bf into cockroachdb:master Mar 8, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 11, 2019
This PR backports the idea behind cockroachdb#35433 to release-2.0.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 20, 2019
This commit backports the intention behind cockroachdb#35433 to release v2.1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants