sql,rpc/nodedialer: improve distsql node health checks #30987
`Dialer.DialInternalClient` does not check the circuit breaker but blindly attempts a connection and can succeed, leaving the system in a state where there is a healthy connection to a node, but the circuit breaker used for dialing is open. DistSQL checks for connection health when scheduling processors, but the connection health check does not examine the breaker. So DistSQL will proceed to schedule a processor on a node but then be unable to use the connection to that node because `Dialer.Dial` will return with a `breaker open` error.

The code contains a TODO to reconcile the handling of circuit breakers in the various `Dialer` methods, but changing the handling is risky in the short term. As a stop-gap, we reset the breaker after a connection is successfully opened.

Fixes cockroachdb#29149

Release note: None
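The mismatch described in this commit message can be reproduced with a toy model. The sketch below is not the `nodedialer` code: `toyBreaker`, `toyDialer`, `dial`, `dialInternal`, and `resetOnSuccess` are hypothetical names chosen for illustration, and the breaker is a bare boolean rather than a real circuit breaker library. It only shows how one dial path that ignores the breaker can leave the other path failing, and how resetting the breaker after a successful connection avoids that.

```go
package main

import (
	"errors"
	"fmt"
)

// errBreakerOpen stands in for the "breaker open" error mentioned above.
var errBreakerOpen = errors.New("breaker open")

// toyBreaker is a hypothetical, minimal circuit breaker.
type toyBreaker struct{ open bool }

func (b *toyBreaker) Trip()  { b.open = true }
func (b *toyBreaker) Reset() { b.open = false }

// toyDialer is a hypothetical dialer with one breaker per node.
type toyDialer struct {
	breakers       map[int]*toyBreaker
	reachable      map[int]bool // whether a connection attempt would succeed
	resetOnSuccess bool         // models the stop-gap described above
}

func newToyDialer(resetOnSuccess bool) *toyDialer {
	return &toyDialer{
		breakers:       map[int]*toyBreaker{},
		reachable:      map[int]bool{},
		resetOnSuccess: resetOnSuccess,
	}
}

func (d *toyDialer) getBreaker(nodeID int) *toyBreaker {
	b, ok := d.breakers[nodeID]
	if !ok {
		b = &toyBreaker{}
		d.breakers[nodeID] = b
	}
	return b
}

// dial consults the breaker before connecting and trips it on failure.
func (d *toyDialer) dial(nodeID int) error {
	b := d.getBreaker(nodeID)
	if b.open {
		return errBreakerOpen
	}
	if !d.reachable[nodeID] {
		b.Trip()
		return errors.New("unable to dial")
	}
	return nil
}

// dialInternal never consults the breaker. With resetOnSuccess set, a
// successful connection closes the breaker again.
func (d *toyDialer) dialInternal(nodeID int) error {
	if !d.reachable[nodeID] {
		return errors.New("unable to dial")
	}
	if d.resetOnSuccess {
		d.getBreaker(nodeID).Reset()
	}
	return nil
}

func main() {
	for _, reset := range []bool{false, true} {
		d := newToyDialer(reset)
		_ = d.dial(1)         // node 1 is down: the breaker trips
		d.reachable[1] = true // node 1 comes back
		_ = d.dialInternal(1) // succeeds without consulting the breaker
		fmt.Printf("resetOnSuccess=%t: dial error = %v\n", reset, d.dial(1))
	}
}
```

Without the reset, the second `dial` still fails even though a connection was just opened; with it, both paths agree again.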
Also closes #28704, right? This is great - thanks for taking it on.
@jordanlewis Yep.
Reviewed 1 of 1 files at r1, 6 of 6 files at r2, 3 of 3 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/rpc/nodedialer/nodedialer.go, line 153 at r1 (raw file):
// RPCs fail when dial fails due to an open breaker. Reset the breaker here
// as a stop-gap before the reconciliation occurs.
n.getBreaker(nodeID).Reset()
Thoughts about moving this below `ConnectionReady`? What does that even do?
Change `DistSQLPlanner.checkNodeHealth` so that it uses `nodedialer.Dialer.ConnHealth` instead of `rpc.Context.ConnHealth`. The former is the right method to call to check a node's connection health.

Refactor `DistSQLPlanner.checkNodeHealth` into a `distSQLNodeHealth` struct. This removes the need for `DistSQLPlannerTestingKnobs`.

Enhance `nodedialer.Dialer.ConnHealth` to mark connections as unhealthy if the circuit breaker is open. This prevents DistSQL from planning processors on such nodes.

Release note: None
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/rpc/nodedialer/nodedialer.go, line 153 at r1 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Thoughts about moving this below `ConnectionReady`? What does that even do?
Done. `ConnectionReady` checks to see if the connection is in the "transient failure" state. I've added a comment about why it is useful (though I'm not 100% sure about whether that scenario can happen, I think we tear down connections really quickly when a heartbeat fails).
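For readers unfamiliar with the check being discussed, here is a rough sketch under stated assumptions. The `connState` type is a local stand-in that mirrors the names of gRPC's connectivity states so the example has no external dependencies, and `connectionReady` is a hypothetical function, not the actual helper. It assumes, per the comment above, that only the transient failure state is rejected.

```go
package main

import "fmt"

// connState mirrors the names of gRPC's connectivity states. It is a local
// stand-in, not the gRPC type.
type connState int

const (
	idle connState = iota
	connecting
	ready
	transientFailure
	shutdown
)

// connectionReady is a hypothetical check that rejects a connection only when
// it is in the transient failure state.
func connectionReady(s connState) error {
	if s == transientFailure {
		return fmt.Errorf("connection not ready: state %d", s)
	}
	return nil
}

func main() {
	for _, s := range []connState{idle, connecting, ready, transientFailure, shutdown} {
		fmt.Printf("state %d: err = %v\n", s, connectionReady(s))
	}
}
```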
0e8fa32 to 569aa8e
Reviewed 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
bors r=tschottdorf
30987: sql,rpc/nodedialer: improve distsql node health checks r=tschottdorf a=petermattis

Improve distsql node health checks so that the presence of an open circuit breaker is considered. Previously it was possible for distsql to plan a processor on a node with an open circuit breaker, which ensured an "unable to dial" error when the plan was run.

Fixes #29149
Fixes #28704

Release note: None

Co-authored-by: Peter Mattis <[email protected]>
Build succeeded
Improve distsql node health checks so that the presence of an open
circuit breaker is considered. Previously it was possible for distsql to
plan a processor on a node with an open circuit breaker, which ensured an
"unable to dial" error when the plan was run.
Fixes #29149
Fixes #28704
Release note: None