-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
teamcity: failed tests on master: acceptance/bank/node-restart #29149
Comments
This is reproducible with the new |
This is the same underlying issue as #30613 (comment). |
In trying to reproduce / debug #29678, I'm instead hitting this bug. I'm adding some instrumentation to verify the error is occurring when distsql is attempt to dial a connection. @tschottdorf It looks to me that |
The instrumentation shows that the error is being generated from:
This roachtest is frequently stopping and restarting nodes. I'm guessing we're trying to schedule a flow on a recently restarted node. |
Note that for the @jordanlewis or @RaduBerinde Does DistSQL take into consideration node health during planning? I don't see any reference to |
We do, see cockroach/pkg/sql/distsql_physical_planner.go Line 764 in b2bd8e8
|
Yeah that looks right. Don't think this follows some system, see: cockroach/pkg/rpc/nodedialer/nodedialer.go Lines 147 to 148 in 4e80b4a
|
@tschottdorf Yep, that is exactly the discrepancy. Interestingly, if I move distsql in the other direction and add a |
@RaduBerinde Thanks. I missed that. I'll enable some of that vlogging to try and find out why distsql is attempting to talk to a node that just restarted but which has an open breaker. I think it might be possible for |
@tschottdorf I'm going to try the following diff: --- a/pkg/rpc/nodedialer/nodedialer.go
+++ b/pkg/rpc/nodedialer/nodedialer.go
@@ -144,8 +144,10 @@ func (n *Dialer) DialInternalClient(
if err != nil {
return nil, nil, err
}
- // TODO(bdarnell): Reconcile the different health checks and circuit
- // breaker behavior in this file
+ // TODO(bdarnell): Reconcile the different health checks and circuit breaker
+ // behavior in this file. Marking the breaker as open here is a stop-gap
+ // before that reconciliation.
+ n.getBreaker(nodeID).Success()
if err := grpcutil.ConnectionReady(conn); err != nil {
return nil, nil, err
} |
Hmm, took 85 runs, but the |
FYI, I'm trying to reproduce with some additional instrumentation which will make it clearer when this |
My additional instrumentation is showing that my theory above seems correct:
So one goroutine (the first log message) is using So now I'm confused as to why my diff above didn't fix the issue. Or perhaps it fixed one issue but there is another issue that has the same symptoms. I wonder if it would be better to fix this from the other end and have distsql not plan processors on nodes where the connection breaker is open. @tschottdorf, @RaduBerinde Thoughts? Oh, looking at the circuit breaker code just now and the problem might be calling |
@petermattis it seems like the right thing for DistSQL to do is to check the breaker when determining node health, and to retry on breaker errors (i.e. not return them to clients) which can still occur if the breaker flips after DistSQL has determined that it thinks the node is healthy. Then this test would still expect errors that occur during DistSQL execution to bubble up because I don't think those are handled correctly (according to Jordan's post). I think what shoves all of these DistSQL problems under the rug is this (which the bank test also calls):
|
@tschottdorf Is that the right thing to do in the short term? The breaker is open incorrectly as another part of the system disregarded the breaker and successfully opened a connection. Seems to me that the breaker should be reset when that occurs. Medium to long term we should see about unifying the usage of breakers across all dial attempts, but I'm very nervous about making that change now. I'm currently running with this diff which is similar to the one above except --- a/pkg/rpc/nodedialer/nodedialer.go
+++ b/pkg/rpc/nodedialer/nodedialer.go
@@ -144,8 +144,10 @@ func (n *Dialer) DialInternalClient(
if err != nil {
return nil, nil, err
}
- // TODO(bdarnell): Reconcile the different health checks and circuit
- // breaker behavior in this file
+ // TODO(bdarnell): Reconcile the different health checks and circuit breaker
+ // behavior in this file. Marking the breaker as open here is a stop-gap
+ // before that reconciliation.
+ n.getBreaker(nodeID).Reset()
if err := grpcutil.ConnectionReady(conn); err != nil {
return nil, nil, err
}
@jordanlewis Making DistSQL resilient to node failure seems like something we give more priority to. This feels like a decent size project. |
This looks good (since |
|
Ugh, just got a
I think my patch above helped and this is another symptom of the problem. Investigating. |
Ok, got another repro. Here's the timeline:
My suspicion as to what is happening is that distsql planning is seeing an open conn to
Looks like this code originated in 3e2446f. @RaduBerinde do you remember why you included |
I don't think I had a good reason, I was probably being conservative. |
Ack. I added more instrumentation that was supposed to highlight this as the problem, but so far that instrumentation has not pointed a finger at this code. So I'm adding even more instrumentation to see which nodes distsql is planning on. |
Ok, my additional instrumentation is pointing to a problem with considering
An interesting bit about this failure is that the breaker was opened before distsql planning even began:
So perhaps @tschottdorf's suggestion to not plan on nodes with an open breaker is the right approach here. That's straightforward to implement. |
`Dialer.DialInternalClient` does not check the circuit breaker but blindly attempts a connection and can succeed, leaving the system in a state where there is a healthy connection to a node, but the circuit breaker used for dialing is open. DistSQL checks for connection health when scheduling processors, but the connection health check does not examine the breaker. So DistSQL will proceed to schedule a processor on a node but then be unable to use the connection to that node because `Dialer.Dial` will return with a `breaker open` error. The code contains a TODO to reconcile the handling of circuit breakers in the various `Dialer` methods, but changing the handling is risky in the short term. As a stop-gap, we reset the breaker after a connection is successfully opened. Fixes cockroachdb#29149 Release note: None
30987: sql,rpc/nodedialer: improve distsql node health checks r=tschottdorf a=petermattis Improve distsql node health checks so that the presence of an open circuit breaker is consider. Previously it was possible for distsql to plan a processor on a node with an open circuit breaker which ensured an "unable to dial" error when the plan was run. Fixes #29149 Fixes #28704 Release note: None Co-authored-by: Peter Mattis <[email protected]>
`Dialer.DialInternalClient` does not check the circuit breaker but blindly attempts a connection and can succeed, leaving the system in a state where there is a healthy connection to a node, but the circuit breaker used for dialing is open. DistSQL checks for connection health when scheduling processors, but the connection health check does not examine the breaker. So DistSQL will proceed to schedule a processor on a node but then be unable to use the connection to that node because `Dialer.Dial` will return with a `breaker open` error. The code contains a TODO to reconcile the handling of circuit breakers in the various `Dialer` methods, but changing the handling is risky in the short term. As a stop-gap, we reset the breaker after a connection is successfully opened. Fixes cockroachdb#29149 Release note: None
The following tests appear to have failed:
#864595:
Please assign, take a look and update the issue accordingly.
The text was updated successfully, but these errors were encountered: