acceptance: localcluster nodes sometimes fail to get a NodeID assigned to them #10409
I noticed something interesting about this failure mode by examining the node logs produced by the failing tests and comparing them to those produced by passing tests. In failing tests, the nodes seem to have a lot of complaints in the logs about missing ranges or insufficient nodes:
and
and, at the end of the test,
Passing tests do not contain any log messages like these. Perhaps there is an infrequently-triggered bug in cluster startup?
I think Peter just fixed this.
I can reproduce it on current
Here's a run that failed after @petermattis merged #10440: https://teamcity.cockroachdb.com/viewLog.html?buildId=41450&buildTypeId=Cockroach_UnitTests_Acceptance&tab=buildLog
Okay, digging a little bit deeper I notice that on failing runs one of the nodes fails to get assigned a node id. Later in the logs of that node, we see the following connection errors:
Seems like some kind of issue with docker networking, perhaps?
How are you reproducing it? Is it specific to Azure? Why is gossip not retrying? Is it failing on retry? That line is only logged on the first attempt, AFAICT.
I reproduced it by running the acceptance tests again on a TeamCity agent. I haven't tried reproducing it outside of Azure, but the logs I'm looking at were captured from one of the runs linked above. It does seem like gossip gets retried, and it also seems like the node thinks it's connected via gossip:
Here is a gist containing the full node log: https://gist.github.com/jordanlewis/919306a9c66fa882d93e40cb2e0fd489
Just confirmed this is reproducible outside of Azure. I reproduced it on the first try on my dev machine.
Running into this as well here.
I reproduced the failure with
Have there been any changes to
The most recent change that might apply is a477ff1. It looks like we try the first node, get a timeout, then try the second one and get redirected back to the first one. But there's never any indication that we either make a second attempt or return an error to the caller. I think something may be getting stuck because the first request is still pending when it's told to retry.
I haven't yet tracked down why the raft log size is that large. Or are you asking why the truncation was occurring? We truncate whenever the raft log size is larger than the range size.
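To make that rule concrete, here is a minimal sketch of the truncation heuristic described above. The type and field names are illustrative, not the actual cockroachdb code, which reads these sizes from the replica's in-memory stats:

```go
package main

import "fmt"

// rangeStatus captures the two sizes the truncation heuristic compares.
type rangeStatus struct {
	raftLogBytes int64 // approximate size of the untruncated raft log
	rangeBytes   int64 // logical size of the data in the range
}

// shouldTruncate mirrors the rule described above: truncate whenever the
// raft log has grown larger than the range itself, at which point sending a
// snapshot is cheaper than replaying the log.
func shouldTruncate(s rangeStatus) bool {
	return s.raftLogBytes > s.rangeBytes
}

func main() {
	fmt.Println(shouldTruncate(rangeStatus{raftLogBytes: 12 << 20, rangeBytes: 8 << 20})) // true
	fmt.Println(shouldTruncate(rangeStatus{raftLogBytes: 1 << 20, rangeBytes: 8 << 20}))  // false
}
```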
Yes, there was an active snapshot for range 2 that was denying the snapshot for range 1. The range 2 snapshot was not completing because range 1 was below quorum: it had been up-replicated to 2 replicas, but the 2nd replica was behind, so no operations on range 1 could proceed. Hence my suggestion that we never delay snapshots for range 1. Another thought is that we shouldn't up-replicate from 1 to 2 replicas until there is a 3rd node operational in the cluster. That wouldn't fix this problem, but it seems like good practice.

I still need to track down why the raft log size is so large that we're performing a truncation. And I'm not clear why we're allowing the truncation there in any case; we just added a second node.

@bdarnell Not sure if you've been following along, but I would appreciate your thoughts here.
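As an aside, a minimal sketch of the "don't up-replicate from 1 to 2 replicas until a 3rd node is up" suggestion; the function and its inputs are hypothetical, not the real allocator API:

```go
package main

import "fmt"

// shouldUpReplicate sketches the suggestion above: don't grow a range from
// 1 to 2 replicas until a 3rd node is available, because a 2-replica group
// needs both replicas for quorum and therefore stalls while the new, empty
// replica catches up.
func shouldUpReplicate(currentReplicas, liveNodes int) bool {
	if currentReplicas == 1 && liveNodes < 3 {
		return false // wait for a 3rd node before leaving a single replica
	}
	return liveNodes > currentReplicas
}

func main() {
	fmt.Println(shouldUpReplicate(1, 2)) // false
	fmt.Println(shouldUpReplicate(1, 3)) // true
	fmt.Println(shouldUpReplicate(2, 3)) // true
}
```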
Isn't the truncation happening between the preemptive snapshot and the conf change?
No, the truncation is happening after the conf change. Before the conf change completes, we prevent any truncation from occurring using the
Oops, I was mistaken. The truncation is happening before the conf change. The current

The raft log size is exceeding the range size because it is filled with
I would argue that it should be changed; the deadlock here is obvious. It's either that, or we always need to allow snapshot generation for range 1.
I'm not sure it solves the problem, though. I think there would still be a window after committing the conf change where the raft log could be truncated before the new replica is brought up to date.
I'm not sure always allowing snapshot generation for range 1 solves the problem, either (though it certainly solves it for the current crop of flaky tests). I think we would need to allow snapshot generation for any range containing meta1 or meta2 keys. That's easy enough to detect from the range descriptor. Allowing more than one snapshot to be generated at a time doesn't solve the problem either: whatever you set N to, you can imagine scenarios where that N is consumed and we still need to snapshot a meta range to make progress.
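For illustration, a minimal sketch of the range-descriptor check mentioned above; the descriptor type and the meta keyspace boundary are simplified stand-ins, not the actual roachpb/keys definitions:

```go
package main

import (
	"bytes"
	"fmt"
)

// rangeDescriptor is a simplified stand-in for the range descriptor; only
// the key bounds matter for this check.
type rangeDescriptor struct {
	StartKey, EndKey []byte
}

// metaMax is a stand-in for the upper bound of the meta1/meta2 addressing
// keyspace, which sorts before all user data.
var metaMax = []byte("\x04")

// containsMetaKeys reports whether any part of the range lies in the meta
// keyspace, i.e. whether its start key sorts below metaMax. Under the
// proposal above, such ranges would always be allowed to generate snapshots.
func containsMetaKeys(desc rangeDescriptor) bool {
	return bytes.Compare(desc.StartKey, metaMax) < 0
}

func main() {
	fmt.Println(containsMetaKeys(rangeDescriptor{StartKey: []byte("\x02meta1"), EndKey: []byte("\x03")})) // true
	fmt.Println(containsMetaKeys(rangeDescriptor{StartKey: []byte("\x04user"), EndKey: []byte("\x05")}))  // false
}
```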
We don't call
Preserve the raft log, no matter how large it grows, during up-replication. Fixes cockroachdb#10409
Compute the index at which a quorum of nodes have committed instead of using raft.Status.Commit. The latter can be in advance of the computed quorum commit index just after a replica has been added to the group. And by computing the value ourselves we can include the pending snapshot index in the computation. Fixes cockroachdb#10409
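A minimal sketch of the quorum-commit computation described in that commit message; the function shape and the way the pending snapshot index is folded in are assumptions for illustration, not the actual cockroachdb change:

```go
package main

import (
	"fmt"
	"sort"
)

// quorumCommitIndex returns the highest log index that a quorum of replicas
// has reached. matchIndexes holds each replica's acknowledged (Match) index;
// pendingSnapshotIndex, if non-zero, represents a replica that is still
// being caught up via snapshot and is counted as having reached that index.
func quorumCommitIndex(matchIndexes []uint64, pendingSnapshotIndex uint64) uint64 {
	indexes := append([]uint64(nil), matchIndexes...)
	if pendingSnapshotIndex != 0 {
		indexes = append(indexes, pendingSnapshotIndex)
	}
	if len(indexes) == 0 {
		return 0
	}
	// Sort descending; the quorum-th highest value is the largest index that
	// a majority of replicas has reached.
	sort.Slice(indexes, func(i, j int) bool { return indexes[i] > indexes[j] })
	quorum := len(indexes)/2 + 1
	return indexes[quorum-1]
}

func main() {
	// A range just up-replicated from 1 to 2 replicas: raft's Status.Commit
	// may still report 100 (committed when one replica was a quorum), but a
	// quorum of the two current replicas has only reached index 0.
	fmt.Println(quorumCommitIndex([]uint64{100, 0}, 0)) // 0
	// Folding in a pending snapshot at index 40 for the new replica.
	fmt.Println(quorumCommitIndex([]uint64{100}, 40)) // 40
}
```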
Previously we were releasing the snapshot (i.e. calling `Replica.CloseOutSnap()`) when the ChangeReplicas operation completed. Now we release the snapshot as soon as the remote node has applied it. This is important to allow other ranges to make progress which might be required for the current ChangeReplicas operation to complete. Fixes cockroachdb#10483 See cockroachdb#10409
Previously we were releasing the snapshot (i.e. calling `Replica.CloseOutSnap()`) when the ChangeReplicas operation completed. Now we release the snapshot as soon as the remote node has applied it. This is important to allow other ranges to make progress which might be required for the current ChangeReplicas operation to complete. Fixes cockroachdb#10483 Fixes cockroachdb#10306 See cockroachdb#10409
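A minimal sketch of the ordering change described in that commit message; the types and helper functions are illustrative placeholders (only `Replica.CloseOutSnap()` is from the original text):

```go
package main

import "fmt"

// outgoingSnapshot is a simplified stand-in for the per-store outgoing
// snapshot slot discussed above; only one may be open at a time, so holding
// it too long can block snapshots for other ranges.
type outgoingSnapshot struct{ released bool }

// release frees the slot; it is safe to call more than once.
func (s *outgoingSnapshot) release() {
	if !s.released {
		s.released = true
		fmt.Println("snapshot slot released")
	}
}

// changeReplicas sketches the ordering change: the snapshot slot is released
// as soon as the remote node has applied the snapshot, rather than being
// held until the whole replica change has finished.
func changeReplicas() {
	snap := &outgoingSnapshot{}
	defer snap.release() // safety net on early return

	sendSnapshot(snap) // returns once the remote node reports the snapshot applied
	snap.release()     // free the slot now; other ranges may need it for this change to finish

	commitReplicaChange()
}

func sendSnapshot(*outgoingSnapshot) { fmt.Println("snapshot sent and applied by remote") }
func commitReplicaChange()           { fmt.Println("replica change committed") }

func main() { changeReplicas() }
```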
The following tests appear to have failed:
#40720:
Please assign, take a look and update the issue accordingly.