autopilot: cluster can fail to recover from outages when min_quorum isn't set. #8118
cc @preetapan @i0rek @banks @mkeeler @kyhavlov @pierresouchay @pearkes who are subscribed to #4017, but I cannot comment there because the issue is locked (which is slightly counterproductive for retrospectively pointing out there that a change broke something).
For extra clarity, the sequence leading to the reported failure is:
In this state (with …), I have not yet tested whether the situation resolves if/when …
Hey @nh2, thanks for the detailed report! I think you've done a great job of digging into the changes and isolating the commit that seems to break your test. I think there are still a few nuances that are not super clear yet that we could work through. The main thing that's still unclear to me here is exactly why the commit you highlight caused this behaviour change.
Can you confirm: you actually built Consul from the commit right before that one and saw the tests reliably pass, then rebuilt from that commit and saw them reliably fail, right? Do they fail every single run, or just some large percentage?

The confusing part is exactly how that commit changes this behaviour. It would seem like the only explanation is that for some reason your test was never triggering dead server cleanup before and now it reliably is, but the actual logic that changed shouldn't make a difference in your case. That leaves timing issues about when pruning runs, but nothing changed there, so simple non-determinism would not explain behaviour that reliably never failed before and always fails now.

The intended change of that commit is that rather than blocking all removals if we would remove more than the fault tolerance allowed, we instead remove up to the limit and then refuse to remove the rest. In your case, though, the number of failures tolerated is always 1, so under both the before and after logic it shouldn't make a difference.

Would you be able to post the server logs during the tests in a gist? Autopilot logs when it decides not to remove things because it would violate quorum, so if my guess is right you should see that log in the before case (on the leader) but not after that commit, and it will help narrow down what the real problem is.

How should this work
The docs talk about a "new server", but in practice a server -- whether it's brand new or just coming back from a reboot -- is not really "up" until raft is started, the on-disk snapshot has been loaded, logs replayed, and replication started and caught up with the leader. Until that point, the server is considered "down", which is why in your test, which crashes the second server before the first is in a fully healthy state, you lose quorum - you now have 2 of 3 in a non-healthy state.

In your issue NixOS/nixpkgs#90613 you identified that waiting for raft to be healthy fixes the test, and that's because that is the correct way to do a rolling restart. As we document in https://www.consul.io/docs/upgrading#standard-upgrades, just because a server process has started doesn't make it healthy, and therefore it's not safe to take down the next node until you've observed those critical raft health signals.

So I'd say the fix for your test is already known - it's the one you showed in the issue above - but I'd still like to understand why this commit changed behaviour here, even if your test could be considered to have been "getting lucky" before.
I'm not sure I follow this. Until the server is healthy and can rejoin as a member, it can't be part of the quorum, so by definition there must be a quorum among the other nodes during that time; otherwise you have exceeded the failure tolerance of the system.
As far as I can see, provided each server is allowed to rejoin and get to a healthy follower status in raft as documented, there is no HA issue here. I would love to understand more about why this commit caused the observed behaviour change, though - looking at this, there are several things that seem like they could be improved, it's just not yet clear that there is an actual bug.
Oh also:
The not-recovering part is worth understanding more, too. There are some cases where losing quorum simply can't be recovered from without operator intervention, short of violating consistency. But this doesn't seem like it should be one of those cases. Again, logs and the output of `consul operator raft list-peers` and `consul members` while in this state would help us debug why that is the case. Also, when you say "never recovers", how long did you observe the servers remaining in this state?
Hey @banks, thanks for looking at my issue.
Yes, that is what
Every run I've made so far (I've done 5 on the commit before and 5 on the commit after to verify).
Yes, the test conveniently logs the consul output of all machines. In fact it logs all output, including that of the test driver (so that you can see e.g. when a machine crash or restart is issued) and the …

Attaching the logs (with …):

When it runs through successfully, the entire test completes in ~5 minutes on my laptop. I've cut off the non-recovering, non-terminating test after ~8 minutes so that the log doesn't get too long. For diffing, it may be useful to strip away the timestamps and PIDs and look at only …
I suspect this from the
This does not appear in the failing log, where there is only
Yes. And it's expected that Consul will refuse to work while 2 of 3 are unhealthy; the issue I have is that it does not seem to allow moving into the "1 of 3 unhealthy" state unless the third server also comes back up.
That's right, but note that the test isn't waiting for the process to be started, but for
I have a question about this: The standard-upgrades docs you linked only state
but they do not state how to programmatically determine what counts as "healthy". So far I thought that
1 hour (that's when the NixOS tests time out, see the
To the second part: Yes, and according to my current understanding, this should not need operator intervention. To make it clearer, let me ask directly what I should expect for this situation:
What should happen now? Should Consul get into 2-of-3 consensus and thus resume serving requests, or should it continue to show `No cluster leader`?

And I'm stating that before the commit, it did not require operator intervention.
Thanks for that info. Concentrating on the specifics of the change here.
As you said, this confirms that prior to this commit Autopilot (AP) was just not removing the dead server at all. The next question is: why would that affect the timing of the rebooted server rejoining and getting healthy? After digging around I think I understand better:
So because it's still a voter when the next server is crashed, it can participate in the election and the cluster stays up. Note that it's still not "stable" at this point. Since your test data set is tiny, has few writes, and there are a few seconds between each event, this was enough that your tests always passed before, but it was at least technically possible that Serf would see the server as re-joined some time before it had actually completed replaying snapshots etc.

The old bug

So before the commit the logic was incorrect thanks to an integer-truncation off-by-one: c47dbff#diff-41b69736174d4cff3a831dfa52061cc1L237 We are checking that …

Recoverability
Great example - yes, that is what I'd expect to happen with pure raft. Ideally it's what Consul would do too, but there is a subtle issue caused by Autopilot here that can cause this case if you don't set `min_quorum`.

With only 3 servers, split brain can't happen, so there is no safety issue. But there is a liveness one. After the first server is lost, AP will prune it (at least it will since that commit), leaving 2 raft voters, which still means a quorum size of 2. If the leader fails next, the single follower will not be able to get enough votes to become leader, so the cluster is down. If the follower fails next, then the existing leader will eventually step down. Even if dead server pruning runs before the leader's raft loop detects the failure and steps down, it won't be able to commit the raft config change to remove the second server without a quorum of 2, and so the second server is never actually removed from the raft config. So there can be no more "split brain" writes even if you partition the dead servers and bring them back up. But what can happen is that, depending on the order servers come back, you can end up with a stuck cluster.

In the above example, if the second server to fail (the one still in the raft config) comes back up, then all is well - raft continues, and the third server can rejoin and will become a full voter again after the stabilisation period. But if the first server to crash comes back (the one that was removed from raft by Autopilot), then it can't rejoin the cluster, since there is no current leader who can commit a new raft config to add it back into the raft cluster. So this is the state your tests get into now - 2 of 3 servers are up, but only one of the 2 servers that are in the current raft config is up, so there is no quorum to be able to re-add the third.

This is easy to reproduce locally:
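(A rough sketch of one way to reproduce the sequence described above - not a verbatim transcript - assuming a Linux machine where 127.0.0.2 and 127.0.0.3 are usable loopback addresses and `consul` is on the PATH:)

```sh
# Start a 3-server cluster, one agent per loopback address.
for i in 1 2 3; do
  consul agent -server -bootstrap-expect=3 \
    -node="s$i" -bind="127.0.0.$i" -client="127.0.0.$i" \
    -data-dir="/tmp/consul-s$i" \
    -retry-join=127.0.0.1 -retry-join=127.0.0.2 -retry-join=127.0.0.3 &
done

# 1. Kill one follower (simulating a crash) and wait for dead server cleanup;
#    `consul operator raft list-peers` shows it drop out of the raft config.
# 2. Kill whichever remaining server is the leader.
# 3. Restart only the first server (the one Autopilot removed). It cannot be
#    re-added without a leader, so the cluster stays at "No cluster leader"
#    even though 2 of the 3 servers are running.
```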
The solution to this is exactly what we added `min_quorum` for. If you set `min_quorum = 3`, Autopilot will never prune the raft configuration below 3 voters, so the dead server stays in the config and the cluster can recover as soon as any 2 of the 3 servers are back up.

Summary of findings
Action Items
@nh2 I've renamed the issue so that we can track the proposed action items above meaningfully now that the issue is understood - I hope that's OK.
@banks Thanks for your detailed and well-structured reply. I understand it, and the summary and action items are great. I have nothing to add.

… indeed.

(Of course I would prefer it if this method could provide a simple yes/no answer, computed from the current state and Consul's config, instead of the user having to compute it themselves by counting and knowing the configured value of …)
Done by setting `autopilot.min_quorum = 3`. Technically, this would have been required to keep the test correct ever since Consul's "autopilot" "Dead Server Cleanup" was enabled by default (I believe that was in Consul 0.8). Practically, the issue only occurred with our NixOS test with releases >= `1.7.0-beta2` (see NixOS#90613). The setting itself is available since Consul 1.6.2.

However, this setting was not documented clearly enough for anybody to notice, and only the upstream issue hashicorp/consul#8118 I filed brought that to light.

As explained there, the test could also have been made to pass by applying the more correct rolling reboot procedure

```diff
-m.wait_until_succeeds("[ $(consul members | grep -o alive | wc -l) == 5 ]")
+m.wait_until_succeeds(
+    "[ $(consul operator raft list-peers | grep true | wc -l) == 3 ]"
+)
```

but we also intend to test that Consul can regain consensus even if the quorum gets temporarily broken.
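(For readers outside NixOS, a minimal sketch of the equivalent plain Consul server configuration change. The 3-server cluster size and the `/etc/consul.d` config directory are assumptions for illustration, not taken from the test.)

```sh
# Sketch only: add a config fragment so dead server cleanup never prunes
# the voter set below 3, letting a rebooted server always be re-added.
cat > /etc/consul.d/autopilot.json <<'EOF'
{
  "autopilot": {
    "min_quorum": 3
  }
}
EOF
```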
This is a really good request. We sort of have this already in an API endpoint that for some reason is not currently exposed in any of the CLI commands. See https://www.consul.io/api-docs/operator/autopilot#read-health. If you were to curl that on any server and check for …

But we are thinking about how to expose more information on health during an upgrade via API and CLI generally, so we will hopefully improve the UX here soon!
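(A concrete sketch of that check, assuming the default HTTP API address 127.0.0.1:8500 and that `jq` is available; the `Healthy` and `FailureTolerance` fields come from the autopilot health endpoint linked above.)

```sh
# Gate the next rolling reboot on autopilot health: exit 0 only when the
# cluster reports itself healthy and can tolerate at least one more failure.
curl -s http://127.0.0.1:8500/v1/operator/autopilot/health \
  | jq -e '.Healthy == true and .FailureTolerance >= 1'
```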
What I find a bit confusing is that this is specific to autopilot, while https://www.consul.io/docs/upgrading#standard-upgrades suggests that the concept of "healthy" exists independently of whether you use autopilot. Does this endpoint also exist/work when autopilot is disabled? (I personally am using autopilot, so it wouldn't be a problem for me to use it, but I imagine others might have it disabled.)
Yeah, it's a bit confusing, since those instructions pre-dated autopilot and were not specific enough. Autopilot was built specifically to address some of the operational difficulties around assessing cluster health and performing certain operations, which is why it's now the best place to find this information, but we didn't update those docs.
Short answer: yes. The longer answer is that it's not super clear what you mean by "autopilot disabled". Autopilot is not just one feature. The main one in OSS is the …
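(For reference, a sketch of how to inspect and tune the relevant autopilot feature from the CLI; the flag shown only disables dead server cleanup, the behaviour discussed in this issue, and leaves the rest of autopilot in place.)

```sh
# Show the current autopilot configuration (including CleanupDeadServers).
consul operator autopilot get-config

# Disable dead server cleanup only; the health endpoint keeps working.
consul operator autopilot set-config -cleanup-dead-servers=false
```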
With the recent autopilot overhaul there is a new API for getting more information about the current state as known to autopilot. The docs aren't live for the new API yet (but will be once 1.9.0-beta3 or the final 1.9.0 version is released). For now those docs are here in the code: https://github.com/hashicorp/consul/blob/master/website/pages/api-docs/operator/autopilot.mdx#read-the-autopilot-state

For the specific scenarios mentioned in this issue, I am not sure it would provide a ton more value over using the existing v1/operator/autopilot/health API. Even with those updates we still need to update the documentation to make it clearer how to use the …
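(For completeness, a sketch of querying that newer state API, assuming a Consul 1.9+ server and the default HTTP address; the field names follow the docs linked above.)

```sh
# The autopilot state endpoint reports overall health plus per-server detail
# (voter status, last raft index, etc.) in a single response.
curl -s http://127.0.0.1:8500/v1/operator/autopilot/state \
  | jq '{Healthy, FailureTolerance, Voters}'
```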
That would be really great.
I co-maintain the `consul` package in the NixOS Linux distribution. We have an automated VM-based test here to check that consul works. The test starts 3 consul servers, and reboots each of them in a rolling fashion, checking that consensus is re-established when they come back up:

This test started failing between consul `1.7.0-beta1` and `1.7.0-beta2`.

`server1` reboots and Consul works fine then. But as soon as the second server (`server2`) goes down, Consul returns `No cluster leader`, and never recovers from it, even though the other 2 servers are up. It fails reproducibly.

Analysis

As far as I can tell, the reason is that after its reboot, `server1` is not accepted back as a voter. Then, when `server2` goes down, only 2 servers are left, of which one is not a voter, so they cannot elect a new leader.

With the failing `1.7.0-beta2` and newer, I observe that just after the reboot of `server1`, its `Status` is `left`:

This was at first surprising to me, because I expected it to be `failed`, given that `.crash()` from the test hard-crashes the VM. Further, the default options for servers (as explained on the options docs) are:

- `leave_on_terminate` - defaults to `false`
- `skip_leave_on_interrupt` - defaults to `true`

so even if the server shut down cleanly instead of hard-crashing, I'd expect the server to go `failed`, not `left`.

This shows that `server1` is `Voter` `false`, and never changes back from it.

(As soon as `server2` goes down, `consul operator raft list-peers` stops working and also returns `No cluster leader`, and you have to use `consul operator raft list-peers -stale=true` to see the output.)

Observation: Rolling-rebooting "more slowly" fixes it

When performing the reboots by hand, instead of with the automated test script, I noticed that that works fine:

The `Status` `left` eventually (after around half a minute) turns back into `alive` in `consul members`, and the `Voter` `false` eventually turns into `true` in `consul operator raft list-peers`.

But of course that is not a solution, because in my production environment, servers may crash and reboot unexpectedly, without "waiting" for other servers to crash. I expect Consul to tolerate arbitrary such crashes and come back without manual intervention, as it did before `1.7.0-beta2`.

Git bisect

I read the changelog of v1.7.0-beta2 and the compare link for `v1.7.0-beta1...v1.7.0-beta2`.

From that, I identified the listed PR #4017 as the only possibly relevant change.

A `git bisect` between the two versions confirmed my guess; the commit that introduced the problem is c47dbff, part of the above PR, with description:

Click to expand full `git bisect` log

Suspicion: Autopilot

The Autopilot article explains:

"Dead server cleanup" explains why I see `left` instead of `failed` in `consul members`.

"Server stabilization time" seems to explain the main problem: that the rebooted server is `Voter` `false` for a while, and that the script reboots `server2` within less than 10 seconds of `server1` being back, thus not allowing `server1` to be promoted back to `Voter` before that.

How should this work?

While the autopilot docs explain what happens, I still consider it bugged:

- `1.7.0-beta1` was a HA system that tolerated arbitrary reboots and would eventually recover back to consensus. For newer versions, that is no longer the case.
- `leave_on_terminate` and `skip_leave_on_interrupt` should really mention that autopilot's auto-leaving-by-others (which is enabled by default) exists, and that thus whatever you configure for these does not last for more than 200 ms (the default value for `LastContactThreshold`).

What can be done about this?