
autopilot: cluster can fail to recover from outages when min_quorum isn't set. #8118

Open
nh2 opened this issue Jun 16, 2020 · 13 comments
Labels
theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics type/docs Documentation needs to be created/updated/clarified

Comments

@nh2

nh2 commented Jun 16, 2020

I co-maintain the consul package in the NixOS Linux distribution. We have an automated VM-based test here to check that consul works.

The test starts 3 consul servers, and reboots each of them in a rolling fashion, checking that consensus is re-established when they come back up:

# Test that the cluster can tolerate failures of any single server:
for server in [server1, server2, server3]:
    server.crash()
    # wait for consul KV operations to work again
    server.start()
    # ...
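
For concreteness, the "wait for consul KV operations to work again" step could look roughly like the sketch below, which polls the agent's HTTP KV API. This is illustrative only: the address 127.0.0.1:8500 and the key name test/ping are assumptions, not taken from the actual NixOS test.

import time
import urllib.request

def wait_for_kv(timeout=120.0):
    # Hypothetical helper: retry a KV write until Consul has a leader again.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = urllib.request.Request(
            "http://127.0.0.1:8500/v1/kv/test/ping", data=b"pong", method="PUT"
        )
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                if resp.read().strip() == b"true":  # KV write accepted
                    return
        except OSError:
            pass  # "No cluster leader" (HTTP 500) or agent unreachable; retry
        time.sleep(2)
    raise TimeoutError("consul KV did not recover in time")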

This test started failing between consul 1.7.0-beta1 and 1.7.0-beta2.

server1 reboots and Consul continues to work fine afterwards. But as soon as the second server (server2) goes down, Consul returns No cluster leader and never recovers from it, even though the other 2 servers are up. The failure is reproducible.

Analysis

As far as I can tell, the reason is that after its reboot, server1 is not accepted back as a voter. Then, when server2 goes down, only 2 servers are left, of which one is not a voter, so they cannot elect a new leader.

With the failing 1.7.0-beta2 and newer, I observe that just after the reboot of server1, its Status is left:

# consul members
Node     Address           Status  Type    Build        Protocol  DC   Segment
server1  192.168.1.1:8301  left    server  1.7.0-beta1  2         dc1  <all>
server2  192.168.1.2:8301  alive   server  1.7.0-beta1  2         dc1  <all>
server3  192.168.1.3:8301  alive   server  1.7.0-beta1  2         dc1  <all>

This was at first surprising to me, because I expected it to be failed, given that .crash() from the test hard-crashes the VM. Further, given the default server options for leave_on_terminate and skip_leave_on_interrupt (as explained on the options docs), even if the server had shut down cleanly instead of hard-crashing, I'd expect the server to go failed, not left.

# consul operator raft list-peers
Node     ID                                    Address           State     Voter  RaftProtocol
server3  420a2da0-4547-d06c-d4a5-8afb1a2fbe8d  192.168.1.3:8300  leader    true   3
server2  1cb012d8-4dca-783e-5292-9febf80c08de  192.168.1.2:8300  follower  true   3
server1  a071fbe3-b707-139c-0353-f276d248f104  192.168.1.1:8300  follower  false  3

This shows that server1 is Voter false, and never changes back to true.

(As soon as server2 goes down, consul operator raft list-peers stops working and also returns No cluster leader, and you have to use consul operator raft list-peers -stale=true to see the output.)

Observation: Rolling-rebooting "more slowly" fixes it

When performing the reboots by hand, instead of with the automated test script, I noticed that it works fine:

The Status left eventually (after around half a minute) turns back into alive in consul members, and the Voter false eventually turns into true in consul operator raft list-peers.

But of course that is not a solution, because in my production environment, servers may crash and reboot unexpectedly, without "waiting" for other crashed servers to recover first. I expect Consul to tolerate arbitrary such crashes and come back without manual intervention, as it did before 1.7.0-beta2.

Git bisect

I read the changelog of v1.7.0-beta2 and the compare link for v1.7.0-beta1...v1.7.0-beta2.

From that, I identified the listed PR #4017 as the only possibly relevant change.

A git bisect between the two versions confirmed my guess; the commit that introduced the problem is c47dbff, part of the above PR, with the description:

autopilot: fix dead server removal condition to use correct failure tolerance (#4017)
Click to expand full git bisect log
$ git bisect log
git bisect start
# bad: [e01f6913516b4b852da54dad42e8a194c756cfd5] Release v1.7.0-beta2
git bisect bad e01f6913516b4b852da54dad42e8a194c756cfd5
# good: [dcab9358318ce8ea64939d323f8d291fb39178af] Release v1.7.0-beta1
git bisect good dcab9358318ce8ea64939d323f8d291fb39178af
# bad: [77e789a4f4fdb1fc4e6268a074bb7360f10f3605] ui: Add optional name="" attribute so you can name slots with it (#6740)
git bisect bad 77e789a4f4fdb1fc4e6268a074bb7360f10f3605
# good: [1080be2087786fed04f3c4afca0d25539b27a189] Merge pull request #6891 from hashicorp/helm-docs-dec5
git bisect good 1080be2087786fed04f3c4afca0d25539b27a189
# bad: [83c84d4e7e3ca06b2143aa7939b31bcde785cd66] coverage: disable comment and project status, set informational mode (#6954)
git bisect bad 83c84d4e7e3ca06b2143aa7939b31bcde785cd66
# good: [f03153f571feb7cccdbc100fb8438778e449abe3] Merge pull request #6805 from hashicorp/issue-6804-sysctl-path
git bisect good f03153f571feb7cccdbc100fb8438778e449abe3
# good: [4f5d5020b8c37bd8c802164f410d728a16329f4b] dns: fix memoryleak by upgrading outdated miekg/dns (#6748)
git bisect good 4f5d5020b8c37bd8c802164f410d728a16329f4b
# bad: [c47dbffe1c5464d8aa9794e58687d894f1e50df9] autopilot: fix dead server removal condition to use correct failure tolerance (#4017)
git bisect bad c47dbffe1c5464d8aa9794e58687d894f1e50df9
# first bad commit: [c47dbffe1c5464d8aa9794e58687d894f1e50df9] autopilot: fix dead server removal condition to use correct failure tolerance (#4017)

Suspicion: Autopilot

The Autopilot article explains:

Server stabilization time
When a new server is added to the datacenter, there is a waiting period where it must be healthy and stable for a certain amount of time before being promoted to a full, voting member. This is defined by autopilot's ServerStabilizationTime parameter and by default is 10 seconds.

Dead server cleanup
Autopilot helps prevent these kinds of outages by quickly removing failed servers as soon as a replacement Consul server comes online. When servers are removed by the cleanup process they will enter the "left" state.

"Dead server cleanup" explains why I see left instead of failed in consul members.

"Server stabilization time" seems to explain the main problem, that the rebooted server is Voter false for a while, and that the script reboots server2 within less than 10 seconds of server1 being back, thus not allowing server1 to be promoted back to Voter before that.

How should this work?

While the autopilot docs explain what happens, I still consider it bugged:

  • Consul <= 1.7.0-beta1 was an HA system that tolerated arbitrary reboots and would eventually recover back to consensus. For newer versions, that is no longer the case.
  • The autopilot's logic in "Server stabilization time" makes sense when you replace one (decommissioned) server by another, but it does not seem to make sense to me to apply that when a machine just comes back from reboot, and no replacement server tried to take its place.
    • It seems that it should not require a quorate vote to allow a just-rebooted server to vote again, otherwise any temporary break of quorum will turn into a permanent break of quorum. That also includes just rebooting all servers in parallel.
  • Side issues:
    • The docs of leave_on_terminate and skip_leave_on_interrupt should really mention that autopilot's auto-leaving-by-others (which is enabled by default) exists, and that thus whatever you configure for these does not last for more than 200 ms (the default value for LastContactThreshold).

What can be done about this?

@nh2
Author

nh2 commented Jun 16, 2020

cc @preetapan @i0rek @banks @mkeeler @kyhavlov @pierresouchay @pearkes who are subscribed to #4017; I cannot comment there because the issue is locked (which is slightly counterproductive when one wants to point out retrospectively that a change broke something).

@nh2
Author

nh2 commented Jun 16, 2020

For extra clarity, the sequence leading to the reported failure is:

  • Crash and reboot server1 until it's back.
  • Crash server2.

In this state (with server2 crashed), Consul will stop operating and not recover, even though 2 of 3 servers are up.

I have not tested yet whether the situation resolves if/when server2 comes back up, but I don't think that should matter, because Consul should recover with 2 of 3 servers up.

@banks
Member

banks commented Jun 16, 2020

Hey @nh2, thanks for the detailed report!

I think you've done a great job of digging into the changes and isolating the commit that seems to break your test.

I think there are still a few nuances that are not super clear yet that we could work through.

The main thing that's still unclear to me here is exactly why the commit you highlight caused this behaviour change.

the commit that introduced the problem is c47dbff

Can you confirm: you actually built Consul from the commit right before that one and saw the tests reliably pass, then rebuilt from that commit and saw them reliably fail, right? Do they fail every single run or just some large percentage?

The confusing part is exactly how that commit changes this behaviour. It would seem like the only explanation is that for some reason your test was never triggering dead server cleanup before and now it reliably is, but the actual logic that changed shouldn't make a difference in your case. That leaves timing issues about when pruning runs, but nothing changed there, so simple non-determinism would not cause it to reliably never fail before and always fail now.

The intended change of that commit is that rather than blocking all removals if we would remove more than the fault tolerance allows, we instead remove up to the limit and then refuse to remove the rest. In your case, though, the number of failures tolerated is always 1 under both the before and after logic, so it shouldn't make a difference.

Would you be able to post the server logs during the tests in a gist? Autopilot logs when it decides not to remove things because it would violate quorum, so if my guess is right you should see that log in the before case (on the leader) but not after that commit, which will help narrow down what the real problem is.

How should this work

The autopilot's logic in "Server stabilization time" makes sense when you replace one (decommissioned) server by another, but it does not seem to make sense to me to apply that when a machine just comes back from reboot, and no replacement server tried to take its place.

The docs talk about a "new server" but in practice a server -- whether it's brand new or just coming back from a reboot -- is not really "up" until raft is started, the on-disk snapshot has been loaded, logs have been replayed, and replication has started and caught up with the leader. Until that point, the server is considered "down", which is why in your test, which crashes the second server before the first is in a fully healthy state, you lose quorum - you now have 2 of 3 in a non-healthy state.

In your issue NixOS/nixpkgs#90613 you identified that waiting for raft to be healthy fixes the test and that's because that is the correct way to do a rolling restart. As we document in https://www.consul.io/docs/upgrading#standard-upgrades, just because a server process has started doesn't make it healthy and therefore it's not safe to take down the next node until you've observed those critical raft health signals.

So I'd say the fix for your test is already known - it's the one you showed in the issue above - but I'd still like to understand why this commit changed behaviour here, even if your test could be considered to have been "getting lucky" before.

It seems that it should not require a quorate vote to allow a just-rebooted server to vote again, otherwise any temporary break of quorum will turn into a permanent break of quorum.

I'm not sure I follow this. Until the server is healthy and can rejoin as a member, it can't be part of the quorum, so by definition there must be a quorum among the other nodes during that time, otherwise you have exceeded the failure tolerance of the system.

Consul <= 1.7.0-beta1 was a HA system that tolerated arbitrary reboots and would eventually recover back to consensus. For newer versions, that is no longer the case.

As far as I can see, provided each server is allowed to rejoin and get to a healthy follower status in raft as documented, there is no HA issue here.

I would love to understand more about why this commit caused the observed behaviour change though - looking at this, there are several things that seem like they could be improved, it's just not yet clear that there is an actual bug.

@banks
Member

banks commented Jun 16, 2020

Oh also:

Consul returns No cluster leader, and never recovers from it

The not-recovering part is worth understanding more too. There are some cases where a loss of quorum simply can't be recovered from without operator intervention, at least not without violating consistency. But this doesn't seem like it should be the case here.

Again logs and output of raft list-peers and consul members while in this state would help us debug why that is the case.

Also, when you say "never recovers", how long did you observe the servers remaining in this state?

@nh2
Author

nh2 commented Jun 16, 2020

Hey @banks, thanks for looking at my issue.

Can you confirm: you actually built Consul from the commit right before that one and saw tests reliably pass then rebuilt from that commit and saw them reliably fail right?

Yes, that is what git bisect does, with its log in the expandable section of the issue description. Concretely for the commit and the one right before it:

# good: [4f5d5020b8c37bd8c802164f410d728a16329f4b] dns: fix memoryleak by upgrading outdated miekg/dns (#6748)
# bad: [c47dbffe1c5464d8aa9794e58687d894f1e50df9] autopilot: fix dead server removal condition to use correct failure tolerance (#4017)

Do they fail every single run or just some large percentage?

Every run I've made so far (I've done 5 on the commit before and 5 on the commit after to verify).

Would you be able to post the server logs during the tests in a gist?

Yes, the test conveniently logs the consul output of all machines. In fact it logs all output, including that of the test driver (so that you can see e.g. when a machine crash or restart is issued) and each machine's journalctl. You may want to filter for consul[ to get just the consul output.

Attaching the logs (with log_level = "debug"; the identified commit has [DEBUG] autopilot messages):

When it runs through successfully, the entire test completes in ~5 minutes on my laptop. I've cut off the non-recovering, non-terminating test after ~8 minutes so that the log doesn't get too long.

For diffing, it may be useful to strip away the timestamps and PIDs, look at only consul messages, and stable-sort by machine name; on a shell with process substitution, I use this for a graphical diff:

cleanup(){ cat "$1" | grep '^server' | grep 'consul\[' | perl -pe 's/^(.*? # )(.*? \[)(.*)$/\1\[\3/g' | perl -pe 's/\[\s*[\d\.]+]//g' | sort --stable --key=1.1,2.0; }
meld <(cleanup consul-log-successful-on-commit-4f5d502.txt) <(cleanup consul-log-failing-on-commit-c47dbffe1.txt)

The intended change of that commit is that rather than blocking all removals if we would remove more than the fault tolerance allows, we instead remove up to the limit and then refuse to remove the rest. In your case, though, the number of failures tolerated is always 1 under both the before and after logic, so it shouldn't make a difference.

I suspect this from the successful log is relevant:

autopilot: Failed to remove dead servers: too many dead servers: 1/3

This does not appear in the failing log, where there is only autopilot: Attempting removal of failed server node "server1".


which is why in your test, which crashes the second server before the first is in a fully healthy state, you lose quorum - you now have 2 of 3 in a non-healthy state.

Yes. And it's expected that Consul will refuse to work while 2 of 3 are unhealthy; the issue I have is that it does not seem to allow moving back into the "1 of 3 unhealthy" state unless the third server also comes back up.

just because a server process has started doesn't make it healthy

That's right, but note that the test isn't waiting for the process to be started, but for consul members to show alive for all nodes.

waiting for raft to be healthy fixes the test and that's because that is the correct way to do a rolling restart.

I have a question about this: The standard-upgrades docs you linked only state

Wait until the server is healthy and has rejoined the cluster before moving on to the next server.

but they do not state how to programmatically determine what counts as "healthy".

So far I had thought that consul members showing alive means healthy, but apparently I'm wrong. You're suggesting that checking for Voter true in consul operator raft list-peers is the right method, but (a) how is one supposed to know that, (b) that command has only existed since mid-2016 (#2312) and it seems odd that there was no way to follow the upgrade procedure before, and (c) it is still crude: as a user scripting it, I'd still have to make my own calculations of how many servers are non-Voters and what is allowed, as opposed to just asking consul whether it is healthy.
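
For illustration, the "crude" calculation I mean would look something like the rough sketch below (not what the NixOS test currently does), parsing the list-peers output shown earlier in this issue:

import subprocess

def count_raft_voters():
    # Parse `consul operator raft list-peers`; the "Voter" column is the 5th field
    # (Node, ID, Address, State, Voter, RaftProtocol).
    out = subprocess.run(
        ["consul", "operator", "raft", "list-peers"],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = [line.split() for line in out.splitlines()[1:] if line.strip()]
    return sum(1 for fields in rows if fields[4] == "true")

# e.g. only reboot the next server once count_raft_voters() == 3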

Also, when you say "never recovers", how long did you observe the servers remaining in this state?

1 hour (that's when the NixOS tests time out, see the timed out after 3600 seconds in this example).

I'm not sure I follow this. Until the server is healthy and can rejoin as a member, it can't be part of the quorum, so by definition there must be a quorum among the other nodes during that time, otherwise you have exceeded the failure tolerance of the system.
+
There are some cases where a loss of quorum simply can't be recovered from without operator intervention, at least not without violating consistency. But this doesn't seem like it should be the case here.

To the second part: Yes, and according to my current understanding, this should not need operator intervention.

To make it clearer, let me ask directly what I should expect for this situation:

  • A cluster of 3 Consul working servers.
  • 2 machines die.
  • Now I get No cluster leader, that's expected, as we're 1-of-3.
  • 1 dead machine comes back after 2 minutes of downtime, the other one stays down.

What should happen now? Should Consul get into 2-of-3 consensus and thus resume serving requests, or should it continue to show No cluster leader and require operator intervention?

And I'm stating that before the commit, it did not require operator intervention.

@banks
Member

banks commented Jun 17, 2020

Thanks for that info.

Concentrating on the specifics of the change here.

autopilot: Failed to remove dead servers: too many dead servers: 1/3

As you said, this confirms that prior to this commit Autopilot (AP) was just not removing the dead server at all.

The next question is why that would affect the timing of the rebooted server rejoining and getting healthy. After digging around, I think I understand better:

  1. If the server is not removed then the leader will continue to maintain its state in the autopilot OperatorHealthReply.Servers list. Importantly, it also does not remove the Peer from raft.
  2. Whenever the leader probes follower health (which it does every 2 seconds) it will check whether they are healthy (as in, raft is running and up to date) and if so mark them as HEALTHY:
    // If this is a new server or the health changed, reset StableSince
    lastHealth := a.GetServerHealth(server.ID)
    if lastHealth == nil || lastHealth.Healthy != health.Healthy {
        health.StableSince = time.Now()
    } else {
        health.StableSince = lastHealth.StableSince
    }
  3. But it's important that we make a distinction between HEALTHY and STABLE: STABLE is only reached once the server has been healthy for at least the (default) 10 second server stabilisation time.
    • Note that we wait for the stabilisation time both for brand new servers (no entry in the existing AP state) and for servers that are not new but previously failed.
  4. So even before the commit the rebooted server is considered unstable at this point, BUT since it was never removed from Raft, it rejoins as a voter because it was one before.

So because it is still a voter when the next server is crashed, it can participate in the election and the cluster stays up.

Note that it's still not "stable" at this point. Since your test data set is tiny, has few writes, and there are a few seconds between each event, this is enough that your tests always passed before, but it was at least technically possible that Serf would see the server as re-joined some time before it had actually completed replaying snapshots etc.

The old bug

So before the commit the logic was incorrect due to an integer-truncation off-by-one: c47dbff#diff-41b69736174d4cff3a831dfa52061cc1L237

We were checking that removalCount < peers/2; in this case that is 1 < 3/2, but since it's integer division, 3/2 = 1, so 1 < 1 is false, which means we don't remove. The commit in question actually fixed this bug: we now check deadServers > (peers-1)/2, where the minus one corrects for the integer truncation: 1 > (3-1)/2 = 1 > 2/2 = 1 > 1 = false, and since the logic is reversed (this condition now guards the failure case, not the success case), we allow the removal.
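
To spell the arithmetic out, here is the same comparison as a small sketch (plain Python mirroring the Go integer division; not the actual Consul code):

def old_allows_removal(removal_count, peers):
    # Pre-c47dbff condition: only remove if removalCount < peers/2.
    # With integer division 3/2 == 1, so removing 1 of 3 was (wrongly) refused.
    return removal_count < peers // 2

def new_allows_removal(dead_servers, peers):
    # Post-c47dbff condition: refuse only if deadServers > (peers-1)/2.
    return not (dead_servers > (peers - 1) // 2)

print(old_allows_removal(1, 3))  # False -> "too many dead servers: 1/3" in the old log
print(new_allows_removal(1, 3))  # True  -> server1 gets pruned from the raft config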

Recoverability

To make it clearer, let me ask directly what I should expect for this situation:

A cluster of 3 Consul working servers.
2 machines die.
Now I get No cluster leader, that's expected, as we're 1-of-3.
1 dead machine comes back after 2 minutes of downtime, the other one stays down.
What should happen now? Should Consul get into 2-of-3 consensus and thus resume serving requests, or should it continue to show No cluster leader and require operator intervention?

And I'm stating that before the commit, it did not require operator intervention.

Great example - yes, that is what I'd expect to happen with pure raft. Ideally it's what Consul would do too, but there is a subtle issue caused by Autopilot here that can lead to this situation.

If you don't set min_quorum (added in 1.6.2) then it's at least in theory possible for raft's consistency to be broken when you start with more than 3 servers. For example, if you have 5 servers and 2 go down, you still have a quorum; AP will prune the two dead ones and reduce the quorum size, because it removes them from raft as if you actually wanted to downsize to just 3 servers. Now quorum is only 2 of 3 and you can actually tolerate a third failure. This potentially violates linearizability of the set of servers as a whole - if you then recovered the 3 dead servers, but somehow in a network partition separate from the 2 alive ones, they would happily form a 3-of-5 quorum without any of the writes that occurred since the third one went down, resulting in a split brain (to be clear, we have never heard of this actually happening, but it's possible). min_quorum solves that.
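
The quorum arithmetic behind that example, as a quick sketch (raft's quorum is a majority of the servers currently in the raft configuration, so pruning dead servers shrinks the requirement):

def quorum_size(raft_config_size):
    return raft_config_size // 2 + 1

def failure_tolerance(raft_config_size):
    return raft_config_size - quorum_size(raft_config_size)

print(quorum_size(5), failure_tolerance(5))  # 3 2: five servers tolerate two failures
# After autopilot prunes the two dead servers, the raft config has only 3 members:
print(quorum_size(3), failure_tolerance(3))  # 2 1: a third failure is now tolerated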

With only 3 servers split brain can't happen so there is no safety issue. But there is a liveness one. After the first server is lost, AP will prune it (at least it will since that commit) leaving 2 raft voters which still has a quorum size of 2. If the leader fails next, the single follower will not be able to get enough votes to become leader so the cluster is down. If the follower fails next then the existing leader will eventually step down. Even if dead server pruning runs before the leader's raft loop detects the failure and steps down, it won't be able to commit the Raft config change to remove the second server without a quorum of 2 and so the second server is never actually removed from the raft config. So there can be no more "split brain" writes even if you partition the dead servers and bring them back up.

But what can happen is that, depending on the order in which servers come back, you can end up with a stuck cluster. In the above example, if the second server to fail (the one still in the raft config) comes back up, then all is well - raft continues, and the third server can rejoin and will become a full voter again after the stabilisation period. But if the first server to crash comes back (the one that was removed from raft by Autopilot) then it can't rejoin the cluster, since there is no current leader who can commit a new raft config to add it back into the raft cluster. So this is the state your tests get into now - 2 of 3 servers are up, but only one of the 2 servers that are in the current raft config is up, so there is no quorum to be able to re-add the third. This is easy to reproduce locally (see the sketch after the steps):

  1. Run 3 consul servers joined to each other; let's call them s1, s2, s3, and assume s1 is the leader initially
  2. Kill s3
  3. observe (consul operator raft list-peers) autopilot removes s3, still have quorum.
  4. Kill s2
  5. Quorum is lost, note that AP doesn't run since there is no leader and it can't change config
  6. Restart s3 (if you restart s2 here it all recovers fine)
  7. s3 joins gossip cluster but is not a member of the raft config so can't vote to elect a new leader. Since there is no leader, nothing can add s3 back into the raft cluster to get the cluster stable. It's "stuck" indefinitely
  8. restart s2, now there is a quorum of [s1, s2] again, they can elect a leader who can then re-add s3 and get everything back to health.
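
The same sequence as a toy model (just sets and counting, not Consul internals): track which servers are in the raft configuration versus merely running, and let "autopilot" prune dead servers only while a quorum of the raft config is up, i.e. while a leader can commit the config change.

raft_config = {"s1", "s2", "s3"}   # raft voters
running = {"s1", "s2", "s3"}       # server processes that are up

def has_quorum():
    return len(raft_config & running) > len(raft_config) // 2

def autopilot_prune():
    if has_quorum():                            # config changes need a leader
        raft_config.intersection_update(running)

running.discard("s3"); autopilot_prune()        # steps 2-3: s3 killed and pruned
print(sorted(raft_config), has_quorum())        # ['s1', 's2'] True
running.discard("s2"); autopilot_prune()        # steps 4-5: s2 killed, no quorum, no pruning
print(sorted(raft_config), has_quorum())        # ['s1', 's2'] False
running.add("s3")                               # steps 6-7: s3 restarts but is not a voter
print(has_quorum())                             # False -> stuck: no leader to re-add s3
running.add("s2")                               # step 8: quorum of [s1, s2] restored
print(has_quorum())                             # True -> a leader can be elected and re-add s3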

The solution to this is exactly what we added min_quorum for. Ideally we'd require this to be set at startup, or infer it from bootstrap_expect, since it's so important for clusters to behave as expected, but there were backwards-compatibility issues with that.

If you set autopilot.min_quorum to 3, your tests will pass again, and this is how a 3-node cluster should be run in production. We need to improve the documentation around that and possibly consider how we can default it.
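
To make that concrete: min_quorum can be set in the server configuration file (autopilot { min_quorum = 3 }) or adjusted at runtime via the operator API. The following is a rough sketch of the latter, assuming a local agent at 127.0.0.1:8500 and Consul >= 1.6.2; it reads the current autopilot configuration first so that the other settings are preserved.

import json
import urllib.request

URL = "http://127.0.0.1:8500/v1/operator/autopilot/configuration"

# Read the current autopilot configuration...
with urllib.request.urlopen(URL) as resp:
    config = json.load(resp)

# ...and write it back with MinQuorum set, so autopilot never prunes below 3 voters.
config["MinQuorum"] = 3
req = urllib.request.Request(URL, data=json.dumps(config).encode(), method="PUT")
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 on success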

Summary of findings

  1. The only reason this used to pass seems to be a bug - removing 1/3 was always "safe" and the intention of the old code was to allow this as far as I can see; however, an off-by-one error meant that we didn't in this case. The change that broke the test happened to fix this logic.
  2. Our upgrade documentation certainly should be more specific about what "healthy" means. I speculate that part of the reason tooling to actually examine raft health didn't exist originally is that Consul datasets were imagined to be tiny, so restart and raft setup time was generally small and few people ran into issues where Serf health vs raft health made a difference - just like your test case. We should fix it to be clearer though: it's not safe to restart another server until a full quorum of others are up, stable and eligible to vote.
  3. Since your test was first written, we have also added the min_quorum configuration for autopilot to allow operators to specify explicitly that they never intend to run fewer than N servers, and so autopilot should not clean up dead ones below that level. If you set this to 3 in your test config, it would pass again without waiting for stabilisation. Although I still argue that waiting at least for raft health is technically safer than just Serf health, it's unlikely to be different enough in your test to matter once servers are not being pruned.

Action Items

  1. Improve the upgrading docs to be more explicit about how to measure health during a rolling upgrade.
  2. Improve the config docs for stabilisation time to make it clear it applies to unhealthy servers becoming healthy again too if they have been pruned by autopilot
  3. Improve the docs for min_quorum because they could be more specific (and the name is confusing, so consider changing it), and, given how important it is for correct operation, it should be more discoverable and covered in at least the production guides, if not the basic setup guides, if it isn't already.
  4. Consider making min_quorum default to something sensible on initial raft bootstrap

@banks banks added theme/internals Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics type/docs Documentation needs to be created/updated/clarified labels Jun 17, 2020
@banks banks added this to the 1.8.x milestone Jun 17, 2020
@banks banks changed the title Regression: Consensus permanently lost after rolling reboot autopilot: cluster can fail to recover from outages when min_quorum isn't set. Jun 17, 2020
@banks
Member

banks commented Jun 17, 2020

@nh2 I've renamed the issue so that we can track the proposed action items above meaningfully now the issue is understood, I hope that's OK.

@nh2
Author

nh2 commented Jun 18, 2020

@banks Thanks for your detailed and well-structured reply.

I understand it, and the summary and action items are great. I have nothing to add.

Indeed, min_quorum = 3 does fix the test. I will additionally update it to switch to consul operator raft list-peers instead, and later to the method recommended by the improved docs from action item 1.

(Of course I would prefer if this method could provide a simple yes/no answer, computed from the current state and Consul's config, instead of the user having to compute it themselves by counting and knowing the configured value of min_quorum, to avoid user mistakes.)

nh2 added a commit to nh2/nixpkgs that referenced this issue Jun 18, 2020
Done by setting `autopilot.min_quorum = 3`.

Technically, this would have been required to keep the test correct ever since
Consul's "autopilot" "Dead Server Cleanup" was enabled by default (I believe
that was in Consul 0.8). Practically, the issue only occurred with our NixOS
test with releases >= `1.7.0-beta2` (see NixOS#90613). The setting itself is
available since Consul 1.6.2.

However, this setting was not documented clearly enough for anybody to notice,
and only the upstream issue hashicorp/consul#8118
I filed brought that to light.

As explained there, the test could also have been made to pass by applying the
more correct rolling reboot procedure

    -m.wait_until_succeeds("[ $(consul members | grep -o alive | wc -l) == 5 ]")
    +m.wait_until_succeeds(
    +    "[ $(consul operator raft list-peers | grep true | wc -l) == 3 ]"
    +)

but we also intend to test that Consul can regain consensus even if
the quorum gets temporarily broken.
@banks
Member

banks commented Jun 18, 2020

(Of course I would prefer if this method could provide a simple yes/no answer, computed from the current state and Consul's config, instead of the user having to compute it themselves by counting and knowing the configured value of min_quorum, to avoid user mistakes.)

This is a really good request.

We sort of have this already in an API endpoint that for some reason is not currently exposed in any of the CLI commands. See https://www.consul.io/api-docs/operator/autopilot#read-health.

If you were to curl that on any server and check for Healthy: true then that is a simple one-shot way to know it's safe to continue rolling servers.
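
A small sketch of that check (assuming the agent's HTTP API at 127.0.0.1:8500): poll /v1/operator/autopilot/health and only continue the rolling restart once the cluster reports Healthy. Note that this endpoint returns HTTP 429 while the cluster is unhealthy, which most HTTP clients surface as an error.

import json
import time
import urllib.request

URL = "http://127.0.0.1:8500/v1/operator/autopilot/health"

def wait_until_cluster_healthy(timeout=300.0, interval=5.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                if json.load(resp).get("Healthy"):
                    return
        except OSError:
            pass  # HTTP 429 (cluster unhealthy) or agent unreachable; keep polling
        time.sleep(interval)
    raise TimeoutError("cluster did not become healthy in time")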

But we are thinking about how to expose more information on health during an upgrade via API and CLI generally so will hopefully improve the UX here soon!

@nh2
Author

nh2 commented Jun 20, 2020

We sort of have this already in an API endpoint that for some reason is not currently exposed in any of the CLI commands. See https://www.consul.io/api-docs/operator/autopilot#read-health.

What I find a bit confusing is that this is specific to autopilot, while https://www.consul.io/docs/upgrading#standard-upgrades suggests that the concept of "healthy" exists independent of whether you use autopilot.

Does this endpoint also exist/work when autopilot is disabled? (I personally am using autopilot so it wouldn't be a problem for me to use it, but I imagine others might have it disabled.)

@banks
Member

banks commented Jun 23, 2020

What I find a bit confusing is that this is specific to autopilot, while https://www.consul.io/docs/upgrading#standard-upgrades suggests that the concept of "healthy" exists independent of whether you use autopilot.

Yeah, it's a bit confusing, since those instructions pre-dated autopilot and were not specific enough. Autopilot was built specifically to address some of the operational difficulties around assessing cluster health and performing certain operations, which is why it's now the best place to find this information, but we didn't update those docs.

Does this endpoint also exist/work when autopilot is disabled? (I personally am using autopilot so it wouldn't be a problem for me to use it, but I imagine others might have it disabled.)

Short answer: yes.

The longer answer is that it's not super clear what you mean by "autopilot disabled". Autopilot is not just one feature. The main one in OSS is dead server cleanup, which you can disable, but that isn't all of AP - the actual process on the leader that checks the health of peers and manages promoting new servers only after they've become stable etc. is always running and can't be disabled. In other words, AP is always running, and that endpoint will always return current server health information regardless of your AP configuration/features in use.

@mikemorris mikemorris removed this from the 1.8.x milestone Nov 10, 2020
@mkeeler
Member

mkeeler commented Nov 10, 2020

With the recent autopilot overhaul there is a new API for getting more information about the current state as known to autopilot. The docs aren't live for the new API yet (but they will be once 1.9.0-beta3 or the final 1.9.0 version is released).

For now those docs are here in the code: https://github.com/hashicorp/consul/blob/master/website/pages/api-docs/operator/autopilot.mdx#read-the-autopilot-state

For the specific scenarios mentioned in this issue I am not sure it would provide a ton more value over using the existing v1/operator/autopilot/health API.

Even with those updates we still need to update the documentation to make it clearer how to use the min_quorum setting and why it's necessary.

@ketzacoatl
Contributor

Even with those updates we still need to update the documentation to make it clearer how to use the min_quorum setting and why it's necessary.

That would be really great.
