Cannot elect leader when cluster nodes up from 1 to 2 #10516
I found another difference between 1.6.10 and 1.7.0: when I stop a node, the stopped node does not disappear when I run the command |
Hi, sorry to bother you.
I actually found this problem while simulating a failure scenario. If 2/3 of the Consul nodes go down in production, does that mean the cluster can only be restored by repairing all of the nodes, rather than just enough of them to regain a majority? Strangely, on 1.6.10 everything works just fine.
… On Jun 30, 2021, at 8:11 PM, idrennanvmware wrote:
You likely have a split brain scenario here - You should go from 1->3 consul nodes (IMO) and this should resolve what you're seeing
|
@xiangma0510 our experience has been that we need to do the peers.json reset in the scenario you found. We don't have notes going back as far as 1.6, so I'm not sure whether we were affected by the difference you've seen. https://learn.hashicorp.com/tutorials/consul/recovery-outage#manual-recovery-using-peers-json is the procedure we have followed in the past. |
What's the peers.json reset? Recreate the Consul server agents?
|
I edited my post and added a link, but it's a manual recovery of a cluster. Ideally you can try the earlier steps in the document (but we never really had any success with those once we got into a state where we could no longer get a leader). |
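For anyone landing on this thread later, here is a minimal sketch of that peers.json recovery, assuming a systemd-managed Consul, the /opt/consul data_dir from the configuration posted further down, and Raft protocol v3. The server IDs and addresses are placeholders and must be replaced with each server's real node ID (the node-id file in the data directory) and server RPC address:

# Minimal sketch of the peers.json recovery (Raft protocol v3).
# The IDs and addresses below are placeholders -- use each server's
# real node ID (e.g. from /opt/consul/node-id) and its RPC address.

# 1. Stop Consul on every server first.
sudo systemctl stop consul

# 2. On each server, write raft/peers.json listing all servers.
sudo tee /opt/consul/raft/peers.json <<'EOF'
[
  { "id": "11111111-2222-3333-4444-555555555555", "address": "10.6.0.11:8300", "non_voter": false },
  { "id": "22222222-3333-4444-5555-666666666666", "address": "10.6.0.12:8300", "non_voter": false },
  { "id": "33333333-4444-5555-6666-777777777777", "address": "10.6.0.13:8300", "non_voter": false }
]
EOF

# 3. Start the servers again; the file is ingested at startup and then removed.
sudo systemctl start consul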
I checked the link you posted; I will test whether peers.json can restore the cluster. One difference, though: I stop Consul with the command |
Hi, I have tested the method of restoring the cluster using peers.json, and it can indeed restore the cluster. However, that procedure is meant for cases such as an abnormal power outage, and according to the official documentation it is an imperfect, last-resort way to restore a cluster, so it is not really suitable for the situation we may encounter. I would still like an explanation of why the scenario above makes the cluster unusable after version 1.6.10; I have tested etcd and ZooKeeper before and neither showed this problem. |
Hi @xiangma0510, Thank you for sharing the details of your issue. Based on the log output you provided, it's still not clear to me what is happening when consul-01 is started again, and why consul-03 fails to establish a quorum with the first server. That said, the problem you described sounds similar to the scenario detailed in #8118. The reporter of that issue was also unable to recover the cluster after performing a similar shutdown / restart operation on Consul 1.7.0. #8118 (comment) offers a great explanation as to why the cluster was unrecoverable. The solution to that issue was to set the autopilot.min_quorum value (https://www.consul.io/docs/agent/options#min_quorum) equal to the desired cluster size (which in your case is 3 servers) so that autopilot does not remove hosts if the number of active servers falls below this value. Could you also try setting min_quorum in your server configuration, and see if you are able to successfully recover the cluster after simulating node failures?
autopilot {
  min_quorum = 3
}
Thanks. |
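As a quick sanity check, assuming a local agent on default ports, the operator CLI can show whether the cluster actually picked up the autopilot setting:

# Inspect the autopilot configuration the cluster is actually using;
# on versions that support min_quorum, look for a MinQuorum value of 3.
consul operator autopilot get-config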
Thanks for your reply, I will try later.
|
I have tested the autopilot.min_quorum parameter, and it doesn't seem to work. Through issue #8118 I basically figured out why this problem occurs, but this parameter only determines the minimum number of voters. When 1 of my 3 nodes fails, the failed node is still eventually marked as "left". That means that if 2 of the 3 nodes later fail and the first failed node is the one brought back into the cluster first, the cluster still cannot work normally. If the first failed node's entry were not removed from the Raft configuration, that is, if it stayed in the "failed" state, would that guarantee that starting any one of the 2 failed nodes is enough for the cluster to operate normally again? |
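One way to make that difference visible, assuming default ports and a local agent, is to compare the Serf member list with the Raft peer set right after the first node goes down; whether the stopped server is still listed as a Raft voter is what decides the quorum math when the second failure happens:

# Serf view: a server that left cleanly shows Status "left",
# one that crashed shows "failed".
consul members

# Raft view: if the stopped server no longer appears here, it has been
# removed from the Raft configuration and the remaining quorum is
# computed over a smaller server set.
consul operator raft list-peers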
Here is my configuration:
datacenter = "my-dc-1"
data_dir = "/opt/consul"
client_addr = "0.0.0.0"
ui_config {
enabled = true
}
server = true
bind_addr = "10.6.0.13"
advertise_addr = "10.6.0.13"
bootstrap_expect = 3
retry_join = ["10.6.0.11","10.6.0.12","10.6.0.13"]
autopilot {
min_quorum = 3
}
leave_on_terminate = false
skip_leave_on_interrupt = true
|
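One possible caveat with the configuration above (hedging on the docs here): per the documented behavior, the autopilot stanza in the agent configuration file is only respected when the cluster is first bootstrapped, so on an already-running cluster the value may need to be set through the operator CLI instead. The -min-quorum flag below assumes a Consul version that exposes it:

# Set MinQuorum on the running cluster rather than via the config file
# (requires a cluster leader and a Consul version with min_quorum support).
consul operator autopilot set-config -min-quorum=3

# Verify the change.
consul operator autopilot get-config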
Overview of the Issue
While doing Consul leader-election testing, I found that when 2 of 3 nodes are down, the cluster stops working, which is expected. But when I start one of the previously stopped nodes, so that 2 of 3 nodes are up again, the logs keep reporting election timeouts and the cluster still cannot provide service. With version 1.6.10 everything is fine; starting with 1.7.0 it is not. I tested versions 1.6.10, 1.7.0, 1.9.6, and 1.10.0, and only 1.6.10 works normally.
Reproduction Steps
Steps to reproduce this issue (a shell sketch of this sequence follows the list):
1. Start a cluster of 3 server nodes
2. Stop 1 of the 3 nodes
3. Stop a 2nd node (2 of 3 are now down)
4. Start one of the 2 stopped nodes
5. 2 of 3 nodes are now running
6. The cluster does not work; there is no leader
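A minimal shell sketch of the sequence above, assuming the servers are named consul-01/consul-02/consul-03 as in the logs below and that Consul runs under systemd:

# Stop the first server (consul-01).
ssh consul-01 sudo systemctl stop consul

# Stop a second server (consul-02); quorum is now lost, as expected.
ssh consul-02 sudo systemctl stop consul

# Bring one of the stopped servers back (consul-02).
ssh consul-02 sudo systemctl start consul

# Two of three servers are now alive, but no leader is ever elected:
consul members
consul operator raft list-peers   # fails with "No cluster leader"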
Consul info for both Client and Server
Server info
Operating system and Environment details
cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
uname -a
Linux consul-02 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Virtual Machine
Log Fragments
consul members
Node Address Status Type Build Protocol DC Segment
consul-02 10.6.0.22:8301 alive server 1.7.0 2 my-dc-1
consul-03 10.6.0.23:8301 alive server 1.7.0 2 my-dc-1
consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
consul-01
(currently stopped)
consul-02
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.168Z [WARN] agent.server.raft: Election timeout reached, restarting election
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.168Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.22:8300 [Candidate]" term=122
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.169Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.169Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"
consul-03
Jun 25 14:40:56 consul-03 consul: 2021-06-25T14:40:56.707Z [INFO] agent.server.raft: entering follower state: follower="Node at 10.6.0.23:8300 [Follower]" leader=
Jun 25 14:41:00 consul-03 consul: 2021-06-25T14:41:00.100Z [ERROR] agent: Coordinate update error: error="No cluster leader"
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.921Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.921Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.23:8300 [Candidate]" term=128
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.922Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.923Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"
Jun 25 14:41:06 consul-03 consul: 2021-06-25T14:41:06.683Z [INFO] agent.server.raft: duplicate requestVote for same term: term=128
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.135Z [WARN] agent.server.raft: Election timeout reached, restarting election
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.135Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.23:8300 [Candidate]" term=129
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.136Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.137Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"