Cannot elect leader when cluster nodes up from 1 to 2 #10516
I found another difference between 1.6.10 and 1.7.0: when I stop a node, the stopped node does not disappear when I run the command |
Hi, sorry to bother you.
I actually found this problem while simulating a failure scenario. If 2/3 of the Consul nodes go down in production, does that mean the cluster can only be restored by repairing all of the nodes, rather than just enough of them to regain a majority? Strangely, on 1.6.10 everything works just fine.
… On Jun 30, 2021, at 8:11 PM, idrennanvmware wrote:
You likely have a split brain scenario here - You should go from 1->3 consul nodes (IMO) and this should resolve what you're seeing
|
@xiangma0510 our experience has been that we need to do the peers.json reset in the scenario you found. We don't have notes going back as far as 1.6, so I'm not sure whether we were affected by the difference you've seen. https://learn.hashicorp.com/tutorials/consul/recovery-outage#manual-recovery-using-peers-json is the procedure we have followed in the past. |
What's the peers.json reset? Recreate the Consul server agents?
|
I edited my post and added a link, but it's a manual recovery of a cluster. Ideally you can try the earlier steps in the document (but we never really had any success with those once we got into a state where we could no longer get a leader). |
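For anyone landing on this thread later, here is a minimal sketch of that peers.json recovery, assuming a systemd-managed Consul, the /opt/consul data_dir from the configuration posted further down, and Raft protocol v3. The server IDs and addresses are placeholders and must be replaced with each server's real node ID (the node-id file in the data directory) and server RPC address:

# Minimal sketch of the peers.json recovery (Raft protocol v3).
# The IDs and addresses below are placeholders -- use each server's
# real node ID (e.g. from /opt/consul/node-id) and its RPC address.

# 1. Stop Consul on every server first.
sudo systemctl stop consul

# 2. On each server, write raft/peers.json listing all servers.
sudo tee /opt/consul/raft/peers.json <<'EOF'
[
  { "id": "11111111-2222-3333-4444-555555555555", "address": "10.6.0.11:8300", "non_voter": false },
  { "id": "22222222-3333-4444-5555-666666666666", "address": "10.6.0.12:8300", "non_voter": false },
  { "id": "33333333-4444-5555-6666-777777777777", "address": "10.6.0.13:8300", "non_voter": false }
]
EOF

# 3. Start the servers again; the file is ingested at startup and then removed.
sudo systemctl start consul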
I checked the link you posted; I will test whether peers.json can restore the cluster. One difference, though: I stop Consul with the command |
Hi, I have tested the method of restoring the cluster using peers.json, and it can indeed restore the cluster. However, that procedure is meant for cases such as an abnormal power outage, and according to the official documentation it is an imperfect, last-resort way to restore a cluster, so it is not really suitable for the situation we may encounter. I would still like an explanation of why the scenario above makes the cluster unusable after version 1.6.10; I have tested etcd and ZooKeeper before and neither showed this problem. |
Hi @xiangma0510, Thank you for sharing the details of your issue. Based on the log output you provided, it's still not clear to me what is happening when consul-01 is started again, and why consul-03 fails to establish a quorum with the first server. That said, the problem you described sounds similar to the scenario detailed in #8118. The reporter of that issue was also unable to recover the cluster after performing a similar shutdown / restart operation on Consul 1.7.0. #8118 (comment) offers a great explanation as to why the cluster was unrecoverable. The solution to that issue was to set the autopilot.min_quorum value (https://www.consul.io/docs/agent/options#min_quorum) equal to the desired cluster size (which in your case is 3 servers) so that autopilot does not remove hosts if the number of active servers falls below this value. Could you also try setting min_quorum in your server configuration, and see if you are able to successfully recover the cluster after simulating node failures?
autopilot {
  min_quorum = 3
}
Thanks. |
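As a quick sanity check, assuming a local agent on default ports, the operator CLI can show whether the cluster actually picked up the autopilot setting:

# Inspect the autopilot configuration the cluster is actually using;
# on versions that support min_quorum, look for a MinQuorum value of 3.
consul operator autopilot get-config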
Thanks for your reply, I will try later.
|
I have tested the autopilot.min_quorum parameter, and it doesn't seem to work. Through issue #8118 I basically figured out why this problem occurs, but this parameter only determines the minimum number of voters. When 1 of my 3 nodes fails, the failed node is still eventually marked as "left". That means that if 2 of the 3 nodes later fail and the first failed node is the one brought back into the cluster first, the cluster still cannot work normally. If the first failed node's entry were not removed from the Raft configuration, that is, if it stayed in the "failed" state, would that guarantee that starting any one of the 2 failed nodes is enough for the cluster to operate normally again? |
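One way to make that difference visible, assuming default ports and a local agent, is to compare the Serf member list with the Raft peer set right after the first node goes down; whether the stopped server is still listed as a Raft voter is what decides the quorum math when the second failure happens:

# Serf view: a server that left cleanly shows Status "left",
# one that crashed shows "failed".
consul members

# Raft view: if the stopped server no longer appears here, it has been
# removed from the Raft configuration and the remaining quorum is
# computed over a smaller server set.
consul operator raft list-peers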
Here is my configuration:
datacenter = "my-dc-1"
data_dir = "/opt/consul"
client_addr = "0.0.0.0"
ui_config {
enabled = true
}
server = true
bind_addr = "10.6.0.13"
advertise_addr = "10.6.0.13"
bootstrap_expect = 3
retry_join = ["10.6.0.11","10.6.0.12","10.6.0.13"]
autopilot {
min_quorum = 3
}
leave_on_terminate = false
skip_leave_on_interrupt = true
|
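One possible caveat with the configuration above (hedging on the docs here): per the documented behavior, the autopilot stanza in the agent configuration file is only respected when the cluster is first bootstrapped, so on an already-running cluster the value may need to be set through the operator CLI instead. The -min-quorum flag below assumes a Consul version that exposes it:

# Set MinQuorum on the running cluster rather than via the config file
# (requires a cluster leader and a Consul version with min_quorum support).
consul operator autopilot set-config -min-quorum=3

# Verify the change.
consul operator autopilot get-config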
Overview of the Issue
While doing Consul leader-election testing, I found that when 2 of 3 nodes are down, the cluster stops working, which is expected. But when I start one of the previously stopped nodes, so that 2 of 3 nodes are up again, the logs keep reporting election timeouts and the cluster still cannot provide service. With version 1.6.10 everything is fine; starting with 1.7.0 it is not. I tested versions 1.6.10, 1.7.0, 1.9.6, and 1.10.0, and only 1.6.10 works normally.
Reproduction Steps
Steps to reproduce this issue (a shell sketch of this sequence follows the list):
1. Start a cluster of 3 server nodes
2. Stop 1 of the 3 nodes
3. Stop a 2nd node (2 of 3 are now down)
4. Start one of the 2 stopped nodes
5. 2 of 3 nodes are now running
6. The cluster does not work; there is no leader
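A minimal shell sketch of the sequence above, assuming the servers are named consul-01/consul-02/consul-03 as in the logs below and that Consul runs under systemd:

# Stop the first server (consul-01).
ssh consul-01 sudo systemctl stop consul

# Stop a second server (consul-02); quorum is now lost, as expected.
ssh consul-02 sudo systemctl stop consul

# Bring one of the stopped servers back (consul-02).
ssh consul-02 sudo systemctl start consul

# Two of three servers are now alive, but no leader is ever elected:
consul members
consul operator raft list-peers   # fails with "No cluster leader"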
Consul info for both Client and Server
Server info
Operating system and Environment details
cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
uname -a
Linux consul-02 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Virtual Machine
Log Fragments
consul members
Node Address Status Type Build Protocol DC Segment
consul-02 10.6.0.22:8301 alive server 1.7.0 2 my-dc-1
consul-03 10.6.0.23:8301 alive server 1.7.0 2 my-dc-1
consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
consul-01
(currently stopped)
consul-02
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.168Z [WARN] agent.server.raft: Election timeout reached, restarting election
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.168Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.22:8300 [Candidate]" term=122
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.169Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.169Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"
consul-03
Jun 25 14:40:56 consul-03 consul: 2021-06-25T14:40:56.707Z [INFO] agent.server.raft: entering follower state: follower="Node at 10.6.0.23:8300 [Follower]" leader=
Jun 25 14:41:00 consul-03 consul: 2021-06-25T14:41:00.100Z [ERROR] agent: Coordinate update error: error="No cluster leader"
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.921Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.921Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.23:8300 [Candidate]" term=128
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.922Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.923Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"
Jun 25 14:41:06 consul-03 consul: 2021-06-25T14:41:06.683Z [INFO] agent.server.raft: duplicate requestVote for same term: term=128
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.135Z [WARN] agent.server.raft: Election timeout reached, restarting election
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.135Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.23:8300 [Candidate]" term=129
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.136Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.137Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"