Cannot elect leader when cluster nodes up from 1 to 2 #10516

Open
MasonXon opened this issue Jun 29, 2021 · 11 comments

Comments

@MasonXon

MasonXon commented Jun 29, 2021


Overview of the Issue

While testing Consul's automatic leader election, I found that when 2 of 3 server nodes are down, the cluster stops working, which I know is expected. But when I start one of the previously stopped nodes, so that 2 of 3 nodes are running again, the logs keep reporting election timeouts and the cluster still cannot serve requests. With version 1.6.10 everything is fine; starting with 1.7.0 it is not. I tested versions 1.6.10, 1.7.0, 1.9.6, and 1.10.0, and only 1.6.10 works normally.

Reproduction Steps

Steps to reproduce this issue, e.g.:

  1. Create a cluster with 3 server nodes.
  2. Stop 1 of the 3 nodes.
  3. Stop a second node (2 of 3 now down).
  4. Start one of the two stopped nodes.
  5. Now 2 of 3 nodes are running.
  6. The cluster does not work; there is no leader (see the command sketch below).
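
A minimal shell sketch of the sequence, assuming systemd-managed agents on consul-01/02/03 as in the logs below; the exact order of which node stops and restarts is my reconstruction:

# on consul-01: stop the first server (1 of 3 down, cluster still has a leader)
systemctl stop consul

# on consul-02: stop a second server (2 of 3 down, quorum is lost as expected)
systemctl stop consul

# on consul-02: start it again (2 of 3 servers running, quorum should be reachable)
systemctl start consul

# from consul-02 or consul-03: no leader is ever elected
consul operator raft list-peers    # fails with "No cluster leader"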

Consul info for both Client and Server

Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 95fb95bf
	version = 1.7.0
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr =
	server = true
raft:
	applied_index = 0
	commit_index = 0
	fsm_pending = 0
	last_contact = never
	last_log_index = 68
	last_log_term = 6
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc Address:10.6.0.21:8300} {Suffrage:Voter ID:d244e694-c619-cf0a-e3d6-701bd510b70d Address:10.6.0.22:8300} {Suffrage:Voter ID:5fc5e757-a1c5-e6f0-ed28-3149d68e44bf Address:10.6.0.23:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Candidate
	term = 108
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 73
	max_procs = 2
	os = linux
	version = go1.12.16
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 5
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 14
	members = 2
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 14
	members = 2
	query_queue = 0
	query_time = 1

Operating system and Environment details

cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
uname -a
Linux consul-02 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Virtual Machine

Log Fragments

consul members
Node Address Status Type Build Protocol DC Segment
consul-02 10.6.0.22:8301 alive server 1.7.0 2 my-dc-1
consul-03 10.6.0.23:8301 alive server 1.7.0 2 my-dc-1
consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)

consul-01
currently stopped, so there is no log output

consul-02
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.168Z [WARN] agent.server.raft: Election timeout reached, restarting election
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.168Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.22:8300 [Candidate]" term=122
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.169Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.169Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"

consul-03
Jun 25 14:40:56 consul-03 consul: 2021-06-25T14:40:56.707Z [INFO] agent.server.raft: entering follower state: follower="Node at 10.6.0.23:8300 [Follower]" leader=
Jun 25 14:41:00 consul-03 consul: 2021-06-25T14:41:00.100Z [ERROR] agent: Coordinate update error: error="No cluster leader"
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.921Z [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.921Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.23:8300 [Candidate]" term=128
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.922Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.923Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"
Jun 25 14:41:06 consul-03 consul: 2021-06-25T14:41:06.683Z [INFO] agent.server.raft: duplicate requestVote for same term: term=128
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.135Z [WARN] agent.server.raft: Election timeout reached, restarting election
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.135Z [INFO] agent.server.raft: entering candidate state: node="Node at 10.6.0.23:8300 [Candidate]" term=129
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.136Z [WARN] agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.137Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"

@MasonXon
Author

MasonXon commented Jun 29, 2021

I found another difference between 1.6.10 and 1.7.0: when I stop a node, the stopped node does not disappear from the output of consul operator raft list-peers.

@MasonXon
Author

MasonXon commented Jun 30, 2021 via email

@idrennanvmware

idrennanvmware commented Jun 30, 2021

@xiangma0510 our experience has been that we need to use the peers.json reset option in the scenario you found. We don't have notes going back as far as 1.6, so I'm not sure whether we were affected by the difference you've seen.

https://learn.hashicorp.com/tutorials/consul/recovery-outage#manual-recovery-using-peers-json

Those are the steps we have followed in the past.
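
For reference, a rough sketch of that manual recovery flow, assuming Raft protocol version 3 (as the consul info output above shows) and a data_dir of /opt/consul (an assumption; substitute your own); the server IDs and addresses are taken from the reporter's latest_configuration:

# stop the remaining Consul servers first
systemctl stop consul

# on each surviving server, write peers.json into the raft data directory
cat > /opt/consul/raft/peers.json <<'EOF'
[
  { "id": "d244e694-c619-cf0a-e3d6-701bd510b70d", "address": "10.6.0.22:8300", "non_voter": false },
  { "id": "5fc5e757-a1c5-e6f0-ed28-3149d68e44bf", "address": "10.6.0.23:8300", "non_voter": false }
]
EOF

# start the servers again; on boot they ingest peers.json and can elect a leader
systemctl start consul
consul operator raft list-peers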

@MasonXon
Author

MasonXon commented Jun 30, 2021 via email

@idrennanvmware

I edited my post and added a link, but it's a manual recovery of a cluster. Ideally you can try the earlier steps in the document (but we never really had any success with those once we got in a state where we could no longer get a leader)

@MasonXon
Author

I checked the link you posted, and I will test whether peers.json can restore the cluster. But my case is different: I stop Consul with the command systemctl stop consul, which is not the same as an outage. In my failed cluster, the two running nodes are both in the candidate state, so according to your election timeout mechanism it should not be impossible to elect a leader. Why does the cluster only become healthy again after the third node is started?

@MasonXon
Author

MasonXon commented Jul 7, 2021

Hi, I have tested the method of restoring the cluster using peers.json, and it can restore the cluster. However, it is only meant for cases like an abnormal power outage, and according to the official documentation it is an incomplete (last-resort) way to restore the cluster, so it is indeed not suitable for the situation we may encounter. I still need you to explain why this scenario makes the cluster unusable in versions after 1.6.10. I have tested etcd and ZooKeeper before and neither of them had this problem.

@blake
Member

blake commented Jul 14, 2021

Hi @xiangma0510,

Thank you for sharing the details of your issue. Based on the log output you provided, it's still not clear to me what is happening when consul-01 is started again, and why consul-03 fails to establish a quorum with the first server.

That said, the problem you described sounds similar to the scenario detailed in #8118. The reporter of that issue was also unable to recover the cluster after performing a similar shutdown / restart operation on Consul 1.7.0.

#8118 (comment) offers a great explanation as to why the cluster was unrecoverable. The solution in that issue was to set the autopilot.min_quorum value equal to the desired cluster size (which in your case is 3 servers) so that autopilot does not remove hosts if the number of active servers falls below this value.

Could you also try setting min_quorum in your server configuration, and see if you are able to successfully recover the cluster after simulating node failures?

autopilot {
  min_quorum = 3
}
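
If it helps, a quick sketch of how you might verify the setting after restarting the servers (the fields shown in the output vary by Consul version):

# confirm the autopilot configuration was applied
consul operator autopilot get-config

# then repeat the node-failure test and check that a leader is elected
consul operator raft list-peers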

Thanks.

@blake blake added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jul 14, 2021
@MasonXon
Author

MasonXon commented Jul 14, 2021 via email

@MasonXon
Author

MasonXon commented Jul 20, 2021 via email

@MasonXon
Author

MasonXon commented Jul 20, 2021 via email

@jkirschner-hashicorp jkirschner-hashicorp removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Aug 20, 2021