
unable to join consul 1.7.x cluster due to other members having conflicting node id's #7396

Closed · TomRitserveldt opened this issue Mar 5, 2020 · 29 comments · Fixed by #7747

Labels:
  needs-investigation: The issue described is detailed and complex.
  theme/internals: Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
  type/bug: Feature does not function as expected
  type/umbrella-☂️: Makes issue the "source of truth" for multiple requests relating to the same topic

@TomRitserveldt

Overview of the Issue

Since 1.7.0 we have noticed issues with members not being able to join because other clients in the DC have "conflicting" node IDs. An example error from a test environment in AWS is below; note that the client unable to join is a completely different node than the one with the conflicting node ID:

Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:     2020/03/05 07:34:56 [WARN] agent: (LAN) couldn't join: 0 Err: 3 errors occurred:
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.38.239: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.36.51: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.37.206: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'

In this test environment I took one of the clients, changed the node name and restarted the consul service (leaving the cluster and rejoining with a new name). On 1.6.4 this works and does not block other nodes from joining.
Below is the consul members output from a server in the cluster, showing the two old node names with status left:

[root@ip-172-22-37-158 ~]# consul members
Node                                             Address             Status  Type    Build  Protocol  DC       Segment
ip-172-22-36-208.eu-west-1.compute.internal      172.22.36.208:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-37-158.eu-west-1.compute.internal      172.22.37.158:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-38-132.eu-west-1.compute.internal      172.22.38.132:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-36-4.eu-west-1.compute.internal        172.22.36.4:8301    alive   client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.compute.internal       172.22.37.78:8301   alive   client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.compute.test.internal  172.22.37.78:8301   left    client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.test.internal          172.22.37.78:8301   left    client  1.6.4  2         sandbox  <default>

On consul 1.7.x the status for those clients is also "left", but as shown in the first log output, this blocks other clients from joining the cluster.
I think "left" clients should not cause duplicate IDs and should definitely not block other clients from joining the cluster.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 2 client nodes and 3 server nodes, all nodes on 1.7.x.
  2. Change the node_name in the consul config.json file on one of the client nodes (see the sketch after this list).
  3. Restart consul and rejoin with the new node name (you will probably already see it fail with a duplicate node ID error).
  4. The node should have left properly under the old node name.
  5. Rejoin consul on the second client node, or try to join the cluster with a new client node.
  6. See it fail because the first member has conflicting node IDs.
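
For illustration, a rough shell sketch of steps 2, 3 and 5 on a client node. The config path, systemd unit name and node names are assumptions and will differ per setup; the join address is one of the server addresses from the log output above.

```sh
# Step 2: change the node name in the agent config (path and names are assumptions)
sudo sed -i 's/"node_name": "old-client-name"/"node_name": "new-client-name"/' /etc/consul.d/config.json

# Step 3: leave the cluster gracefully under the old name, then restart under the new one
consul leave
sudo systemctl restart consul

# Step 5: try to (re)join via one of the servers
consul join 172.22.38.239
```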

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 4
build:
	prerelease = 
	revision = 95fb95bf
	version = 1.7.0
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 53
	max_procs = 2
	os = linux
	version = go1.12.16
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 8
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 4
	member_time = 552
	members = 9
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 4
build:
	prerelease = 
	revision = 95fb95bf
	version = 1.7.0
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.22.38.162:8300
	server = true
raft:
	applied_index = 4193
	commit_index = 4193
	fsm_pending = 0
	last_contact = 0
	last_log_index = 4193
	last_log_term = 11
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:889c7894-a360-9e48-6be4-304ac6cba83c Address:172.22.38.162:8300} {Suffrage:Voter ID:7f05c5b7-c8e9-65fd-0139-595c0a5fc94c Address:172.22.36.26:8300} {Suffrage:Voter ID:700913b8-dd52-54ae-a1ef-ba43d7346c71 Address:172.22.37.68:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 11
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 88
	max_procs = 2
	os = linux
	version = go1.12.16
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 553
	members = 5
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 24
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

Distributor ID: Debian
Description: Debian GNU/Linux 9.11 (stretch)
Release: 9.11
Codename: stretch

nodes in AWS


hanshasselberg self-assigned this Mar 5, 2020
@schristoff (Contributor)

Hey @TomRitserveldt,

Thank you so much for bringing this up to us. It sounds like the problem you are seeing is that once a node enters a "left" state, no other nodes can join the cluster (regardless of ID). Is that correct?

Based on the reproduction steps you've given, it looks like once the node is in the left state you are trying to add a new node with the same ID. This will always fail because of the TombstoneTimeout in the Serf library: a node has to wait 24 hours after entering the left state before it is reaped. Once the node is reaped, all of its data is gone from the cluster and its ID, IP, etc. can be reassigned. If you'd like to bypass this timeout, I would recommend looking into the consul force-leave command.
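
For reference, a sketch of that suggestion using the stale node name from the members output above; the -prune variant, which commenters below also try, removes the member entry entirely instead of waiting for it to be reaped:

```sh
# Transition the named member to the "left" state (normally used on failed members)
consul force-leave ip-172-22-37-78.eu-west-1.test.internal

# Variant used later in this thread: also remove ("prune") the entry from the member list
consul force-leave -prune ip-172-22-37-78.eu-west-1.test.internal
```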

Please let me know if the issue is the first item I mentioned and we can continue digging in.

@aep commented Mar 20, 2020

Same issue. force-leave has no effect, and the conflicting entries no longer show up in members on any of the other nodes.

Is there a way to set the tombstone timeout to something like 1 second to prevent this?

Restarting the leader manually helps, by the way, even though the error says the conflicts are on some other node.

@TomRitserveldt (Author) commented Mar 23, 2020

@s-christoff yes, any other unrelated nodes are unable to join the cluster because of these conflicting IDs, regardless of their own ID.

As @aep said as well, the force-leave command does not fix this. The only workaround for now is to restart a consul server/leader every time this error occurs.

EDIT: We do not have this issue at all running consul servers on any version below 1.7.x, so we feel some behaviour regarding left nodes must have changed there, even though we see nothing in the changelog that would indicate it.

@leeif commented Mar 23, 2020

Same here. The conflict error occurs when I change my server's hostname (which is used as the node name). force-leave has no effect; consul force-leave -prune <node> can remove the node from the members list, but the conflict error seems to continue.

@shamil (Contributor) commented Mar 29, 2020

I'm experiencing exactly the same issue. Even after the TombstoneTimeout has passed and the left/failed member(s) are no longer in the node list, they still cause has conflicting node ID errors and new agents fail to join the cluster. As others mentioned, only after restarting the consul server leader does the error go away.

I'm running consul 1.7.1

@benvanstaveren

We also have this exact same issue, except with Consul 1.7.2: renaming a node and restarting it causes all agents to be unable to rejoin, and restarting the leader does not solve the issue either.

@hanshasselberg (Member)

Thank you for reporting! This is something we will look into for the next release!

jsosulska added the needs-investigation, theme/internals, and type/bug labels on Apr 20, 2020
@benvanstaveren commented Apr 20, 2020

Spinning up a new server node and having it join the cluster also fails:

# consul join srv-002.xxx.consul.yyy
Error joining address 'srv-002.xxx.consul.yyy': Unexpected response code: 500 (1 error occurred:
	* Failed to join 95.xxx.xxx.100: Member 'srv-002.zzz.consul.yyy' has conflicting node ID '518ab4e0-d07a-509d-21bc-cdd94ff6c212' with member 'srv-002-zzz'

So far I have tried running consul force-leave -prune for both conflicting names (e.g. srv-002-zzz and srv-002.zzz.consul.yyy), repeatedly, until the consul leader claims there is no such member. Even after waiting another 10 minutes, attempting to join an agent (or server) to the cluster results in the same error.

Hoping the next release can happen soon...

@KLIM8D commented Apr 22, 2020

Restarting the consul server was needed in my case before any node could join or even rejoin. If I stopped the consul service on any client that was already part of the cluster, it was unable to rejoin afterwards, getting the same error message about conflicting node IDs.

Removing everything in the data-dir and starting consul with -disable-host-node-id or -node-id=.. on the new client didn't have any effect (sketched below). As soon as the server was restarted, they joined the cluster.

Build: 1.7.2
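
For context, a rough sketch of what that attempt looks like on the command line; the data-dir path and config directory are assumptions, not taken from this environment:

```sh
# Stop the agent and wipe its state (paths are assumptions)
sudo systemctl stop consul
sudo rm -rf /opt/consul/data/*

# Start with a node ID generated randomly instead of derived from the host ID...
consul agent -config-dir=/etc/consul.d -disable-host-node-id

# ...or pin an explicit node ID (Consul expects a UUID)
consul agent -config-dir=/etc/consul.d -node-id="$(uuidgen | tr '[:upper:]' '[:lower:]')"
```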

@quinndiggityaxiom

@KLIM8D did you restart ALL your consul servers, or do you only have one?

We are experiencing the same issue, and no nodes can join the cluster because nodes that aren't even in the cluster have conflicting IDs.

So, for example, because staging-docker-swarm-001 conflicts with docker-swarm-001 (neither of which is in the cluster or appears under consul members, and both of which have been given consul force-leave -prune staging-docker-swarm-001 / consul force-leave -prune docker-swarm-001 multiple times, which clearly reports No node found with name ...), staging-worker-abcde can't join the cluster, and everything is broken.

Servers can't rejoin the cluster after restarting consul, so we will have to rebuild Vault from scratch, as its data is on consul and the cluster is borked.

@hashicorp-support @i0rek can you please put some urgency on this

@quinndiggityaxiom

Yep, this bug leads to complete data loss with consul

------ Warning, do not upgrade to consul 1.7.x ------

This is a disaster, how did this make it through QA?

@KLIM8D commented Apr 27, 2020

@quinndiggityaxiom Yes, I have only one consul server. I'm not sure whether all servers have to be restarted, or just one, if you have more than one consul server.

@jsosulska (Contributor)

Thank you to everyone who has been so patient in reporting this. We're actively working on this and are looking to get a fix out soon. We're tracking this for 1.7, and have seen it in 1.5 too.

Seems like #7445 and #7692 are also related.

We'll be using this issue to track all conflicting NodeID issues.

jsosulska added the type/umbrella-☂️ label on Apr 28, 2020
@princepathria97

Any workaround on older versions? I am not able to join a new node via:

  1. data dir cleanup
  2. leave
  3. force-leave
  4. copying data from a running node and then starting
  5. disabling the node ID

I'm facing the same error on 0.8.4 with raft v2.

@rustamgk commented Jul 1, 2020

Hi guys, any progress on this issue? We are facing the same on 1.7.2.

@hanshasselberg (Member)

@princepathria97 assuming the node whose node-id you want to reuse is no longer alive, you should be able to boot the new node after 72h. That is how long memberlist will hold onto the node-id. If you are still experiencing problems, please open a separate issue with a reference to this one.

@rustamgk yes, good news. We merged a fix and released it in Consul v1.7.3: #7747.

@alkalinecoffee

Is there any guidance on whether our clients should upgrade to 1.7.3 (but servers remain on 1.7.2)? Or are we required to upgrade our servers to effectively resolve this issue?

We're unable to upgrade our servers to 1.7.3 at the moment, so we're wondering if our clients could run 1.7.3 to get around this issue until we can get to upgrading the servers.

@ryanmt commented Apr 8, 2021

I've run into this on 1.6.6. Is there any technical reason why this fix wasn't pulled into 1.6.x releases? I don't see the change from pull #7747 in v1.6.10.

@srinirei

Any update on this issue? We are using 1.7.3 and 3 consul servers. We have tight deployment schedules and this issue is failing our deployments. Restarting servers every time we see the issue is not convenient.

@KLIM8D commented May 27, 2021

@srinirei Update consul. We're not experiencing this on v1.9.3

@andrewnazarov commented Jul 28, 2021

We are experiencing the same on v1.10.0 (chart version 0.32.1). We noticed that -disable-host-node-id=false is removed in the newer version of the Chart.


@xkkker commented Aug 12, 2021

any updates?

@nbari commented Sep 27, 2021

Setting disable_host_node_id to either false or true is not helping; when creating 10 VMs, for example, half of them get the same node-id. Any workaround?

@acodetailor

Meeting the same problem in 1.5.3.

@msirovy commented May 31, 2022

Consul 1.9.0 - same problem

@Amier3 (Contributor) commented Jun 3, 2022

Hey @msirovy and @acodetailor

We'd suggest an upgrade to at least 1.9.3 to see if that fixes the issue. It seems like this issue was partly fixed, but the bug is still occurring in some circumstances.

We'll continue tracking this bug in #13037 since it contains the most relevant and up-to-date explanation of what's going on.

@3coma3 commented Jun 6, 2022

For what it's worth, stopping the agent, removing the node-id file in the data directory, and starting the agent again solved this problem for me (it originated on cloned VMs). Upon restarting, the agents generate a new, distinct ID and are able to join.
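
A minimal sketch of that procedure, assuming a systemd-managed agent and /opt/consul/data as the data directory:

```sh
sudo systemctl stop consul
sudo rm /opt/consul/data/node-id    # the agent writes a fresh node-id file on its next start
sudo systemctl start consul
consul members                      # the node should now show up with a new, distinct node ID
```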

@Wicaeed commented Jul 13, 2023

^ Doing this also resolved the alerts for me in our EKS environment running Consul 1.16.0. I just had to exec into the offending host's consul-client pod and remove the /data/node-id file manually, then delete the pod itself to allow the DaemonSet to restart it (rough kubectl sketch below).

As soon as we did this the alerts ceased. Now to identify exactly why this occurs...
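
A rough kubectl equivalent of that procedure; the namespace, pod name and /data mount path are assumptions based on the description above:

```sh
# Remove the stale node-id inside the offending consul-client pod (names are placeholders)
kubectl -n consul exec consul-client-abcde -- rm /data/node-id

# Delete the pod so the DaemonSet recreates it; the new agent generates a fresh node ID
kubectl -n consul delete pod consul-client-abcde
```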
