
unable to join consul 1.7.x cluster due to other members having conflicting node id's #7396

Closed · TomRitserveldt opened this issue Mar 5, 2020 · 29 comments · Fixed by #7747

Labels:
  needs-investigation: The issue described is detailed and complex.
  theme/internals: Serf, Raft, SWIM, Lifeguard, Anti-Entropy, locking topics
  type/bug: Feature does not function as expected
  type/umbrella-☂️: Makes issue the "source of truth" for multiple requests relating to the same topic

@TomRitserveldt

Overview of the Issue

Since 1.7.0 we have noticed issues with members not being able to join because other clients in the DC have "conflicting" node IDs. An example error from a test environment in AWS is below; note that the client unable to join is a completely different node than the one with the conflicting node ID:

Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:     2020/03/05 07:34:56 [WARN] agent: (LAN) couldn't join: 0 Err: 3 errors occurred:
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.38.239: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.36.51: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'
Mar 05 07:34:56 ip-172-22-36-4 consul[17772]:         * Failed to join 172.22.37.206: Member 'ip-172-22-37-78.eu-west-1.compute.test.internal' has conflicting node ID 'b216bc09-937a-41d6-b681-9401413f2d9b' with member 'ip-172-22-37-78.eu-west-1.test.internal'

In this test environment I took one of the clients, changed the node name and restarted the consul service (leaving the cluster and rejoining with a new name). On 1.6.4 this works and does not block other nodes from joining.
Below is the consul members output from a server in the cluster, showing the two old node names with status left:

[root@ip-172-22-37-158 ~]# consul members
Node                                             Address             Status  Type    Build  Protocol  DC       Segment
ip-172-22-36-208.eu-west-1.compute.internal      172.22.36.208:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-37-158.eu-west-1.compute.internal      172.22.37.158:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-38-132.eu-west-1.compute.internal      172.22.38.132:8301  alive   server  1.6.4  2         sandbox  <all>
ip-172-22-36-4.eu-west-1.compute.internal        172.22.36.4:8301    alive   client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.compute.internal       172.22.37.78:8301   alive   client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.compute.test.internal  172.22.37.78:8301   left    client  1.6.4  2         sandbox  <default>
ip-172-22-37-78.eu-west-1.test.internal          172.22.37.78:8301   left    client  1.6.4  2         sandbox  <default>

On consul 1.7.x the status for those clients is also "left", but as shown in the first log output, this blocks other clients from joining the cluster.
I think "left" clients should not cause duplicate IDs and should definitely not block other clients from joining the cluster.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 2 client nodes and 3 server nodes, all nodes on 1.7.x.
  2. Change the node_name in the consul config.json file on one of the client nodes (see the sketch after this list).
  3. Restart consul and rejoin with the new node name (you will probably already see it fail with a duplicate node ID error).
  4. The node should have left properly under the old node name.
  5. Rejoin consul on the second client node, or try to join the cluster with a new client node.
  6. See it fail because the first member has conflicting node IDs.
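
For illustration, a rough shell sketch of steps 2, 3 and 5 on a client node. The config path, systemd unit name and node names are assumptions and will differ per setup; the join address is one of the server addresses from the log output above.

```sh
# Step 2: change the node name in the agent config (path and names are assumptions)
sudo sed -i 's/"node_name": "old-client-name"/"node_name": "new-client-name"/' /etc/consul.d/config.json

# Step 3: leave the cluster gracefully under the old name, then restart under the new one
consul leave
sudo systemctl restart consul

# Step 5: try to (re)join via one of the servers
consul join 172.22.38.239
```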

Consul info for both Client and Server

Client info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 4
build:
	prerelease = 
	revision = 95fb95bf
	version = 1.7.0
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 53
	max_procs = 2
	os = linux
	version = go1.12.16
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 8
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 4
	member_time = 552
	members = 9
	query_queue = 0
	query_time = 1
Server info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 4
build:
	prerelease = 
	revision = 95fb95bf
	version = 1.7.0
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.22.38.162:8300
	server = true
raft:
	applied_index = 4193
	commit_index = 4193
	fsm_pending = 0
	last_contact = 0
	last_log_index = 4193
	last_log_term = 11
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:889c7894-a360-9e48-6be4-304ac6cba83c Address:172.22.38.162:8300} {Suffrage:Voter ID:7f05c5b7-c8e9-65fd-0139-595c0a5fc94c Address:172.22.36.26:8300} {Suffrage:Voter ID:700913b8-dd52-54ae-a1ef-ba43d7346c71 Address:172.22.37.68:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 11
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 88
	max_procs = 2
	os = linux
	version = go1.12.16
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 553
	members = 5
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 24
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

Distributor ID: Debian
Description: Debian GNU/Linux 9.11 (stretch)
Release: 9.11
Codename: stretch

nodes in AWS


hanshasselberg self-assigned this Mar 5, 2020
@schristoff (Contributor)

Hey @TomRitserveldt,

Thank you so much for bringing this up to us. It sounds like the problem you are seeing is that once a node enters a "left" state, no other nodes can join the cluster (regardless of ID). Is that correct?

Based on the reproduction steps you've given, it looks like once the node is in the left state you are trying to add a new node with the same ID. This will always fail because of the TombstoneTimeout in the Serf library: a node has to wait 24 hours after entering the left state before it is reaped. Once the node is reaped, all of its data is gone from the cluster and its ID, IP, etc. can be reassigned. If you'd like to bypass this timeout, I would recommend looking into the consul force-leave command.
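
For reference, a sketch of that suggestion using the stale node name from the members output above; the -prune variant, which commenters below also try, removes the member entry entirely instead of waiting for it to be reaped:

```sh
# Transition the named member to the "left" state (normally used on failed members)
consul force-leave ip-172-22-37-78.eu-west-1.test.internal

# Variant used later in this thread: also remove ("prune") the entry from the member list
consul force-leave -prune ip-172-22-37-78.eu-west-1.test.internal
```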

Please let me know if the issue is the first item I mentioned and we can continue digging in.

@aep commented Mar 20, 2020

Same issue. force-leave has no effect, and the conflicting entries no longer show up in members on any of the other nodes.

Is there a way to set the tombstone timeout to something like 1 second to prevent this?

Restarting the leader manually helps, by the way, even though the error says the conflicts are on some other node.

@TomRitserveldt (Author) commented Mar 23, 2020

@s-christoff yes, any other unrelated nodes are unable to join the cluster because of these conflicting IDs, regardless of their own ID.

As @aep said as well, the force-leave command does not fix this. The only workaround for now is to restart a consul server/leader every time this error occurs.

EDIT: We do not have this issue at all running consul servers on any version below 1.7.x, so we feel some behaviour regarding left nodes must have changed there, even though we see nothing in the changelog that would indicate it.

@leeif commented Mar 23, 2020

Same here. The conflict error occurs when I change my server's hostname (which is used as the node name). force-leave has no effect; consul force-leave -prune <node> can remove the node from the members list, but the conflict error seems to continue.

@shamil (Contributor) commented Mar 29, 2020

I'm experiencing exactly the same issue. Even after the TombstoneTimeout has passed and the left/failed member(s) are no longer in the node list, they still cause has conflicting node ID errors and new agents fail to join the cluster. As others mentioned, only after restarting the consul server leader does the error go away.

I'm running consul 1.7.1

@benvanstaveren

We also have this exact same issue, except with Consul 1.7.2: renaming a node and restarting it causes all agents to be unable to rejoin, and restarting the leader does not solve the issue either.

@hanshasselberg (Member)

Thank you for reporting! This is something we will look into for the next release!

jsosulska added the needs-investigation, theme/internals, and type/bug labels on Apr 20, 2020
@benvanstaveren commented Apr 20, 2020

Spinning up a new server node and having it join the cluster also fails:

# consul join srv-002.xxx.consul.yyy
Error joining address 'srv-002.xxx.consul.yyy': Unexpected response code: 500 (1 error occurred:
	* Failed to join 95.xxx.xxx.100: Member 'srv-002.zzz.consul.yyy' has conflicting node ID '518ab4e0-d07a-509d-21bc-cdd94ff6c212' with member 'srv-002-zzz'

So far I have tried running consul force-leave -prune for both conflicting names (e.g. srv-002-zzz and srv-002.zzz.consul.yyy), repeatedly, until the consul leader claims there is no such member. Even after waiting another 10 minutes, attempting to join an agent (or server) to the cluster results in the same error.

Hoping the next release can happen soon...

@KLIM8D commented Apr 22, 2020

Restarting the consul server was needed in my case before any node could join or even rejoin. If I stopped the consul service on any client that was already part of the cluster, it was unable to rejoin afterwards, getting the same error message about conflicting node IDs.

Removing everything in the data-dir and starting consul with -disable-host-node-id or -node-id=.. on the new client didn't have any effect (sketched below). As soon as the server was restarted, they joined the cluster.

Build: 1.7.2
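
For context, a rough sketch of what that attempt looks like on the command line; the data-dir path and config directory are assumptions, not taken from this environment:

```sh
# Stop the agent and wipe its state (paths are assumptions)
sudo systemctl stop consul
sudo rm -rf /opt/consul/data/*

# Start with a node ID generated randomly instead of derived from the host ID...
consul agent -config-dir=/etc/consul.d -disable-host-node-id

# ...or pin an explicit node ID (Consul expects a UUID)
consul agent -config-dir=/etc/consul.d -node-id="$(uuidgen | tr '[:upper:]' '[:lower:]')"
```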

@quinndiggityaxiom

@KLIM8D did you restart ALL your consul servers, or do you only have one?

We are experiencing the same issue, and no nodes can join the cluster because nodes that aren't even in the cluster have conflicting IDs.

So, for example, because staging-docker-swarm-001 conflicts with docker-swarm-001 (neither of which is in the cluster or appears under consul members, and both of which have been given consul force-leave -prune staging-docker-swarm-001 / consul force-leave -prune docker-swarm-001 multiple times, which clearly reports No node found with name ...), staging-worker-abcde can't join the cluster, and everything is broken.

Servers can't rejoin the cluster after restarting consul, so we will have to rebuild Vault from scratch, as its data is on consul and the cluster is borked.

@hashicorp-support @i0rek can you please put some urgency on this

@quinndiggityaxiom

Yep, this bug leads to complete data loss with consul

------ Warning, do not upgrade to consul 1.7.x ------

This is a disaster, how did this make it through QA?

@KLIM8D commented Apr 27, 2020

@quinndiggityaxiom Yes, I have only one consul server. I'm not sure whether all servers have to be restarted, or just one, if you have more than one consul server.

@jsosulska (Contributor)

Thank you to everyone who has been so patient in reporting this. We're actively working on this and are looking to get a fix out soon. We're tracking this for 1.7, and have seen it in 1.5 too.

Seems like #7445 and #7692 are also related.

We'll be using this issue to track all conflicting NodeID issues.

jsosulska added the type/umbrella-☂️ label on Apr 28, 2020
@princepathria97

Any workaround on older versions? I am not able to join a new node via:

  1. data dir cleanup
  2. leave
  3. force-leave
  4. copying data from a running node and then starting
  5. disabling the node ID

I'm facing the same error on 0.8.4 with raft v2.

@rustamgk commented Jul 1, 2020

Hi guys, any progress on this issue? We are facing the same on 1.7.2.

@hanshasselberg (Member)

@princepathria97 assuming the node whose node-id you want to reuse is no longer alive, you should be able to boot the new node after 72h. That is how long memberlist will hold onto the node-id. If you are still experiencing problems, please open a separate issue with a reference to this one.

@rustamgk yes, good news. We merged a fix and released it in Consul v1.7.3: #7747.

@alkalinecoffee

Is there any guidance on whether our clients should upgrade to 1.7.3 (but servers remain on 1.7.2)? Or are we required to upgrade our servers to effectively resolve this issue?

We're unable to upgrade our servers to 1.7.3 at the moment, so we're wondering if our clients could run 1.7.3 to get around this issue until we can get to upgrading the servers.

@ryanmt commented Apr 8, 2021

I've run into this on 1.6.6. Is there any technical reason why this fix wasn't pulled into 1.6.x releases? I don't see the change from pull #7747 in v1.6.10.

@srinirei

Any update on this issue? We are using 1.7.3 and 3 consul servers. We have tight deployment schedules and this issue is failing our deployments. Restarting servers every time we see the issue is not convenient.

@KLIM8D commented May 27, 2021

@srinirei Update consul. We're not experiencing this on v1.9.3

@andrewnazarov commented Jul 28, 2021

We are experiencing the same on v1.10.0 (chart version 0.32.1). We noticed that -disable-host-node-id=false is removed in the newer version of the Chart.


@xkkker commented Aug 12, 2021

any updates?

@nbari commented Sep 27, 2021

Setting disable_host_node_id to either false or true is not helping; when creating 10 VMs, for example, half of them get the same node-id. Any workaround?

@acodetailor

Meeting the same problem in 1.5.3.

@msirovy commented May 31, 2022

Consul 1.9.0 - same problem

@Amier3 (Contributor) commented Jun 3, 2022

Hey @msirovy and @acodetailor

We'd suggest an upgrade to at least 1.9.3 to see if that fixes the issue. It seems like this issue was partly fixed, but the bug is still occurring in some circumstances.

We'll continue tracking this bug in #13037 since it contains the most relevant and up-to-date explanation of what's going on.

@3coma3 commented Jun 6, 2022

For what it's worth, stopping the agent, removing the node-id file in the data directory, and starting the agent again solved this problem for me (it originated on cloned VMs). Upon restarting, the agents generate a new, distinct ID and are able to join.
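
A minimal sketch of that procedure, assuming a systemd-managed agent and /opt/consul/data as the data directory:

```sh
sudo systemctl stop consul
sudo rm /opt/consul/data/node-id    # the agent writes a fresh node-id file on its next start
sudo systemctl start consul
consul members                      # the node should now show up with a new, distinct node ID
```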

@Wicaeed commented Jul 13, 2023

^ Doing this also resolved the alerts for me in our EKS environment running Consul 1.16.0. I just had to exec into the offending host's consul-client pod and remove the /data/node-id file manually, then delete the pod itself to allow the DaemonSet to restart it (rough kubectl sketch below).

As soon as we did this the alerts ceased. Now to identify exactly why this occurs...
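
A rough kubectl equivalent of that procedure; the namespace, pod name and /data mount path are assumptions based on the description above:

```sh
# Remove the stale node-id inside the offending consul-client pod (names are placeholders)
kubectl -n consul exec consul-client-abcde -- rm /data/node-id

# Delete the pod so the DaemonSet recreates it; the new agent generates a fresh node ID
kubectl -n consul delete pod consul-client-abcde
```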
