unable to join consul 1.7.x cluster due to other members having conflicting node IDs #7396
Comments
relevant discuss post:
Hey @TomRitserveldt, thank you so much for bringing this up to us. It sounds like the problem you are seeing is that once a node enters a "left" state, no other nodes can join the cluster (regardless of ID), is this true? Based on the reproduction steps you've given, it looks like once the node is in the left state you are trying to add a new node with the same ID. This will always fail because of the TombstoneTimeout in the Serf library. The TombstoneTimeout means a node has to wait 24 hours after entering the left state before it is reaped. Once the node is reaped, all of the node's data is gone from the cluster and the node's ID, IP, etc. can be reassigned. If you'd like to bypass this timeout I would recommend looking into the force-leave command. Please let me know if the issue is the first item I mentioned and we can continue digging in.
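For reference, the timeout bypass hinted at above is presumably the `-prune` flag of `force-leave`; a minimal sketch, with `old-node` as a hypothetical node name:

```sh
# Mark a node as "left"; it still lingers in the member list until
# the tombstone timeout (~24h) expires:
consul force-leave old-node

# Remove the node from the member list immediately, bypassing the
# tombstone timeout (available in recent Consul versions):
consul force-leave -prune old-node
```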
Same issue. force-leave has no effect, and the conflicting nodes no longer show up in `consul members` on any of the other nodes. Is there a way to set the tombstone timeout to something like 1 second to prevent this? Restarting the leader manually helps, by the way, even though the error says the conflicts are on some other node.
@s-christoff yes, any other unrelated nodes are unable to join the cluster because of these conflicting IDs, regardless of their own ID. As @aep said as well, the force-leave command does not fix this. The only fix for now is to restart a consul server/leader every time this error occurs. EDIT: We also do not have this issue at all running consul servers of any version below 1.7.x, so we feel like some behaviour regarding left nodes must have changed there, even though we see nothing that would indicate that in the changelog.
Same here. The conflict error occurs when I change my server's hostname (which is used as the node name). force-leave has no effect.
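A possible way to sidestep the hostname-rename trigger entirely is to pin the node name explicitly instead of letting it default to the hostname; a minimal sketch, with placeholder names and paths (not taken from the report above):

```sh
# By default the agent derives its node name from the machine hostname,
# so a hostname change looks like a node rename to the rest of the
# cluster. Pinning the name decouples the two (placeholder values):
consul agent \
  -node=app-server-01 \
  -data-dir=/var/lib/consul \
  -config-dir=/etc/consul.d
```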
I'm experiencing exactly the same issue. I'm running Consul
We also have this exact same issue, except with Consul 1.7.2: renaming a node and restarting it causes all agents to be unable to rejoin, and restarting the leader does not solve the issue either.
Thank you for reporting! This is something we will look into for the next release!
Spinning up a new server node and having it join the cluster also fails:
So far I have tried doing a force-leave, without success. Hoping the next release can happen soon...
Restarting the consul server was needed in my case before any node could join or even rejoin. If I stopped the consul service on any client that was already part of the cluster, it was unable to re-join afterwards, getting the same error message about conflicting node IDs. Removing everything in the data directory was not enough on its own. Build: 1.7.2
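For anyone retracing this, the recovery sequence described above looks roughly like the following; a sketch assuming systemd units and a default-style data-dir (both placeholders):

```sh
# On the stuck client: stop the agent and wipe its local state
sudo systemctl stop consul
sudo rm -rf /var/lib/consul/*   # data-dir; on its own this was not enough

# On the server: a restart was still required before any node
# could join or rejoin
sudo systemctl restart consul

# Back on the client:
sudo systemctl start consul
```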
@KLIM8D did you restart ALL your consul servers, or do you only have one? We are experiencing the same issue, and no nodes can join the cluster because nodes that aren't even in the cluster have conflicting IDs. Servers can't rejoin the cluster after restarting consul, so we will have to rebuild Vault from scratch, as its data is on consul and the cluster is borked. @hashicorp-support @i0rek can you please put some urgency on this
Yep, this bug leads to complete data loss with consul.

------ Warning: do not upgrade to consul 1.7.x ------

This is a disaster; how did this make it through QA?
@quinndiggityaxiom Yes, I have only one consul server. I'm not sure whether all servers have to be restarted or just one, if you have more than one consul server.
Thank you to everyone who has been so patient in reporting this. We're actively working on this and are looking to get a fix out soon. We're tracking this for 1.7, and have seen it in 1.5 too. It seems like #7445 and #7692 are also related. We'll be using this issue to track all conflicting NodeID issues.
Any workaround on older versions? Not able to join a new node.
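For context, the join that fails here is the usual one; a sketch with a placeholder server address:

```sh
# One-off join against an existing cluster member:
consul join 10.0.0.10

# Or at agent startup:
consul agent -retry-join=10.0.0.10 -data-dir=/var/lib/consul
```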
Hi guys, any progress on this issue? We are facing the same on 1.7.2.
@princepathria97 assuming the node whose node-id you want to use is no longer alive, you should be able to boot the new node after 72h. This is how long memberlist will hold onto the node-id. If you are still experiencing problems, please open a separate issue with a reference to this one. @rustamgk yes, good news: we merged a fix and released it in Consul v1.7.3: #7747.
Is there any guidance on whether our clients should upgrade to 1.7.3 (but servers remain on 1.7.2)? Or are we required to upgrade our servers to effectively resolve this issue? We're unable to upgrade our servers to 1.7.3 at the moment, so we're wondering if our clients could run 1.7.3 to get around this issue until we can get to upgrading the servers.
I've run into this on 1.6.6. Is there any technical reason why this fix wasn't pulled into 1.6.x releases? I don't see the change from pull #7747 in v1.6.10.
Any update on this issue? We are using 1.7.3 and 3 consul servers. We have tight schedules for deployments and this one is failing our deployments. Restarting servers every time we see the issue is not convenient.
@srinirei Update consul. We're not experiencing this on v1.9.3.
any updates?
setting
Met the same problem in 1.5.3.
Consul 1.9.0 - same problem.
Hey @msirovy and @acodetailor, we'd suggest an upgrade to at least 1.9.3 to see if that fixes the issue. It seems like this issue was partly fixed, but the bug is still occurring in some circumstances. We'll continue tracking this bug in #13037, since it contains the most relevant and up-to-date explanation of what's going on.
For what it's worth, stopping the agent, removing the node-id file in the data directory, and starting the agent back up solved this problem for me (it originated on cloned VMs). Upon restarting, the agents generate a new distinct ID and are able to join.
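A sketch of that recovery sequence, assuming the node-id file sits at its default location under the agent's data-dir (the path below is a placeholder):

```sh
sudo systemctl stop consul

# The agent persists its ID at <data-dir>/node-id; deleting the file
# makes it generate a fresh ID on the next start
sudo rm /var/lib/consul/node-id

sudo systemctl start consul
```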
^ Doing this also resolved the alerts for me in our EKS environment running Consul 1.16.0. I just had to exec into the offending host's consul-client pod and remove the node-id file. As soon as we did this, the alerts ceased. Now to identify exactly why this occurs...
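Roughly what that looks like on Kubernetes; the pod name, namespace, and data path below are placeholders (the stock Helm chart mounts the client's data volume at `/consul/data`):

```sh
# Remove the persisted node ID from the offending client pod
kubectl exec -n consul consul-client-abcde -- rm /consul/data/node-id

# Recreate the pod so the agent starts with a freshly generated ID
kubectl delete pod -n consul consul-client-abcde
```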
Overview of the Issue
Since 1.7.0 we have noticed issues with members not being able to join because other clients in the DC have "conflicting" node IDs. Example error below, from a test environment in AWS; note that the client unable to join is a completely different node than the one with the conflicting node ID:
In this test environment I took one of the clients, changed the node name, and restarted the consul service (leaving the cluster and rejoining with a new name). On 1.6.4 this works and does not block other servers from joining.
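In shell form, the repro is roughly the following (node names, unit names, and config paths are illustrative, not taken from the report):

```sh
# On one client: gracefully leave, change the node name, and restart
consul leave
sudo systemctl stop consul
# e.g. edit node_name in the agent config from "client-old" to "client-new"
sudo systemctl start consul

# On any server: the old name shows as "left", and on 1.7.x other
# agents now fail to join with a conflicting-node-ID error
consul members
```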
Below is the consul members output from a server in the cluster, showing the two old node names with status "left":
On consul 1.7.x the status for those clients is also "left", but, as shown in the first log output, this blocks other clients from joining the cluster.
I think "left" clients should not cause duplicate id's and should definitely not block other clients from joining the cluster.
Reproduction Steps
Steps to reproduce this issue, e.g.:
Consul info for both Client and Server
Client info
Server info
Operating system and Environment details
Distributor ID: Debian
Description: Debian GNU/Linux 9.11 (stretch)
Release: 9.11
Codename: stretch
Nodes in AWS