Can't add or remove node in unhealthy cluster #6103
So writes aren't going through. How are you starting the new member? If quorum is broken, the cluster would need to be rebuilt from backup; https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery is the safest way to do it. I'd need to see the etcdserver logs to know for sure. #6106 should prevent this sort of unsafe member add from going through in the future.
Why are there two unhealthy members? I understand you have one dead member due to the AWS instance issue.
Timeline: [3 healthy nodes] --> At this point, etcd returns some errors but is limping along. If we bring it down, our product stops working and customers lose data. I don't see any way to recover other than backing up the data directory and forming a new cluster. We found a way to have an AWS node impersonate the dead instance with its data volume, and it works for now until we can migrate off of etcd. But once quorum is lost, there is no way I can see to add new machines to attempt to recover quorum, like a --force flag or similar. There needs to be a way of overriding "context deadline exceeded" and adding nodes to the cluster as part of recovery. Suggestions to do this have been met with "no, we won't do that". That decision should be reconsidered. I don't see any other way to reform an existing cluster when you can't take it down to do a backup/restore.
What you have done to fix a failed node is not recommended. We have docs covering the exact problem you faced, with detailed instructions here: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#replace-a-failed-machine. In summary: always remove first, then add. Never add first and remove later. I am not convinced, based on your experience, that we need a way to override membership at runtime. What we can do is what @heyitsanthony suggested: prevent users from further damaging their clusters in case of some failures.
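The documented replace-a-failed-machine flow boils down to the following sketch, using etcd 2.x etcdctl against a surviving member (the member ID, name, and peer URL below are placeholders, not values from this thread):

```shell
# Find the hex ID of the failed member (IDs below are illustrative).
etcdctl member list

# 1. Remove the dead member FIRST, so the quorum requirement
#    shrinks along with the membership.
etcdctl member remove 6e3bd23ae5f1eae0

# 2. Only then add the replacement member.
etcdctl member add infra3 http://10.0.1.13:2380
```

Removing first matters because a 3-node cluster with one dead member still has quorum (2 of 3); adding a fourth member first raises the quorum requirement to 3 of 4 while only 2 members are alive.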
The "remove first" scenario was considered too risky by our developers, which is why I added first. So your position of "it's not broken so we won't fix it" still stands. I'll escalate this to them.
Do they have a reason for this? I would like to hear it. Then we might reconsider the options.
@michael-px that's not our position; it's that abandoning quorum is really risky (especially when the cluster is already in a bad way). We're aware that losing quorum is painful, but disabling quorum checks on membership changes could lead to full-fledged cluster inconsistency, and that would be even worse in many applications ("disk geometry corruption" being a candidate for most terrifying). It's too dangerous to be a legitimate fix, sorry. I totally agree the current workflow around quorum loss is crummy. @xiang90 suggested seamless failover to a new cluster could be done by setting up a proxy in front of the quorumless cluster, recovering into a new cluster, then pointing the proxy at the new cluster. Sort of complicated to do by hand, though.
@michael-px I am closing this out since I feel we have explained the reasoning in detail. We would like to know why your engineers think adding first is better. If there is a valid reason, we might reconsider your suggestion. Thanks.
Copy Anthony's answer from: etcd-io#6103 etcd-io#6114
An AWS instance failed that was running etcd 2.3.7 in a container.
I went to the "leader" node of the remaining cluster and tried to remove the dead member and got a "context deadline exceeded" error. According to various posts, this is due to the cluster being unhealthy.
I tried adding a node to the cluster with "etcdctl member add", but when I brought up that node, its cluster ID was different from the existing cluster and it wouldn't join.
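A cluster-ID mismatch on join usually means the new node bootstrapped a fresh cluster instead of joining the existing one. A sketch of the etcd 2.x startup flags a joining member needs (names, IPs, and the cluster string are placeholders for illustration):

```shell
# After `etcdctl member add infra3 http://10.0.1.13:2380` on a healthy
# member, start the NEW node with the full cluster list and state
# "existing". If --initial-cluster-state is left at its default ("new"),
# the node bootstraps its own cluster and ends up with a different
# cluster ID, which is the mismatch described above.
etcd --name infra3 \
  --initial-advertise-peer-urls http://10.0.1.13:2380 \
  --listen-peer-urls http://10.0.1.13:2380 \
  --listen-client-urls http://10.0.1.13:2379 \
  --advertise-client-urls http://10.0.1.13:2379 \
  --initial-cluster "infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380" \
  --initial-cluster-state existing
```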
I now have a cluster with 1 dead member, 1 member in the "unstarted" state, and two unhealthy nodes.
I tried adding another node with a different IP address and I get "context deadline exceeded".
At this point, I don't see any way to recover this cluster, other than to shut everything down, do an etcdctl backup from an existing set of data, and form a new cluster.
I read here:
#3505
that there's a "--force-new-cluster" flag, but I don't see it in the help.
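For reference, --force-new-cluster is a flag on the etcd server binary, not etcdctl, which is why it doesn't appear in etcdctl's help. A disaster-recovery sketch for etcd 2.x, assuming the placeholder data-directory paths below:

```shell
# 1. Take a backup from a surviving member's data directory
#    (both paths here are placeholders).
etcdctl backup \
  --data-dir /var/lib/etcd \
  --backup-dir /var/lib/etcd-backup

# 2. Start a single-node cluster from the backup. --force-new-cluster
#    discards the old membership and re-forms a one-member cluster
#    while keeping the key-value data.
etcd --data-dir /var/lib/etcd-backup --force-new-cluster

# 3. Once that node is healthy, grow back to three members with
#    `etcdctl member remove`/`member add` and
#    --initial-cluster-state existing on each new node.
```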
This cluster is production and it's currently down. What's the next step?