Can't add or remove node in unhealthy cluster #6103
So writes aren't going through. How are you starting the new member? If quorum is broken, the cluster would need to be rebuilt from backup; https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery is the safest way to do it. I'd need to see the etcdserver logs to know for sure. #6106 should prevent this sort of unsafe member add from going through in the future.
Why are there two unhealthy members? I understand you have one dead member due to the AWS instance issue.
Timeline: [3 healthy nodes] --> At this point, etcd returns some errors but is limping along. If we bring it down, our product stops working and customers lose data. I don't see any way to recover other than backing up the data directory and forming a new cluster. We found a way to have an AWS node impersonate the dead instance with its data volume, and it works for now until we can migrate off of etcd. But once quorum is lost, there is no way I can see to add new machines to attempt to recover quorum, like a --force flag or similar. There needs to be a way of overriding "context deadline exceeded" and adding nodes to the cluster as part of recovery. Suggestions to do this have been met with "no, we won't do that". That decision should be reconsidered. I don't see any other way to reform an existing cluster when you can't take it down to do a backup/restore.
What you have done to fix a failed node is not recommended. We have docs covering the exact problem you faced, with detailed instructions here: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/runtime-configuration.md#replace-a-failed-machine. In summary: always remove first, then add. Never add first and remove later. I am not convinced, based on your experience, that we need a way to override membership at runtime. What we can do is what @heyitsanthony suggested: prevent users from further damaging their clusters in case of some failures.
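The documented replace-a-failed-machine flow boils down to the following sketch, using etcd 2.x etcdctl against a surviving member (the member ID, name, and peer URL below are placeholders, not values from this thread):

```shell
# Find the hex ID of the failed member (IDs below are illustrative).
etcdctl member list

# 1. Remove the dead member FIRST, so the quorum requirement
#    shrinks along with the membership.
etcdctl member remove 6e3bd23ae5f1eae0

# 2. Only then add the replacement member.
etcdctl member add infra3 http://10.0.1.13:2380
```

Removing first matters because a 3-node cluster with one dead member still has quorum (2 of 3); adding a fourth member first raises the quorum requirement to 3 of 4 while only 2 members are alive.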
The "remove first" scenario was considered too risky by our developers, which is why I added first. So your position of "it's not broken so we won't fix it" still stands. I'll escalate this to them.
Do they have a reason for this? I would like to hear it. Then we might reconsider the options.
@michael-px that's not our position; it's that abandoning quorum is really risky (especially when the cluster is already in a bad way). We're aware that losing quorum is painful, but disabling quorum checks on membership changes could lead to full-fledged cluster inconsistency, and that would be even worse in many applications ("disk geometry corruption" being a candidate for most terrifying). It's too dangerous to be a legitimate fix, sorry. I totally agree the current workflow around quorum loss is crummy. @xiang90 suggested seamless failover to a new cluster could be done by setting up a proxy in front of the quorumless cluster, recovering into a new cluster, then pointing the proxy at the new cluster. Sort of complicated to do by hand, though.
@michael-px I am closing this out since I feel we have explained the reasoning in detail. We would like to know why your engineers think adding first is better. If there is a valid reason, we might reconsider your suggestion. Thanks.
Copy Anthony's answer from: etcd-io#6103 etcd-io#6114
An AWS instance failed that was running etcd 2.3.7 in a container.
I went to the "leader" node of the remaining cluster and tried to remove the dead member and got a "context deadline exceeded" error. According to various posts, this is due to the cluster being unhealthy.
I tried adding a node to the cluster with "etcdctl member add", but when I brought up that node, its cluster ID was different from the existing cluster and it wouldn't join.
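A cluster-ID mismatch on join usually means the new node bootstrapped a fresh cluster instead of joining the existing one. A sketch of the etcd 2.x startup flags a joining member needs (names, IPs, and the cluster string are placeholders for illustration):

```shell
# After `etcdctl member add infra3 http://10.0.1.13:2380` on a healthy
# member, start the NEW node with the full cluster list and state
# "existing". If --initial-cluster-state is left at its default ("new"),
# the node bootstraps its own cluster and ends up with a different
# cluster ID, which is the mismatch described above.
etcd --name infra3 \
  --initial-advertise-peer-urls http://10.0.1.13:2380 \
  --listen-peer-urls http://10.0.1.13:2380 \
  --listen-client-urls http://10.0.1.13:2379 \
  --advertise-client-urls http://10.0.1.13:2379 \
  --initial-cluster "infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380,infra3=http://10.0.1.13:2380" \
  --initial-cluster-state existing
```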
I now have a cluster with 1 dead member, 1 member in the "unstarted" state, and two unhealthy nodes.
I tried adding another node with a different IP address and I get "context deadline exceeded".
At this point, I don't see any way to recover this cluster, other than to shut everything down, do an etcdctl backup from an existing set of data, and form a new cluster.
I read here:
#3505
that there's a "--force-new-cluster" flag, but I don't see it in the help.
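For reference, --force-new-cluster is a flag on the etcd server binary, not etcdctl, which is why it doesn't appear in etcdctl's help. A disaster-recovery sketch for etcd 2.x, assuming the placeholder data-directory paths below:

```shell
# 1. Take a backup from a surviving member's data directory
#    (both paths here are placeholders).
etcdctl backup \
  --data-dir /var/lib/etcd \
  --backup-dir /var/lib/etcd-backup

# 2. Start a single-node cluster from the backup. --force-new-cluster
#    discards the old membership and re-forms a one-member cluster
#    while keeping the key-value data.
etcd --data-dir /var/lib/etcd-backup --force-new-cluster

# 3. Once that node is healthy, grow back to three members with
#    `etcdctl member remove`/`member add` and
#    --initial-cluster-state existing on each new node.
```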
This cluster is production and it's currently down. What's the next step?