[Feature] Recover ETCD multinode cluster from transient quorum loss #436
Comments
This feature is addressed by gardener/etcd-backup-restore#528 on the ETCDBR side.
I tested transient quorum loss scenarios with an Etcd-Druid release.
Given the steps mentioned above, it's unclear why a transient quorum loss is currently not covered and under which circumstances it causes problems. Can you please explain your test path(s)?
Yes, as long as the PVCs of a majority of etcd cluster members are intact and no data-dir corruption happens to those members, the etcd cluster can recover from transient quorum loss without any intervention.
Closing this issue as transient quorum loss is handled (refer: #436 (comment)) |
Feature (What you would like to be added):
Transient quorum loss in a multinode ETCD cluster happens when a majority (>= n/2 + 1) of ETCD pods can't join the cluster due to network errors, pod scheduling errors, high CPU/memory usage, etc. Transient quorum loss generally lasts for a short period of time, and the ETCD cluster remains unavailable during that time. But when the failed pods restart properly, they should rejoin the cluster and make the ETCD cluster available as before.
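For illustration only, a minimal Go sketch of the quorum arithmetic (not code from this repository): a cluster of size n needs floor(n/2) + 1 healthy members to stay available, so it loses quorum exactly when a majority of pods can't join.

```go
package main

import "fmt"

// quorum returns the minimum number of healthy members an etcd
// cluster of size n needs to stay available: floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("cluster size %d: quorum = %d, tolerated failures = %d\n",
			n, quorum(n), n-quorum(n))
	}
}
```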
Motivation (Why is this needed?):
ETCD multinode clusters natively support recovery from a transient quorum loss, but this recovery currently does not work with our implementation of ETCD backup-restore. Backup-Restore currently takes extra action to restore a single node and scale the ETCD cluster up, and that action blocks the normal path of recovering from transient quorum loss. So we need this feature so that recovery from transient quorum loss, single-node restoration, and scale-up all work.
Approach/Hint to implement the solution (optional):
If an ETCD pod restarts and it has a valid ETCD data directory, let it join the cluster as per its ETCD config.
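A rough sketch of that decision in Go, assuming a hypothetical `hasValidDataDir` check and an illustrative data-dir path (neither is taken from etcd-backup-restore's actual code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// hasValidDataDir is a simplified, hypothetical check: a real
// implementation would also verify WAL and database integrity,
// not just the presence of the member directories.
func hasValidDataDir(dataDir string) bool {
	for _, sub := range []string{"member/snap", "member/wal"} {
		if _, err := os.Stat(filepath.Join(dataDir, sub)); err != nil {
			return false
		}
	}
	return true
}

func main() {
	dataDir := "/var/etcd/data/new.etcd" // illustrative path, not the project's actual layout

	if hasValidDataDir(dataDir) {
		// Normal recovery path: the member rejoins the cluster
		// using its existing etcd configuration.
		fmt.Println("valid data dir found: joining cluster as per etcd config")
	} else {
		// Fallback path: trigger single-member restoration from backup.
		fmt.Println("no valid data dir: falling back to restoration / scale-up path")
	}
}
```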