[Feature] Handle quorum loss scenario by Druid in ETCD multinode #362
Comments
The scope of this issue is to restore a multi-node ETCD cluster after quorum is lost. The steps are as follows:
A PR (#382) has been raised to cover the steps mentioned above. In step 5, we scale ETCD up from 1 to the number of replicas specified in the CR. This step relies on the scale-up mechanism already implemented in ETCD BR (gardener/etcd-backup-restore#487), which should allow a multi-node cluster to scale up from 1 to 3 members. However, this does not work in coordination with the quorum loss handling that we are implementing: the members added during scale-up do not join the cluster. Instead, they run into errors that need to be resolved on the ETCD BR side, and I am currently investigating why these errors occur. The errors were captured while testing the quorum loss setup with our local ETCD cluster. This is a manual test so far. Once these tests run successfully, we will test the quorum loss scenario with a local Gardener setup. Steps of the manual test that we are doing with our local ETCD cluster:
As a mitigation plan until quorum loss handling is available, a DoD has to step in if quorum loss happens. They have to change the replicas in the ETCD CR to 1, scale down the ETCD statefulset to zero, delete all of the PVCs, and then scale the statefulset back up to 1.
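The manual mitigation steps can be sketched as an ordered list of `kubectl` invocations. This is only an illustration that builds the commands without running them. The function name, the `etcd` resource name for the CR, the assumption that the statefulset has the same name as the ETCD CR, and the explicitly passed PVC names are all assumptions, not taken from Druid.

```python
import json
import shlex


def mitigation_commands(etcd_name, namespace, pvc_names):
    """Build the kubectl commands for the manual mitigation, in order.

    Assumption: the statefulset is named like the ETCD CR, and the caller
    knows which PVCs belong to the cluster.
    """
    if not pvc_names:
        raise ValueError("pvc_names must list the PVCs to delete")
    ns = ["-n", namespace]
    patch = json.dumps({"spec": {"replicas": 1}})
    return [
        # 1. Set replicas in the ETCD CR to 1.
        ["kubectl", "patch", "etcd", etcd_name, *ns, "--type", "merge", "-p", patch],
        # 2. Scale the statefulset down to zero.
        ["kubectl", "scale", "statefulset", etcd_name, *ns, "--replicas=0"],
        # 3. Delete all PVCs of the cluster.
        ["kubectl", "delete", "pvc", *pvc_names, *ns],
        # 4. Scale the statefulset back up to one replica.
        ["kubectl", "scale", "statefulset", etcd_name, *ns, "--replicas=1"],
    ]


# Hypothetical names, for illustration only.
for cmd in mitigation_commands("etcd-main", "default", ["pvc-0", "pvc-1", "pvc-2"]):
    print(shlex.join(cmd))
```

Printing the commands instead of executing them lets the operator review each step before it touches the cluster.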
Do we really need to delete the volumes every time? Quorum loss can happen for different reasons, e.g.
Feature (What you would like to be added):
We need to reconcile the ETCD cluster in case of quorum loss.
Motivation (Why is this needed?):
A cluster of size n needs ⌊n/2⌋+1 healthy members to serve requests. When fewer members than that are up (for example, 2 of 3 members are down), the cluster becomes unavailable. This situation is called quorum loss. To recover from a quorum loss scenario, ETCD Druid needs to delete the ETCD statefulset and redeploy it.
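The quorum arithmetic behind this can be sketched as follows. This is a minimal illustration, not Druid's actual check; the function names `quorum_size` and `has_quorum` are hypothetical.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of healthy members that still forms a majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, healthy_members: int) -> bool:
    """True if the healthy members form a majority of the cluster."""
    return healthy_members >= quorum_size(cluster_size)


# A 3-member cluster tolerates one failed member, but not two.
print(has_quorum(3, 2))  # True
print(has_quorum(3, 1))  # False
```

Note that an even-sized cluster gains no extra fault tolerance: a 4-member cluster needs 3 healthy members, so it also tolerates only one failure.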
Approach/Hint to implement the solution (optional):
ETCD Druid will delete the statefulset and redeploy it with as many replicas as specified in the ETCD CR.
Flow:
If quorum loss is detected, Druid will put an annotation on the ETCD CR to indicate that the cluster is being restarted.