Describe the bug:
It has been observed that, while scaling up a non-HA etcd cluster to an HA cluster (1 -> 3 replicas), a restart of etcd's first member pod (for any reason) can lead to permanent quorum loss. This happens when the second member has already joined the cluster successfully but the third member has not joined or has not even started yet.
At that point the first and third members are down at the same time, which results in permanent quorum loss.
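For reference, the quorum arithmetic behind this (standard Raft majority math, sketched in Go; not taken from the etcd-druid code):

```go
package main

import "fmt"

// quorumSize returns the majority needed by an etcd (Raft) cluster of n members.
func quorumSize(n int) int {
	return n/2 + 1
}

func main() {
	// During the 1 -> 3 scale-up described above the desired cluster size is 3,
	// so quorum is 2. With the first member restarting and the third member not
	// yet started, only the second member is healthy: 1 < 2, quorum is lost.
	desired, healthy := 3, 1
	fmt.Printf("quorum=%d healthy=%d quorumLost=%v\n",
		quorumSize(desired), healthy, healthy < quorumSize(desired))
}
```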
etcd CR status:
status:
  clusterSize: 1
  conditions:
  - lastTransitionTime: "2023-07-11T10:16:09Z"
    lastUpdateTime: "2023-07-11T11:12:40Z"
    message: At least one member is not ready
    reason: NotAllMembersReady
    status: "False"
    type: AllMembersReady
  - lastTransitionTime: "2023-07-11T10:29:39Z"
    lastUpdateTime: "2023-07-11T11:12:40Z"
    message: Stale snapshot leases. Not renewed in a long time
    reason: BackupFailed
    status: "False"
    type: BackupReady
  - lastTransitionTime: "2023-07-11T10:16:09Z"
    lastUpdateTime: "2023-07-11T11:12:40Z"
    message: The majority of ETCD members is not ready
    reason: QuorumLost
    status: "False"
    type: Ready
  currentReplicas: 1
  etcd:
    apiVersion: apps/v1
    kind: StatefulSet
    name: etcd-main
  members:
  - id: bd585ad8a06f8cfb
    lastTransitionTime: "2023-07-11T10:16:09Z"
    name: etcd-main-0
    reason: UnknownGracePeriodExceeded
    role: Leader
    status: NotReady
  observedGeneration: 1
  ready: false
  replicas: 1
  serviceName: etcd-main-client
  updatedReplicas: 2
Expected behaviour:
We cannot control restarts of an etcd member pod, since a restart can be caused by an infrastructure issue or some other side effect, but we can prevent this permanent quorum loss from happening.
Logs: backup-restore logs of the etcd-main-0 pod:
2023-07-11T11:10:59.231764125Z stderr F time="2023-07-11T11:10:59Z" level=error msg="Failed to connect to etcd KV client: context deadline exceeded" actor=member-add
2023-07-11T11:10:59.231770196Z stderr F time="2023-07-11T11:10:59Z" level=error msg="unable to check presence of member in cluster: context deadline exceeded" actor=member-add
2023-07-11T11:10:59.235463591Z stderr F time="2023-07-11T11:10:59Z" level=info msg="Etcd cluster scale-up is detected" actor=initializer
.
.
.
2023-07-11T11:12:33.848388409Z stderr F time="2023-07-11T11:12:33Z" level=fatal msg="unable to add a learner in a cluster: error while adding member as a learner: context deadline exceeded" actor=initializer
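The fatal `unable to add a learner` error is consistent with the fact that adding a learner is a cluster configuration change and must be committed by a quorum. A minimal sketch using the upstream `go.etcd.io/etcd/client/v3` API (not the actual backup-restore code; the client endpoint and peer URL are illustrative placeholders) shows where such a call would hit the context deadline:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Client service name taken from the status above; port is the etcd default.
		Endpoints:   []string{"http://etcd-main-client:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// MemberAddAsLearner is a cluster configuration change and needs a quorum to
	// commit. With members 1 and 3 down the change can never be committed, so the
	// context deadline expires, matching the "context deadline exceeded" above.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// The peer URL below is only an example, not the real Gardener peer address.
	if _, err := cli.MemberAddAsLearner(ctx, []string{"http://etcd-main-0.etcd-main-peer:2380"}); err != nil {
		log.Fatalf("unable to add a learner in a cluster: %v", err)
	}
}
```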
Environment (please complete the following information):
Anything else we need to know?:
This happened because the scale-up annotation gardener.cloud/scaled-to-multi-node was present on the etcd StatefulSet. It was added by etcd-druid when the cluster was marked for scale-up, and from etcd-druid's perspective the scale-up has not completed yet, so it will not remove the annotation until the scale-up finishes.
IMO, the problem occurs because the restarted first etcd cluster member also detects a scale-up (as the logs show) due to the gardener.cloud/scaled-to-multi-node annotation on the StatefulSet. This passes false information to the first member, telling it that it too should be added as a learner, so it takes the wrong code path.
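A minimal sketch of the suspected decision path, assuming the initializer branches on that StatefulSet annotation (illustrative Go, not the real etcd-backup-restore implementation; function and return-value names are made up):

```go
package sketch

// decideStartupPath is a simplified view of the suspected decision path; it is
// not the actual etcd-backup-restore code.
func decideStartupPath(stsAnnotations map[string]string) string {
	// etcd-druid keeps this annotation until the scale-up has completed, so a
	// restarting etcd-main-0 still sees it and assumes it is part of the scale-up.
	if _, scaleUp := stsAnnotations["gardener.cloud/scaled-to-multi-node"]; scaleUp {
		// Wrong path for the first member: it tries to add itself as a learner,
		// which requires a quorum that no longer exists.
		return "add-self-as-learner"
	}
	// Intended path for a plain restart: validate the data directory and start etcd.
	return "validate-data-dir-and-start"
}
```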
Proposed solution:
IMO, the first etcd cluster member can never be part of a scale-up scenario, so there is no reason for backup-restore to evaluate the scale-up detection conditions for it. In case of a restart of the first member, backup-restore can simply skip the scale-up detection and move on to data-dir validation and the remaining checks.
By doing this, a restart remains just a restart, the first member avoids taking the wrong path, and the issue turns from permanent quorum loss into transient quorum loss 😄 (see the sketch below).
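A hedged sketch of what that skip could look like (again illustrative Go, not a patch against the real code; deriving the member index from the pod-name ordinal is an assumption):

```go
package sketch

import "strings"

// isFirstMember infers the member index from the pod ordinal, e.g. "etcd-main-0".
func isFirstMember(podName string) bool {
	return strings.HasSuffix(podName, "-0")
}

// decideStartupPath skips scale-up detection entirely for the first member, so a
// restart of etcd-main-0 falls through to data-dir validation instead of the
// learner path; quorum is then lost only for as long as the restart takes.
func decideStartupPath(podName string, stsAnnotations map[string]string) string {
	_, scaleUp := stsAnnotations["gardener.cloud/scaled-to-multi-node"]
	if scaleUp && !isFirstMember(podName) {
		return "add-self-as-learner"
	}
	return "validate-data-dir-and-start"
}
```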