🏃[KCP] Recover from a manual machine deletion #2841
Conversation
The only place we can detect that a machine is missing is during the etcd healthcheck, by comparing etcd members against the nodes in the workload cluster. So doing this check every time etcd health is checked makes sense IMO, but I need some feedback on this.
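As a rough illustration of that comparison (a minimal sketch in Go, not the actual KCP code; membersWithoutNodes, memberNames, and nodeNames are hypothetical names introduced here), a member whose Machine was deleted out of band shows up as an etcd member with no matching Node:

// Hypothetical sketch: find etcd members that no longer have a matching Node.
// memberNames would come from the etcd client's member list and nodeNames from
// the workload cluster's Node list; neither name exists in the real KCP code.
func membersWithoutNodes(memberNames, nodeNames []string) []string {
	nodes := make(map[string]struct{}, len(nodeNames))
	for _, n := range nodeNames {
		nodes[n] = struct{}{}
	}
	var orphaned []string
	for _, m := range memberNames {
		if _, ok := nodes[m]; !ok {
			// This member's Machine/Node was removed manually.
			orphaned = append(orphaned, m)
		}
	}
	return orphaned
}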
I opened another PR to perform health checks before scale up/down in the reconcile logic. Will continue on this PR once it is merged.
I don't have any other comments – I would /lgtm but I also don't have the permission bit for that to do anything :)
@sethp-nr you should be able to lgtm :)
/assign @benmoss
/milestone v0.3.4
I've been seeing this flake a lot, should we file an issue to look at it separately?
/test pull-cluster-api-capd-e2e
@vincepri Yes, I am opening an issue about this. I think it is related to timeouts.
Lots of nits about wording and capitalizing etcd, and just one real complaint about the nested for loops in ReconcileEtcdMembers.
// Wait for any delete in progress to complete before deleting another Machine
if controlPlane.HasDeletingMachine() {
	return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
}
// We don't want to health check at the beginning of this method to avoid blocking re-entrancy |
I don't think we should remove this before making sure that moving the healthcheck to the beginning of this function does not break anything related to machine deletion. We can follow up on this in a new issue.
I don't understand: you want to leave the comment as a reminder that if things break, it's because of the healthcheck?
Either we should move the comment back up to where it was, or remove it altogether.
test/infrastructure/docker/e2e/docker_machine_remediation_test.go
also cc @JoelSpeed @enxebre
/test pull-cluster-api-capd-e2e
@sedefsavas when you have a chance, can you make sure to rebase on master?
I probably won't be able to dedicate a solid enough chunk of time to it
I can queue it up for a review tomorrow morning
/assign @detiber
LGTM, I'll leave final signoff to @detiber
/lgtm
A few tangential items that could be handled separately:
- it might be good if we could differentiate the various failure/error conditions, so that we only reconcile membership if the failure is related to mismatched machine/etcd membership
- we don't currently differentiate between an etcd static pod that hasn't yet been configured vs one that is crash looping (see the sketch after the snippet below):
cluster-api/controlplane/kubeadm/internal/workload_cluster_etcd.go, lines 69 to 77 in a19b557:
if err := checkStaticPodReadyCondition(pod); err != nil {
	// Nothing wrong here, etcd on this node is just not running.
	// If it's a true failure the healthcheck will fail since it won't have checked enough members.
	continue
}
// Only expect a member reports healthy if its pod is ready.
// This fixes the known state where the control plane has a crash-looping etcd pod that is not part of the
// etcd cluster.
expectedMembers++
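One way the two states could be told apart (a hedged sketch using the standard Kubernetes API types, not something this PR adds; isCrashLooping is a hypothetical helper) is to inspect the container statuses: a pod that was never configured has no waiting/restart history, while a crash-looping one reports CrashLoopBackOff:

import corev1 "k8s.io/api/core/v1"

// isCrashLooping is a hypothetical helper, not part of this PR: it reports
// whether any container in the etcd static pod is in CrashLoopBackOff,
// as opposed to simply not having been configured/started yet.
func isCrashLooping(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}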
/hold cancel
What this PR does / why we need it:
This PR adds KCP logic to recover from a manual machine deletion. When an etcd member with a missing machine is detected, it is removed from the etcd member list and from the kubeadm ConfigMap.
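For context, the etcd side of that cleanup roughly corresponds to a MemberRemove call against the cluster. The sketch below is only an approximation under stated assumptions, not the KCP implementation: it uses go.etcd.io/etcd/clientv3 directly, the endpoints and member ID are assumed to be already known, and the kubeadm ConfigMap edit is left as a comment.

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// removeOrphanedMember is a hypothetical sketch of the recovery step described
// above: it drops an etcd member whose Machine was deleted manually. The
// endpoints and memberID lookup are assumptions for illustration only.
func removeOrphanedMember(ctx context.Context, endpoints []string, memberID uint64) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Remove the stale member from the etcd cluster.
	if _, err := cli.MemberRemove(ctx, memberID); err != nil {
		return err
	}
	// The kubeadm ClusterStatus ConfigMap would also need its entry for the
	// deleted node removed; that step is omitted in this sketch.
	return nil
}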
Which issue(s) this PR fixes
Fixes #2818