Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🏃[KCP] Recover from a manual machine deletion #2841

Merged
merged 2 commits into from
Apr 14, 2020

Conversation

sedefsavas
Copy link

What this PR does / why we need it:
This PR adds KCP logic to recover from a manual machine deletion. When an ETCD member with missing machine is detected, delete it from the ETCD members and from Kubeadm configmap.

Which issue(s) this PR fixes
Fixes #2818

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 1, 2020
@k8s-ci-robot k8s-ci-robot requested review from chuckha and vincepri April 1, 2020 16:30
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 1, 2020
@sedefsavas
Copy link
Author

The only place we understand if the machine is missing is during ETCD healthcheck by comparing ETCD members and nodes in the workload cluster. So, doing this check every time ETCD health is checked makes sense IMO but need some feedback on this.

@sedefsavas
Copy link
Author

I opened another PR to perform health checks before scale up/down in reconcile logic.
#2849

Will continue on this PR ones it is merged.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 2, 2020
@sedefsavas sedefsavas force-pushed the kcpmachinedeletion branch from b0e67d6 to b5312bf Compare April 2, 2020 19:20
@sedefsavas sedefsavas force-pushed the kcpmachinedeletion branch from b5312bf to 5f0b80a Compare April 2, 2020 21:04
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2020
@sedefsavas sedefsavas changed the title [WIP] 🏃[KCP] Recover from a manual machine deletion 🏃[KCP] Recover from a manual machine deletion Apr 2, 2020
@sedefsavas sedefsavas force-pushed the kcpmachinedeletion branch from 5f0b80a to 9c105fc Compare April 2, 2020 21:08
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 2, 2020
@sedefsavas sedefsavas force-pushed the kcpmachinedeletion branch from 9c105fc to feade58 Compare April 2, 2020 21:12
@sedefsavas
Copy link
Author

@vincepri @sethp-nr Feel free to add any other comments you have for this PR. It is no longer waiting for the PR I mentioned above.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2020
@sedefsavas sedefsavas force-pushed the kcpmachinedeletion branch from feade58 to ad05b58 Compare April 6, 2020 18:22
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 6, 2020
@sethp-nr
Copy link
Contributor

sethp-nr commented Apr 6, 2020

I don't have any other comments – I would /lgtm but I also don't have the permission bit for that to do anything :)

@vincepri
Copy link
Member

vincepri commented Apr 6, 2020

@sethp-nr you should be able to lgtm :)

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/assign @benmoss

controlplane/kubeadm/controllers/controller.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/workload_cluster_etcd.go Outdated Show resolved Hide resolved
@vincepri
Copy link
Member

vincepri commented Apr 6, 2020

/milestone v0.3.4

@vincepri
Copy link
Member

vincepri commented Apr 9, 2020

 Error: Unexpected non-nil/non-zero extra argument at index 1:
  	<*errors.errorString>: &errors.errorString{s:"old nodes remain"}

I've been seeing this flake a lot, should we file an issue to look at it separately?
cc @sedefsavas @fabriziopandini

@vincepri
Copy link
Member

vincepri commented Apr 9, 2020

/test pull-cluster-api-capd-e2e

@sedefsavas
Copy link
Author

@vincepri Yes I am opening an issue about this. I think it is related to timeouts.

@sedefsavas
Copy link
Author

Waiting for @benmoss if he has any comments before squashing.
Thanks @benmoss and @vincepri for your comments.

Copy link

@benmoss benmoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of nits about wording and capitalizing etcd and just one real complaint about the nested for loops in ReconcileEtcdMembers

controlplane/kubeadm/controllers/controller.go Outdated Show resolved Hide resolved
// Wait for any delete in progress to complete before deleting another Machine
if controlPlane.HasDeletingMachine() {
return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
}

// We don't want to health check at the beginning of this method to avoid blocking re-entrancy
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// We don't want to health check at the beginning of this method to avoid blocking re-entrancy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should remove this before making sure moving healthcheck at the beginning of this function does not break anything related to machine deletion. We can follow up on this in a new issue.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, you want to leave the comment as a reminder that if things break it's because of the healthcheck?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either we should move the comment back up there where it was, or remove it altogether

controlplane/kubeadm/internal/cluster.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/workload_cluster_etcd.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/workload_cluster_etcd.go Outdated Show resolved Hide resolved
controlplane/kubeadm/internal/workload_cluster_etcd.go Outdated Show resolved Hide resolved
@vincepri
Copy link
Member

also cc @JoelSpeed @enxebre

@sedefsavas
Copy link
Author

/test pull-cluster-api-capd-e2e

1 similar comment
@sedefsavas
Copy link
Author

/test pull-cluster-api-capd-e2e

@vincepri
Copy link
Member

@sedefsavas when you have a chance, can you make sure to rebase on master?

@vincepri
Copy link
Member

@ncdc @detiber if you have some spare time for a review, I'd love more 👀 on this one

@ncdc
Copy link
Contributor

ncdc commented Apr 13, 2020

I probably won't be able to dedicate a solid enough chunk of time to it ☹️

@detiber
Copy link
Member

detiber commented Apr 13, 2020

I can queue it up for a review tomorrow morning

@vincepri
Copy link
Member

/assign @detiber

@benmoss
Copy link

benmoss commented Apr 14, 2020

LGTM, I'll leave final signoff to @detiber

@benmoss
Copy link

benmoss commented Apr 14, 2020

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Apr 14, 2020
Copy link
Member

@detiber detiber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few tangential items that could be handled as a separate items:

  • it might be good if we could differentiate some of the different failures/error conditions, so that we only reconcile membership if the error/failure is related to mismatched machine/etcd membership
  • we don't currently differentiate between an etcd static pod that hasn't yet been configured vs one that is crash looping:
    if err := checkStaticPodReadyCondition(pod); err != nil {
    // Nothing wrong here, etcd on this node is just not running.
    // If it's a true failure the healthcheck will fail since it won't have checked enough members.
    continue
    }
    // Only expect a member reports healthy if its pod is ready.
    // This fixes the known state where the control plane has a crash-looping etcd pod that is not part of the
    // etcd cluster.
    expectedMembers++

@detiber
Copy link
Member

detiber commented Apr 14, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 14, 2020
@k8s-ci-robot k8s-ci-robot merged commit 65ea933 into kubernetes-sigs:master Apr 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Control Plane Provider should reconcile by removing etcd member for voluntary machine disruptions e.g deletion
7 participants