🏃[KCP] Recover from a manual machine deletion #2841
Conversation
The only place we can detect that a machine is missing is during the etcd healthcheck, by comparing etcd members against the nodes in the workload cluster. So doing this check every time etcd health is checked makes sense IMO, but I need some feedback on this.
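As a rough illustration of that comparison (a minimal sketch in Go, not the actual KCP code; membersWithoutNodes, memberNames, and nodeNames are hypothetical names introduced here), a member whose Machine was deleted out of band shows up as an etcd member with no matching Node:

// Hypothetical sketch: find etcd members that no longer have a matching Node.
// memberNames would come from the etcd client's member list and nodeNames from
// the workload cluster's Node list; neither name exists in the real KCP code.
func membersWithoutNodes(memberNames, nodeNames []string) []string {
	nodes := make(map[string]struct{}, len(nodeNames))
	for _, n := range nodeNames {
		nodes[n] = struct{}{}
	}
	var orphaned []string
	for _, m := range memberNames {
		if _, ok := nodes[m]; !ok {
			// This member's Machine/Node was removed manually.
			orphaned = append(orphaned, m)
		}
	}
	return orphaned
}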
I opened another PR to perform health checks before scale up/down in the reconcile logic. Will continue on this PR once it is merged.
I don't have any other comments – I would /lgtm but I also don't have the permission bit for that to do anything :)
@sethp-nr you should be able to lgtm :)
/assign @benmoss
/milestone v0.3.4
I've been seeing this flake a lot, should we file an issue to look at it separately?
/test pull-cluster-api-capd-e2e
@vincepri Yes, I am opening an issue about this. I think it is related to timeouts.
Lots of nits about wording and capitalizing etcd, and just one real complaint about the nested for loops in ReconcileEtcdMembers.
// Wait for any delete in progress to complete before deleting another Machine
if controlPlane.HasDeletingMachine() {
	return ctrl.Result{}, &capierrors.RequeueAfterError{RequeueAfter: deleteRequeueAfter}
}
// We don't want to health check at the beginning of this method to avoid blocking re-entrancy |
I don't think we should remove this before making sure that moving the healthcheck to the beginning of this function does not break anything related to machine deletion. We can follow up on this in a new issue.
I don't understand: you want to leave the comment as a reminder that if things break, it's because of the healthcheck?
Either we should move the comment back up to where it was, or remove it altogether.
test/infrastructure/docker/e2e/docker_machine_remediation_test.go
also cc @JoelSpeed @enxebre
/test pull-cluster-api-capd-e2e
@sedefsavas when you have a chance, can you make sure to rebase on master?
I probably won't be able to dedicate a solid enough chunk of time to it
I can queue it up for a review tomorrow morning
/assign @detiber
LGTM, I'll leave final signoff to @detiber
/lgtm
A few tangential items that could be handled separately:
- it might be good if we could differentiate the various failure/error conditions, so that we only reconcile membership if the failure is related to mismatched machine/etcd membership
- we don't currently differentiate between an etcd static pod that hasn't yet been configured vs one that is crash looping (see the sketch after the snippet below):
cluster-api/controlplane/kubeadm/internal/workload_cluster_etcd.go, lines 69 to 77 in a19b557:
if err := checkStaticPodReadyCondition(pod); err != nil {
	// Nothing wrong here, etcd on this node is just not running.
	// If it's a true failure the healthcheck will fail since it won't have checked enough members.
	continue
}
// Only expect a member reports healthy if its pod is ready.
// This fixes the known state where the control plane has a crash-looping etcd pod that is not part of the
// etcd cluster.
expectedMembers++
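One way the two states could be told apart (a hedged sketch using the standard Kubernetes API types, not something this PR adds; isCrashLooping is a hypothetical helper) is to inspect the container statuses: a pod that was never configured has no waiting/restart history, while a crash-looping one reports CrashLoopBackOff:

import corev1 "k8s.io/api/core/v1"

// isCrashLooping is a hypothetical helper, not part of this PR: it reports
// whether any container in the etcd static pod is in CrashLoopBackOff,
// as opposed to simply not having been configured/started yet.
func isCrashLooping(pod *corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && cs.State.Waiting.Reason == "CrashLoopBackOff" {
			return true
		}
	}
	return false
}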
/hold cancel
What this PR does / why we need it:
This PR adds KCP logic to recover from a manual machine deletion. When an etcd member with a missing machine is detected, it is removed from the etcd member list and from the kubeadm ConfigMap.
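For context, the etcd side of that cleanup roughly corresponds to a MemberRemove call against the cluster. The sketch below is only an approximation under stated assumptions, not the KCP implementation: it uses go.etcd.io/etcd/clientv3 directly, the endpoints and member ID are assumed to be already known, and the kubeadm ConfigMap edit is left as a comment.

import (
	"context"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// removeOrphanedMember is a hypothetical sketch of the recovery step described
// above: it drops an etcd member whose Machine was deleted manually. The
// endpoints and memberID lookup are assumptions for illustration only.
func removeOrphanedMember(ctx context.Context, endpoints []string, memberID uint64) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	// Remove the stale member from the etcd cluster.
	if _, err := cli.MemberRemove(ctx, memberID); err != nil {
		return err
	}
	// The kubeadm ClusterStatus ConfigMap would also need its entry for the
	// deleted node removed; that step is omitted in this sketch.
	return nil
}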
Which issue(s) this PR fixes
Fixes #2818