
KCP waiting on remote cluster client hangs worker, substantially delaying cluster deletion #2612

Closed
randomvariable opened this issue Mar 10, 2020 · 12 comments
Assignees: wfernandes
Labels: area/control-plane, kind/bug, lifecycle/active
Milestone: v0.3.x

Comments

@randomvariable
Member

What steps did you take and what happened:

In CAPV, I created a new cluster and then deleted it. With the default of 1 worker, the KCP controller seemingly hangs after the machines are deleted, eventually printing the following (I have added some extra log lines in my branch to trace what's going on):

[manager] I0310 08:30:06.521810       8 controller.go:171] controller-runtime/controller "msg"="Starting Controller"  "controller"="kubeadmcontrolplane"
[manager] I0310 08:30:06.521864       8 controller.go:190] controller-runtime/controller "msg"="Starting workers"  "controller"="kubeadmcontrolplane" "worker count"=1
[manager] I0310 08:31:05.532426       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:05.532479       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:05.532492       8 kubeadm_control_plane_controller.go:751] controllers/KubeadmControlPlane "msg"="Non control plane machines exist and must be removed before control plane machines are removed" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:35.685832       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:35.685914       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:31:48.434291       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:31:48.434861       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:31:49.435353       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:49.435468       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:32:02.432420       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:32:02.440652       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:32:03.441244       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:32:03.441342       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:32:16.436936       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:32:16.437515       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:32:17.437983       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:32:17.438080       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:19.961771       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="Get \"https://192.168.192.187:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\": read tcp 10.244.0.31:34044-\u003e192.168.192.187:6443: read: connection timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:19.962149       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="Get \"https://192.168.192.187:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\": read tcp 10.244.0.31:34044-\u003e192.168.192.187:6443: read: connection timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:37:20.962520       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:37:20.962614       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:37:20.962639       8 kubeadm_control_plane_controller.go:744] controllers/KubeadmControlPlane "msg"="no machines, ok to delete" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:22.953251       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: failed to create client for Cluster default/test: Get \"https://192.168.192.187:6443/api?timeout=32s\": dial tcp 192.168.192.187:6443: connect: no route to host" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:22.986688       8 kubeadm_control_plane_controller.go:179] controllers/KubeadmControlPlane "msg"="Failed to patch KubeadmControlPlane" "error"="kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"test\" not found" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:22.986769       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="[failed to create remote cluster client: failed to create client for Cluster default/test: Get \"https://192.168.192.187:6443/api?timeout=32s\": dial tcp 192.168.192.187:6443: connect: no route to host, kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"test\" not found]"  "controller"="kubeadmcontrolplane" "request"=

What did you expect to happen:
Resources to be deleted within 60s.

Anything else you would like to add:
Increasing workers would be a suitable workaround for now. Finding a way to synchronise on a channel and cancel outstanding reconciliations on delete may be one potential optimisation.
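
A very rough sketch of that cancellation idea, with entirely hypothetical names (nothing below exists in cluster-api): a per-object registry of cancellable contexts that a delete event could use to abort an in-flight reconcile that is blocked on the workload cluster.

package reconcilecancel

import (
	"context"
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// inflightReconciles tracks a cancel function per object so a delete event
// can abort a reconcile that is stuck talking to an unreachable cluster.
type inflightReconciles struct {
	mu      sync.Mutex
	cancels map[types.NamespacedName]context.CancelFunc
}

func newInflightReconciles() *inflightReconciles {
	return &inflightReconciles{cancels: map[types.NamespacedName]context.CancelFunc{}}
}

// start returns a context for a new reconcile of key, cancelling any
// previous reconcile of the same object that is still running.
func (i *inflightReconciles) start(key types.NamespacedName) (context.Context, context.CancelFunc) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if cancel, ok := i.cancels[key]; ok {
		cancel()
	}
	ctx, cancel := context.WithCancel(context.Background())
	i.cancels[key] = cancel
	return ctx, cancel
}

// abort cancels the outstanding reconcile for key, e.g. when a deletion is
// observed, freeing the worker without waiting for network timeouts.
func (i *inflightReconciles) abort(key types.NamespacedName) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if cancel, ok := i.cancels[key]; ok {
		cancel()
		delete(i.cancels, key)
	}
}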

Environment:

  • Cluster-api version: 0.3.0-rc.3
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 10, 2020
@detiber
Member

detiber commented Mar 10, 2020

I wonder if it would make sense to move the deferred call to r.updateStatus after the call to r.reconcileDelete in Reconcile.
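
For illustration, a minimal sketch of that reordering (abbreviated, imports and error aggregation omitted; this is not the actual KCP code, and the helper signatures are assumed): the delete branch returns before the deferred status update is ever registered.

func (r *KubeadmControlPlaneReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	ctx := context.Background()

	kcp := &controlplanev1.KubeadmControlPlane{}
	if err := r.Client.Get(ctx, req.NamespacedName, kcp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	cluster, err := util.GetOwnerCluster(ctx, r.Client, kcp.ObjectMeta)
	if err != nil || cluster == nil {
		return ctrl.Result{}, err
	}

	// Handle deletion before the deferred status update is registered, so a
	// delete never blocks on a status call that needs the workload cluster.
	if !kcp.ObjectMeta.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, cluster, kcp)
	}

	// Only the normal path pays for the status update, which may need a
	// client to the workload cluster.
	defer func() {
		if err := r.updateStatus(ctx, kcp, cluster); err != nil {
			r.Log.Error(err, "Failed to update KubeadmControlPlane Status")
		}
	}()

	return r.reconcile(ctx, cluster, kcp)
}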

@detiber
Member

detiber commented Mar 10, 2020

We should probably also increase the default concurrency on KubeadmControlPlane reconciliation 😂
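
For context, the knob in question is controller-runtime's MaxConcurrentReconciles; wiring it up would look roughly like this (sketch only, import paths approximate, and the actual SetupWithManager may differ):

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1alpha3"
)

func (r *KubeadmControlPlaneReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&controlplanev1.KubeadmControlPlane{}).
		// More workers let other control planes make progress while one
		// reconcile is stuck waiting on an unreachable workload cluster.
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(r)
}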

@detiber
Member

detiber commented Mar 10, 2020

I don't think either of the fixes above is necessarily a release blocker for v0.3.0, but both are definitely candidates for a fast-follow patch release.

@vincepri
Member

/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Mar 10, 2020
@chuckha
Contributor

chuckha commented Mar 10, 2020

Increasing default concurrency is a different problem and won't help in this case, but +1 on increasing it to 10 regardless.

The reason it won't help is that even with higher concurrency, a controller still reconciles each object serially; it never reconciles the same object concurrently. Since each cluster has only one KCP resource, the hang remains a problem. It will, however, improve the situation when this controller is managing multiple control planes.

I'm +1 on adding some concurrency within the control loop. One goroutine per delete (i.e. one goroutine per node) sounds pretty reasonable. This would be similar to what we had originally (before health checks).
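
A rough illustration of that shape: a hypothetical helper (not the code that eventually landed) that issues the deletes in parallel and aggregates errors.

import (
	"context"
	"sync"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	kerrors "k8s.io/apimachinery/pkg/util/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteMachinesConcurrently deletes each Machine in its own goroutine so one
// slow or unreachable member cannot serialize the whole teardown.
func deleteMachinesConcurrently(ctx context.Context, c client.Client, machines []*clusterv1.Machine) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i := range machines {
		wg.Add(1)
		go func(m *clusterv1.Machine) {
			defer wg.Done()
			if err := c.Delete(ctx, m); err != nil && !apierrors.IsNotFound(err) {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(machines[i])
	}
	wg.Wait()
	return kerrors.NewAggregate(errs)
}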

@detiber
Member

detiber commented Mar 10, 2020

The concurrency comment was more of an aside. I do think we need to consider moving the call to updateStatus to a defer that is configured after reconcileDelete has been processed. If we still need a status update during reconcileDelete, we can probably add an additional defer for it to reconcileDelete where we can add more safety around failures communicating with the workload cluster.
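
If a status update is still wanted while deleting, the extra defer could be best-effort, roughly like this (illustrative only; not the actual reconcileDelete, and the updateStatus signature is assumed):

func (r *KubeadmControlPlaneReconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Cluster, kcp *controlplanev1.KubeadmControlPlane) (ctrl.Result, error) {
	// Best-effort status update: failures reaching the workload cluster are
	// logged but never block or fail the deletion itself.
	defer func() {
		if err := r.updateStatus(ctx, kcp, cluster); err != nil {
			r.Log.Error(err, "Failed to update status during deletion, continuing")
		}
	}()

	// ... existing delete logic: wait for worker machines to go away, then
	// delete control plane machines and remove the finalizer ...
	return ctrl.Result{}, nil
}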

@wfernandes
Contributor

wfernandes commented Mar 16, 2020

I'm thinking of investigating this further. Has anyone seen this behavior with CAPA?
Asking because I do not have access to any vSphere infrastructure and was going to see if I can reproduce this with CAPA.

@detiber
Member

detiber commented Mar 16, 2020

@wfernandes you should be able to replicate by attempting to delete prior to the first control plane Machine becoming ready.

@wfernandes
Contributor

/assign

@wfernandes
Contributor

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Mar 20, 2020
@wfernandes
Contributor

It seems that this issue was fixed as part of this PR: #2708.
/close

@k8s-ci-robot
Contributor

@wfernandes: Closing this issue.

In response to this:

It seems that this issue was fixed as part of this PR: #2708.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vincepri vincepri added area/control-plane Issues or PRs related to control-plane lifecycle management area/clusterctl Issues or PRs related to clusterctl and removed area/clusterctl Issues or PRs related to clusterctl labels Mar 23, 2020