
KCP waiting on remote cluster client hangs worker, substantially delaying cluster deletion #2612

Closed
randomvariable opened this issue Mar 10, 2020 · 12 comments
Assignees: wfernandes
Labels: area/control-plane, kind/bug, lifecycle/active
Milestone: v0.3.x

Comments

@randomvariable
Member

What steps did you take and what happened:

In CAPV, I created a new cluster and then deleted it. With the default of 1 worker, the KCP controller seemingly hangs after the machines are deleted, eventually printing the following (I have added some extra log lines in my branch to trace what's going on):

[manager] I0310 08:30:06.521810       8 controller.go:171] controller-runtime/controller "msg"="Starting Controller"  "controller"="kubeadmcontrolplane"
[manager] I0310 08:30:06.521864       8 controller.go:190] controller-runtime/controller "msg"="Starting workers"  "controller"="kubeadmcontrolplane" "worker count"=1
[manager] I0310 08:31:05.532426       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:05.532479       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:05.532492       8 kubeadm_control_plane_controller.go:751] controllers/KubeadmControlPlane "msg"="Non control plane machines exist and must be removed before control plane machines are removed" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:35.685832       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:35.685914       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:31:48.434291       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:31:48.434861       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:31:49.435353       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:31:49.435468       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:32:02.432420       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:32:02.440652       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:32:03.441244       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:32:03.441342       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:32:16.436936       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="etcdserver: request timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:32:16.437515       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="etcdserver: request timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:32:17.437983       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:32:17.438080       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:19.961771       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="Get \"https://192.168.192.187:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\": read tcp 10.244.0.31:34044-\u003e192.168.192.187:6443: read: connection timed out" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:19.962149       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="Get \"https://192.168.192.187:6443/api/v1/nodes?labelSelector=node-role.kubernetes.io%2Fmaster%3D\": read tcp 10.244.0.31:34044-\u003e192.168.192.187:6443: read: connection timed out"  "controller"="kubeadmcontrolplane" "request"={"Namespace":"default","Name":"test"}
[manager] I0310 08:37:20.962520       8 kubeadm_control_plane_controller.go:733] controllers/KubeadmControlPlane "msg"="deleting control plane, checking for machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:37:20.962614       8 kubeadm_control_plane_controller.go:739] controllers/KubeadmControlPlane "msg"="filtering machines" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] I0310 08:37:20.962639       8 kubeadm_control_plane_controller.go:744] controllers/KubeadmControlPlane "msg"="no machines, ok to delete" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:22.953251       8 kubeadm_control_plane_controller.go:173] controllers/KubeadmControlPlane "msg"="Failed to update KubeadmControlPlane Status" "error"="failed to create remote cluster client: failed to create client for Cluster default/test: Get \"https://192.168.192.187:6443/api?timeout=32s\": dial tcp 192.168.192.187:6443: connect: no route to host" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:22.986688       8 kubeadm_control_plane_controller.go:179] controllers/KubeadmControlPlane "msg"="Failed to patch KubeadmControlPlane" "error"="kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"test\" not found" "cluster"="test" "kubeadmControlPlane"="test" "namespace"="default" 
[manager] E0310 08:37:22.986769       8 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="[failed to create remote cluster client: failed to create client for Cluster default/test: Get \"https://192.168.192.187:6443/api?timeout=32s\": dial tcp 192.168.192.187:6443: connect: no route to host, kubeadmcontrolplanes.controlplane.cluster.x-k8s.io \"test\" not found]"  "controller"="kubeadmcontrolplane" "request"=

What did you expect to happen:
Resources to be deleted within 60s.

Anything else you would like to add:
Increasing workers would be a suitable workaround for now. Finding a way to synchronise on a channel and cancel outstanding reconciliations on delete may be one potential optimisation.
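
A very rough sketch of that cancellation idea, with entirely hypothetical names (nothing below exists in cluster-api): a per-object registry of cancellable contexts that a delete event could use to abort an in-flight reconcile that is blocked on the workload cluster.

package reconcilecancel

import (
	"context"
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// inflightReconciles tracks a cancel function per object so a delete event
// can abort a reconcile that is stuck talking to an unreachable cluster.
type inflightReconciles struct {
	mu      sync.Mutex
	cancels map[types.NamespacedName]context.CancelFunc
}

func newInflightReconciles() *inflightReconciles {
	return &inflightReconciles{cancels: map[types.NamespacedName]context.CancelFunc{}}
}

// start returns a context for a new reconcile of key, cancelling any
// previous reconcile of the same object that is still running.
func (i *inflightReconciles) start(key types.NamespacedName) (context.Context, context.CancelFunc) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if cancel, ok := i.cancels[key]; ok {
		cancel()
	}
	ctx, cancel := context.WithCancel(context.Background())
	i.cancels[key] = cancel
	return ctx, cancel
}

// abort cancels the outstanding reconcile for key, e.g. when a deletion is
// observed, freeing the worker without waiting for network timeouts.
func (i *inflightReconciles) abort(key types.NamespacedName) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if cancel, ok := i.cancels[key]; ok {
		cancel()
		delete(i.cancels, key)
	}
}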

Environment:

  • Cluster-api version: 0.3.0-rc.3
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 10, 2020
@detiber
Member

detiber commented Mar 10, 2020

I wonder if it would make sense to move the deferred call to r.updateStatus after the call to r.reconcileDelete in Reconcile.
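
For illustration, a minimal sketch of that reordering (abbreviated, imports and error aggregation omitted; this is not the actual KCP code, and the helper signatures are assumed): the delete branch returns before the deferred status update is ever registered.

func (r *KubeadmControlPlaneReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	ctx := context.Background()

	kcp := &controlplanev1.KubeadmControlPlane{}
	if err := r.Client.Get(ctx, req.NamespacedName, kcp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	cluster, err := util.GetOwnerCluster(ctx, r.Client, kcp.ObjectMeta)
	if err != nil || cluster == nil {
		return ctrl.Result{}, err
	}

	// Handle deletion before the deferred status update is registered, so a
	// delete never blocks on a status call that needs the workload cluster.
	if !kcp.ObjectMeta.DeletionTimestamp.IsZero() {
		return r.reconcileDelete(ctx, cluster, kcp)
	}

	// Only the normal path pays for the status update, which may need a
	// client to the workload cluster.
	defer func() {
		if err := r.updateStatus(ctx, kcp, cluster); err != nil {
			r.Log.Error(err, "Failed to update KubeadmControlPlane Status")
		}
	}()

	return r.reconcile(ctx, cluster, kcp)
}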

@detiber
Member

detiber commented Mar 10, 2020

We should probably also increase the default concurrency on KubeadmControlPlane reconciliation 😂
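
For context, the knob in question is controller-runtime's MaxConcurrentReconciles; wiring it up would look roughly like this (sketch only, import paths approximate, and the actual SetupWithManager may differ):

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1alpha3"
)

func (r *KubeadmControlPlaneReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&controlplanev1.KubeadmControlPlane{}).
		// More workers let other control planes make progress while one
		// reconcile is stuck waiting on an unreachable workload cluster.
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
		Complete(r)
}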

@detiber
Member

detiber commented Mar 10, 2020

I don't think either of the fixes above is necessarily a release blocker for v0.3.0, but both are definitely candidates for a fast-follow patch release.

@vincepri
Member

/milestone v0.3.x

@k8s-ci-robot k8s-ci-robot added this to the v0.3.x milestone Mar 10, 2020
@chuckha
Contributor

chuckha commented Mar 10, 2020

Increasing default concurrency is a different problem and won't help in this case, but +1 on increasing it to 10 regardless.

The reason it won't help is that even with higher concurrency, a controller still reconciles each object serially; it never reconciles the same object concurrently. Since each cluster has only one KCP resource, the hang remains a problem. It will, however, improve the situation when this controller is managing multiple control planes.

I'm +1 on adding some concurrency within the control loop. One goroutine per delete (i.e. one goroutine per node) sounds pretty reasonable. This would be similar to what we had originally (before health checks).
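
A rough illustration of that shape: a hypothetical helper (not the code that eventually landed) that issues the deletes in parallel and aggregates errors.

import (
	"context"
	"sync"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	kerrors "k8s.io/apimachinery/pkg/util/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha3"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteMachinesConcurrently deletes each Machine in its own goroutine so one
// slow or unreachable member cannot serialize the whole teardown.
func deleteMachinesConcurrently(ctx context.Context, c client.Client, machines []*clusterv1.Machine) error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i := range machines {
		wg.Add(1)
		go func(m *clusterv1.Machine) {
			defer wg.Done()
			if err := c.Delete(ctx, m); err != nil && !apierrors.IsNotFound(err) {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(machines[i])
	}
	wg.Wait()
	return kerrors.NewAggregate(errs)
}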

@detiber
Member

detiber commented Mar 10, 2020

The concurrency comment was more of an aside. I do think we need to consider moving the call to updateStatus to a defer that is configured after reconcileDelete has been processed. If we still need a status update during reconcileDelete, we can probably add an additional defer for it to reconcileDelete where we can add more safety around failures communicating with the workload cluster.
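
If a status update is still wanted while deleting, the extra defer could be best-effort, roughly like this (illustrative only; not the actual reconcileDelete, and the updateStatus signature is assumed):

func (r *KubeadmControlPlaneReconciler) reconcileDelete(ctx context.Context, cluster *clusterv1.Cluster, kcp *controlplanev1.KubeadmControlPlane) (ctrl.Result, error) {
	// Best-effort status update: failures reaching the workload cluster are
	// logged but never block or fail the deletion itself.
	defer func() {
		if err := r.updateStatus(ctx, kcp, cluster); err != nil {
			r.Log.Error(err, "Failed to update status during deletion, continuing")
		}
	}()

	// ... existing delete logic: wait for worker machines to go away, then
	// delete control plane machines and remove the finalizer ...
	return ctrl.Result{}, nil
}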

@wfernandes
Contributor

wfernandes commented Mar 16, 2020

I'm thinking of investigating this further. Has anyone seen this behavior with CAPA?
Asking because I do not have access to any vSphere infrastructure and was going to see if I can reproduce this with CAPA.

@detiber
Member

detiber commented Mar 16, 2020

@wfernandes you should be able to replicate by attempting to delete prior to the first control plane Machine becoming ready.

@wfernandes
Contributor

/assign

@wfernandes
Contributor

/lifecycle active

@k8s-ci-robot k8s-ci-robot added the lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor. label Mar 20, 2020
@wfernandes
Contributor

It seems that this issue was fixed as part of this PR: #2708.
/close

@k8s-ci-robot
Contributor

@wfernandes: Closing this issue.

In response to this:

It seems that this issue was fixed as part of this PR: #2708.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vincepri vincepri added area/control-plane Issues or PRs related to control-plane lifecycle management area/clusterctl Issues or PRs related to clusterctl and removed area/clusterctl Issues or PRs related to clusterctl labels Mar 23, 2020