KCP reconcile hangs when the workload cluster API Server is unreachable #8948

Closed
Levi080513 opened this issue Jul 3, 2023 · 20 comments · Fixed by #9028
Labels: kind/bug (Categorizes issue or PR as related to a bug.), triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

Comments

@Levi080513 (Contributor) commented Jul 3, 2023:

What steps did you take and what happened?

When upgrading KCP, I found that the KCP reconcile hangs. Even if the annotation is manually updated to trigger a reconcile, no reconcile log is printed.

KCP log

2023-06-29T18:33:38.306571117+08:00 stderr F E0629 10:33:38.300120       1 controller.go:329] "Reconciler error" err="[cannot get remote client to workload cluster: failed to create cluster accessor: error checking if we're running on workload cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": context deadline exceeded, failed to create remote cluster client: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)]" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=4f8f9888-0cbe-4f56-8fe8-2407662dbaa1
2023-06-29T18:33:38.306582944+08:00 stderr F I0629 10:33:38.300746       1 controller.go:276] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5696188f-a423-42be-aca0-22680ef26ceb Cluster="default/sks-mgmt"
2023-06-29T18:33:53.808921324+08:00 stderr F E0629 10:33:53.807069       1 controller.go:199] "Failed to update KubeadmControlPlane Status" err="failed to create remote cluster client: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": http2: client connection lost" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5696188f-a423-42be-aca0-22680ef26ceb Cluster="default/sks-mgmt"
2023-06-29T18:33:53.811961844+08:00 stderr F E0629 10:33:53.809752       1 controller.go:329] "Reconciler error" err="[cannot get remote client to workload cluster: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers), failed to create remote cluster client: failed to create cluster accessor: error creating dynamic rest mapper for remote cluster \"default/sks-mgmt\": Get \"https://10.255.125.6:6443/api?timeout=10s\": http2: client connection lost]" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5696188f-a423-42be-aca0-22680ef26ceb
2023-06-29T18:38:32.001485567+08:00 stderr F I0629 10:38:32.001239       1 controller.go:276] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5e663c6a-00cd-4d31-954f-336a10547d52 Cluster="default/sks-mgmt"
2023-06-29T18:38:33.763802075+08:00 stderr F I0629 10:38:33.762365       1 cluster_cache_tracker.go:298] "Creating cluster accessor for cluster \"default/sks-mgmt\" with the regular apiserver endpoint \"https://10.255.125.6:6443\"" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/sks-mgmt-controlplane" namespace="default" name="sks-mgmt-controlplane" reconcileID=5e663c6a-00cd-4d31-954f-336a10547d52 Cluster="default/sks-mgmt"
2023-06-29T18:39:26.54150753+08:00 stderr F W0629 10:39:26.540559       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:26.541584905+08:00 stderr F I0629 10:39:26.540763       1 trace.go:219] Trace[1245889233]: "Reflector ListAndWatch" name:k8s.io/[email protected]/tools/cache/reflector.go:169 (29-Jun-2023 10:39:16.536) (total time: 10003ms):
2023-06-29T18:39:26.541600266+08:00 stderr F Trace[1245889233]: ---"Objects listed" error:Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers) 10003ms (10:39:26.540)
2023-06-29T18:39:26.541605636+08:00 stderr F Trace[1245889233]: [10.003856562s] [10.003856562s] END
2023-06-29T18:39:26.54161313+08:00 stderr F E0629 10:39:26.540925       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:31.421271219+08:00 stderr F W0629 10:39:31.420913       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": http2: client connection lost
2023-06-29T18:39:31.421356584+08:00 stderr F E0629 10:39:31.420985       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": http2: client connection lost
2023-06-29T18:39:31.421402287+08:00 stderr F E0629 10:39:31.420994       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?allowWatchBookmarks=true&resourceVersion=35679197&timeout=10s&timeoutSeconds=413&watch=true": http2: client connection lost - error from a previous attempt: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:42.793292892+08:00 stderr F W0629 10:39:42.792403       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:42.793331013+08:00 stderr F I0629 10:39:42.792878       1 trace.go:219] Trace[664855972]: "Reflector ListAndWatch" name:k8s.io/[email protected]/tools/cache/reflector.go:169 (29-Jun-2023 10:39:32.791) (total time: 10001ms):
2023-06-29T18:39:42.793336977+08:00 stderr F Trace[664855972]: ---"Objects listed" error:Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 10001ms (10:39:42.792)
2023-06-29T18:39:42.793341802+08:00 stderr F Trace[664855972]: [10.001440542s] [10.001440542s] END
2023-06-29T18:39:42.793355345+08:00 stderr F E0629 10:39:42.792914       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:44.285365522+08:00 stderr F W0629 10:39:44.284282       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:44.285415251+08:00 stderr F I0629 10:39:44.284479       1 trace.go:219] Trace[333138094]: "Reflector ListAndWatch" name:k8s.io/[email protected]/tools/cache/reflector.go:169 (29-Jun-2023 10:39:34.282) (total time: 10001ms):
2023-06-29T18:39:44.285421731+08:00 stderr F Trace[333138094]: ---"Objects listed" error:Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) 10001ms (10:39:44.284)
2023-06-29T18:39:44.2854264+08:00 stderr F Trace[333138094]: [10.001639339s] [10.001639339s] END
2023-06-29T18:39:44.28543114+08:00 stderr F E0629 10:39:44.284504       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2023-06-29T18:39:49.43914295+08:00 stderr F W0629 10:39:49.438066       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:49.441353046+08:00 stderr F E0629 10:39:49.438156       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:55.58127205+08:00 stderr F W0629 10:39:55.579978       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:55.581324556+08:00 stderr F E0629 10:39:55.580061       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:55.581331354+08:00 stderr F W0629 10:39:55.580109       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:39:55.581337158+08:00 stderr F E0629 10:39:55.580196       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:04.806961247+08:00 stderr F W0629 10:40:04.802654       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:04.807016783+08:00 stderr F E0629 10:40:04.802774       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://10.255.125.6:6443/api/v1/nodes?resourceVersion=35679197&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:14.017608903+08:00 stderr F W0629 10:40:14.017353       1 reflector.go:424] k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:14.017674617+08:00 stderr F E0629 10:40:14.017427       1 reflector.go:140] k8s.io/[email protected]/tools/cache/reflector.go:169: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: Get "https://10.255.125.6:6443/apis/rbac.authorization.k8s.io/v1/clusterroles?limit=500&resourceVersion=0&timeout=10s": dial tcp 10.255.125.6:6443: connect: no route to host
2023-06-29T18:40:28.988010911+08:00 stderr F E0629 10:40:28.987542       1 cluster_cache_tracker.go:602] "remote/ClusterCacheTracker: Error health checking cluster" err="Get \"https://10.255.125.6:6443/?timeout=5s\": context deadline exceeded" Cluster="default/sks-mgmt"

I killed the KCP process with kill -SIGQUIT and found that the KCP reconcile was stuck in Workload.EnsureResource and would not exit.

2023-06-29T19:38:59.267600649+08:00 stderr F
2023-06-29T19:38:59.267604842+08:00 stderr F goroutine 484 [select]:
2023-06-29T19:38:59.267608916+08:00 stderr F runtime.gopark(0xc000ea8bc0?, 0x2?, 0xa?, 0xa9?, 0xc000ea8b9c?)
2023-06-29T19:38:59.267613041+08:00 stderr F    runtime/proc.go:363 +0xd6 fp=0xc000ea8a20 sp=0xc000ea8a00 pc=0x43cb76
2023-06-29T19:38:59.267616605+08:00 stderr F runtime.selectgo(0xc000ea8bc0, 0xc000ea8b98, 0x0?, 0x0, 0x2?, 0x1)
2023-06-29T19:38:59.267620318+08:00 stderr F    runtime/select.go:328 +0x7bc fp=0xc000ea8b60 sp=0xc000ea8a20 pc=0x44ca1c
2023-06-29T19:38:59.2676244+08:00 stderr F k8s.io/apimachinery/pkg/util/wait.WaitForWithContext({0x1f4b530, 0xc000c8e0c0}, 0xc000479f98, 0x108734a?)
2023-06-29T19:38:59.267628225+08:00 stderr F    k8s.io/[email protected]/pkg/util/wait/wait.go:660 +0xe5 fp=0xc000ea8bf8 sp=0xc000ea8b60 pc=0x1087185
2023-06-29T19:38:59.267632189+08:00 stderr F k8s.io/apimachinery/pkg/util/wait.poll({0x1f4b530, 0xc000c8e0c0}, 0x68?, 0x1086345?, 0xc0008095e0?)
2023-06-29T19:38:59.267650652+08:00 stderr F    k8s.io/[email protected]/pkg/util/wait/wait.go:596 +0x9a fp=0xc000ea8c38 sp=0xc000ea8bf8 pc=0x108703a
2023-06-29T19:38:59.267654508+08:00 stderr F k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext({0x1f4b530, 0xc000c8e0c0}, 0x20?, 0x2e19a80?)
2023-06-29T19:38:59.267658202+08:00 stderr F    k8s.io/[email protected]/pkg/util/wait/wait.go:547 +0x49 fp=0xc000ea8c78 sp=0xc000ea8c38 pc=0x1086f49
2023-06-29T19:38:59.267661921+08:00 stderr F k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil(0xc0012ad9b0?, 0x0?, 0xc000720d10?)
2023-06-29T19:38:59.267665694+08:00 stderr F    k8s.io/[email protected]/pkg/util/wait/wait.go:538 +0x7c fp=0xc000ea8ce8 sp=0xc000ea8c78 pc=0x1086e7c
2023-06-29T19:38:59.267669273+08:00 stderr F k8s.io/client-go/tools/cache.WaitForCacheSync(0xc0007305a0?, {0xc00055a470, 0x1, 0x1})
2023-06-29T19:38:59.267673449+08:00 stderr F    k8s.io/[email protected]/tools/cache/shared_informer.go:288 +0x88 fp=0xc000ea8d38 sp=0xc000ea8ce8 pc=0x14b5988
2023-06-29T19:38:59.268143623+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/cache/internal.(*specificInformersMap).Get(0x403e56?, {0x1f4b5d8, 0xc0011376e0}, {{0x1c82d57, 0x19}, {0x1c613fc, 0x2}, {0x1869344, 0xb}}, {0x1f378c0, ...})
2023-06-29T19:38:59.268255075+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/cache/internal/informers_map.go:208 +0x1dd fp=0xc000ea8de8 sp=0xc000ea8d38 pc=0x14c20dd
2023-06-29T19:38:59.268262915+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/cache/internal.(*InformersMap).Get(0xc000720ed8?, {0x1f4b5d8?, 0xc0011376e0?}, {{0x1c82d57, 0x19}, {0x1c613fc, 0x2}, {0x1869344, 0xb}}, {0x1f378c0, ...})
2023-06-29T19:38:59.268267921+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/cache/internal/deleg_map.go:106 +0x1fc fp=0xc000ea8e50 sp=0xc000ea8de8 pc=0x14c10fc
2023-06-29T19:38:59.268280169+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/cache.(*informerCache).Get(0xc00011de58, {0x1f4b5d8, 0xc0011376e0}, {{0x1c6cada?, 0x19?}, {0x1c75dbd?, 0x2?}}, {0x1f623b0?, 0xc000169900?}, {0x0, ...})
2023-06-29T19:38:59.268288586+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/cache/informer_cache.go:60 +0x1a5 fp=0xc000ea8f58 sp=0xc000ea8e50 pc=0x14c7105
2023-06-29T19:38:59.268304866+08:00 stderr F sigs.k8s.io/cluster-api/controllers/remote.(*stoppableCache).Get(0x2e19a80?, {0x1f4b5d8?, 0xc0011376e0?}, {{0x1c6cada?, 0xc000721028?}, {0x1c75dbd?, 0x10?}}, {0x1f623b0?, 0xc000169900?}, {0x0, ...})
2023-06-29T19:38:59.268310266+08:00 stderr F    <autogenerated>:1 +0x86 fp=0xc000ea8fc8 sp=0xc000ea8f58 pc=0x15a0e06
2023-06-29T19:38:59.268314821+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/client.(*delegatingReader).Get(0xc000d22ec0, {0x1f4b5d8, 0xc0011376e0}, {{0x1c6cada?, 0xc0007210c8?}, {0x1c75dbd?, 0x1bca0c0?}}, {0x1f623b0?, 0xc000169900?}, {0x0, ...})
2023-06-29T19:38:59.268319147+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/client/split.go:132 +0x1ab fp=0xc000ea9068 sp=0xc000ea8fc8 pc=0x149c62b
2023-06-29T19:38:59.26835893+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/client.(*delegatingClient).Get(0xc000b5c700?, {0x1f4b5d8?, 0xc0011376e0?}, {{0x1c6cada?, 0x40d8bf?}, {0x1c75dbd?, 0x80?}}, {0x1f623b0?, 0xc000169900?}, {0x0, ...})
2023-06-29T19:38:59.268376782+08:00 stderr F    <autogenerated>:1 +0x86 fp=0xc000ea90d8 sp=0xc000ea9068 pc=0x14a5ac6
2023-06-29T19:38:59.268407267+08:00 stderr F sigs.k8s.io/cluster-api/controlplane/kubeadm/internal.(*Workload).EnsureResource(0xc000a12180, {0x1f4b5d8, 0xc0011376e0}, {0x1f623b0, 0xc0001697c0})
2023-06-29T19:38:59.268420905+08:00 stderr F    sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/workload_cluster_rbac.go:55 +0x103 fp=0xc000ea91b8 sp=0xc000ea90d8 pc=0x182c143
2023-06-29T19:38:59.268432866+08:00 stderr F sigs.k8s.io/cluster-api/controlplane/kubeadm/internal.(*Workload).AllowBootstrapTokensToGetNodes(0xc0000dd530?, {0x1f4b5d8, 0xc0011376e0})
2023-06-29T19:38:59.268448472+08:00 stderr F    sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/workload_cluster_rbac.go:71 +0x1aa fp=0xc000ea9210 sp=0xc000ea91b8 pc=0x182c56a
2023-06-29T19:38:59.26849179+08:00 stderr F sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).reconcile(0xc0001a1d60, {0x1f4b5d8, 0xc0011376e0}, 0xc00099f860, 0xc00060a700)
2023-06-29T19:38:59.268502139+08:00 stderr F    sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:421 +0x168a fp=0xc000ea9958 sp=0xc000ea9210 pc=0x183540a
2023-06-29T19:38:59.268565527+08:00 stderr F sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers.(*KubeadmControlPlaneReconciler).Reconcile(0xc0001a1d60, {0x1f4b5d8, 0xc001137530}, {{{0xc000be5416?, 0x10?}, {0xc000478030?, 0x40dc07?}}})
2023-06-29T19:38:59.268572212+08:00 stderr F    sigs.k8s.io/cluster-api/controlplane/kubeadm/internal/controllers/controller.go:233 +0x7f0 fp=0xc000ea9cc0 sp=0xc000ea9958 pc=0x1832eb0
2023-06-29T19:38:59.268598232+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1f4b5d8?, {0x1f4b5d8?, 0xc001137530?}, {{{0xc000be5416?, 0x1961da0?}, {0xc000478030?, 0x0?}}})
2023-06-29T19:38:59.268613358+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122 +0xc8 fp=0xc000ea9d60 sp=0xc000ea9cc0 pc=0x14d4128
2023-06-29T19:38:59.268647867+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0001a1e00, {0x1f4b530, 0xc000050a40}, {0x1a71d80?, 0xc001144a00?})
2023-06-29T19:38:59.268664604+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323 +0x38f fp=0xc000ea9ee0 sp=0xc000ea9d60 pc=0x14d628f
2023-06-29T19:38:59.268680413+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0001a1e00, {0x1f4b530, 0xc000050a40})
2023-06-29T19:38:59.268699436+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274 +0x1d9 fp=0xc000ea9f80 sp=0xc000ea9ee0 pc=0x14d5a79
2023-06-29T19:38:59.268713362+08:00 stderr F sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
2023-06-29T19:38:59.268724902+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235 +0x85 fp=0xc000ea9fe0 sp=0xc000ea9f80 pc=0x14d54c5
2023-06-29T19:38:59.268733815+08:00 stderr F runtime.goexit()
2023-06-29T19:38:59.268746223+08:00 stderr F    runtime/asm_amd64.s:1594 +0x1 fp=0xc000ea9fe8 sp=0xc000ea9fe0 pc=0x46ddc1
2023-06-29T19:38:59.268763535+08:00 stderr F created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
2023-06-29T19:38:59.268769063+08:00 stderr F    sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:231 +0x333

While analyzing this problem, I found that KCP obtains the workload cluster client through ClusterCacheTracker.GetClient. When this client is used to get a resource kind for the first time, it starts an informer to cache that resource, and the Get request does not return until informer.HasSynced() is true.

// Get will create a new Informer and add it to the map of specificInformersMap if none exists.  Returns
// the Informer from the map.
func (ip *specificInformersMap) Get(ctx context.Context, gvk schema.GroupVersionKind, obj runtime.Object) (bool, *MapEntry, error) {
	// Return the informer if it is found
	i, started, ok := func() (*MapEntry, bool, bool) {
		ip.mu.RLock()
		defer ip.mu.RUnlock()
		i, ok := ip.informersByGVK[gvk]
		return i, ip.started, ok
	}()

	if !ok {
		var err error
		if i, started, err = ip.addInformerToMap(gvk, obj); err != nil {
			return started, nil, err
		}
	}

	if started && !i.Informer.HasSynced() {
		// Wait for it to sync before returning the Informer so that folks don't read from a stale cache.
		if !cache.WaitForCacheSync(ctx.Done(), i.Informer.HasSynced) {
			return started, nil, apierrors.NewTimeoutError(fmt.Sprintf("failed waiting for %T Informer to sync", obj), 0)
		}
	}

	return started, i, nil
}
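
For illustration only (this is not the KCP code): the WaitForCacheSync above is bounded solely by the caller's context, so with context.Background() (as in the KCP reconcile path) the Get blocks forever, while a context with a deadline turns the hang into an error. A minimal sketch, assuming remoteClient is the cached client returned by ClusterCacheTracker.GetClient and using an illustrative ClusterRole name:

package sketch

import (
	"context"
	"time"

	rbacv1 "k8s.io/api/rbac/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getClusterRoleWithDeadline bounds the cached-client Get: if the remote
// cache never syncs, WaitForCacheSync observes ctx.Done() and the Get
// returns a timeout error instead of blocking the reconcile forever.
func getClusterRoleWithDeadline(remoteClient client.Client) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	role := &rbacv1.ClusterRole{}
	// "kubeadm:get-nodes" is just an illustrative object name here.
	return remoteClient.Get(ctx, client.ObjectKey{Name: "kubeadm:get-nodes"}, role)
}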

When the workload cluster API Server is unreachable, the Get request is blocked because informer.HasSynced() is false.
And after ClusterCacheTracker.healthCheckCluster fails, the informer cache no longer tries to synchronize, so informer.HasSynced() stays false forever.

func (t *ClusterCacheTracker) healthCheckCluster(ctx context.Context, in *healthCheckInput) {
	// populate optional params for healthCheckInput
	in.setDefaults()

	unhealthyCount := 0

	// This gets us a client that can make raw http(s) calls to the remote apiserver. We only need to create it once
	// and we can reuse it inside the polling loop.
	codec := runtime.NoopEncoder{Decoder: scheme.Codecs.UniversalDecoder()}
	cfg := rest.CopyConfig(in.cfg)
	cfg.NegotiatedSerializer = serializer.NegotiatedSerializerWrapper(runtime.SerializerInfo{Serializer: codec})
	restClient, restClientErr := rest.UnversionedRESTClientFor(cfg)

	runHealthCheckWithThreshold := func() (bool, error) {
		if restClientErr != nil {
			return false, restClientErr
		}

		cluster := &clusterv1.Cluster{}
		if err := t.client.Get(ctx, in.cluster, cluster); err != nil {
			if apierrors.IsNotFound(err) {
				// If the cluster can't be found, we should delete the cache.
				return false, err
			}
			// Otherwise, requeue.
			return false, nil
		}

		if !cluster.Status.InfrastructureReady || !conditions.IsTrue(cluster, clusterv1.ControlPlaneInitializedCondition) {
			// If the infrastructure or control plane aren't marked as ready, we should requeue and wait.
			return false, nil
		}

		if _, ok := t.loadAccessor(in.cluster); !ok {
			// If there is no accessor but the cluster is locked, we're probably in the middle of the cluster accessor
			// creation and we should requeue the health check until it's done.
			if ok := t.clusterLock.TryLock(in.cluster); !ok {
				t.log.V(4).Info("Waiting for cluster to be unlocked. Requeuing health check")
				return false, nil
			}
			t.clusterLock.Unlock(in.cluster)
			// Cache for this cluster has already been cleaned up.
			// Nothing to do, so return true.
			return true, nil
		}

		// An error here means there was either an issue connecting or the API returned an error.
		// If no error occurs, reset the unhealthy counter.
		_, err := restClient.Get().AbsPath(in.path).Timeout(in.requestTimeout).DoRaw(ctx)
		if err != nil {
			if apierrors.IsUnauthorized(err) {
				// Unauthorized means that the underlying kubeconfig is not authorizing properly anymore, which
				// usually is the result of automatic kubeconfig refreshes, meaning that we have to throw away the
				// clusterAccessor and rely on the creation of a new one (with a refreshed kubeconfig)
				return false, err
			}
			unhealthyCount++
		} else {
			unhealthyCount = 0
		}

		if unhealthyCount >= in.unhealthyThreshold {
			// Cluster is now considered unhealthy.
			return false, err
		}

		return false, nil
	}

	err := wait.PollImmediateUntil(in.interval, runHealthCheckWithThreshold, ctx.Done())
	// An error returned implies the health check has failed a sufficient number of
	// times for the cluster to be considered unhealthy
	// NB. we are ignoring ErrWaitTimeout because this error happens when the channel is close, that in this case
	// happens when the cache is explicitly stopped.
	if err != nil && err != wait.ErrWaitTimeout {
		t.log.Error(err, "Error health checking cluster", "Cluster", klog.KRef(in.cluster.Namespace, in.cluster.Name))
		t.deleteAccessor(ctx, in.cluster)
	}
}

func (t *ClusterCacheTracker) deleteAccessor(_ context.Context, cluster client.ObjectKey) {
	t.clusterAccessorsLock.Lock()
	defer t.clusterAccessorsLock.Unlock()

	a, exists := t.clusterAccessors[cluster]
	if !exists {
		return
	}

	log := t.log.WithValues("Cluster", klog.KRef(cluster.Namespace, cluster.Name))
	log.V(2).Info("Deleting clusterAccessor")
	log.V(4).Info("Stopping cache")
	a.cache.Stop()
	log.V(4).Info("Cache stopped")

	delete(t.clusterAccessors, cluster)
}

After that, even if the workload cluster API Server becomes reachable again, the Get request stays blocked and causes the KCP reconcile to hang.

What did you expect to happen?

When the workload cluster API server is accessible, KCP can reconcile normally.

Cluster API version

v1.4.3

Kubernetes version

v1.24.13

Anything else you would like to add?

This issue can be easily reproduced by the following method.

  1. Build KCP from this commit smartxworks@1991009 and update the KCP Deployment.
  2. When KCP first prints the log start wait for bug, inject a fault so that the workload cluster API server becomes unreachable.
  3. When KCP prints the log remote/ClusterCacheTracker: Error health checking cluster, the problem is reproduced. Even if the workload cluster API server becomes reachable again, KCP will not reconcile.

Label(s) to be applied

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 3, 2023
@killianmuldoon (Contributor):

/triage accepted

I'll take a look at reproducing this. Thanks for reporting, and thanks for supplying clear steps to reproduce this issue.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 3, 2023
@Levi080513 (Contributor Author):

@killianmuldoon Have you tried to reproduce the problem?

@jessehu (Contributor) commented Jul 7, 2023:

@killianmuldoon we can easily reproduce this issue in a single-control-plane cluster by injecting a fault that makes the workload cluster apiserver unreachable.

@Levi080513 (Contributor Author):

This problem does not seem to reproduce with the latest code.
When the workload cluster API Server is unreachable, a failing ClusterCacheTracker.healthCheckCluster no longer causes the cache to be stopped. Is this expected?
Maybe it has something to do with this change?

if err != nil && !wait.Interrupted(err) {

@sbueringer (Member):

Hm no this code just ignores errors if the cache was explicitly shutdown. But we only do this after Cluster deletion or when we couldn't sync in createClient

@Levi080513 (Contributor Author):

smartxworks@2f1d311
I added log printing and retested this scenario. When unhealthyCount >= in.unhealthyThreshold, the error looks like this:

Get "https://10.255.26.34:6443/?timeout=5s": context deadline exceeded
true
Get "https://10.255.26.34:6443/?timeout=5s": context deadline exceeded
true

It looks like the error is indeed being ignored.
This error should be redefined.

@sbueringer (Member):

Thx! I think now I understand it.

The problem is that the wait utils in k8s.io/apimachinery were changed with v0.27 (we picked it up with CR v0.15).

// ErrWaitTimeout is returned when the condition was not satisfied in time.
//
// Deprecated: This type will be made private in favor of Interrupted()
// for checking errors or ErrorInterrupted(err) for returning a wrapped error.
var ErrWaitTimeout = ErrorInterrupted(errors.New("timed out waiting for the condition"))

Previously we just checked for this error. But now we use the util func they recommended. And that one is capturing more:

func Interrupted(err error) bool {
	switch {
	case errors.Is(err, errWaitTimeout),
		errors.Is(err, context.Canceled),
		errors.Is(err, context.DeadlineExceeded):
		return true
	default:
		return false
	}
}
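
A small self-contained illustration (not CAPI code) of the consequence: the final error returned by the health check wraps context.DeadlineExceeded (matching the debug output shown above), so wait.Interrupted reports it as an interruption and the err != nil && !wait.Interrupted(err) guard never reaches deleteAccessor:

package main

import (
	"context"
	"fmt"
	"net/url"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	// Same shape as the error in the logs: a request that hit its 5s
	// deadline, wrapped by the HTTP client in *url.Error.
	err := &url.Error{
		Op:  "Get",
		URL: "https://10.255.26.34:6443/?timeout=5s",
		Err: context.DeadlineExceeded,
	}

	fmt.Println(wait.Interrupted(err)) // prints: true

	// Guard used on main/release-1.5: because Interrupted(err) is true,
	// the accessor is never deleted on this path.
	if err != nil && !wait.Interrupted(err) {
		fmt.Println("would delete the cluster accessor")
	}
}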

@sbueringer (Member):

But it doesn't make any sense because this change is only in main & release-1.5. It doesn't exist on release-1.4

@sbueringer (Member):

I wonder if we should just always delete the accessor

@sbueringer (Member):

For release-1.5 and main Interrupted is definitely the problem:
[screenshot]

@sbueringer (Member):

This should fix it for main and release-1.5 #9025

I'll take a look at release-1.4 now

@sbueringer (Member):

@Levi080513 @jessehu Just to get as much data as possible. With which versions of CAPI did you encounter this issue?
(just so I know on which branches I should investigate this)

@Levi080513 (Contributor Author):

v1.4.3, with the lazyRestMapper feature turned off.

@sbueringer (Member) commented Jul 21, 2023:

Okay so back to your issue on release-1.4. I think I know what's going on.

Initial client creation:

  • controlplane/kubeadm/internal/cluster.go:115 m.Tracker.GetClient
    • controllers/remote/cluster_cache_tracker.go:331 if !cache.WaitForCacheSync(cacheSyncCtx) {
      • This ensures the cache is synced at time of client creation
      • The client is now cached

APIserver becomes unreachable

In a Reconcile after APIServer becomes unreachable

  • controlplane/kubeadm/internal/workload_cluster_rbac.go:55 w.Client.Get
    • pkg/cache/internal/informers_map.go:208
      • We didn't request this resource before so the Get call now creates a new informer and then waits until it is synced
      • Because the cache never syncs and we don't set a timeout in this call via ctx we're stuck forever

I think at this point it doesn't matter if the health check deletes the clusterAccessor / client / cache, as we'll just stay stuck.
The cacheCtx we cancel when stopping the cache has no impact on the Get call, as it is not used there (potential idea for a fix).

[screenshot: goroutine stuck in WaitForCacheSync with no timeout in the ctx]

@Levi080513 (Contributor Author) commented Jul 21, 2023:

Yes, what you said is correct.
One fix is to cancel all get/list requests when the cacheCtx is closed.
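
For illustration, a rough sketch of that idea (names are illustrative; this is not the fix that was eventually merged): wrap the cached client so that every call's context is also cancelled when the cache's lifetime context (cacheCtx) is closed, which would unblock a Get stuck in WaitForCacheSync once the accessor is deleted.

package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cacheScopedClient is a hypothetical wrapper: its calls are cancelled when
// either the caller's context or the cache's lifetime context is done.
type cacheScopedClient struct {
	client.Client
	cacheCtx context.Context // cancelled when the clusterAccessor's cache is stopped
}

// boundCtx derives a context that ends as soon as either ctx or cacheCtx is
// done, so a blocked Get cannot outlive the cache it is waiting on.
func (c *cacheScopedClient) boundCtx(ctx context.Context) (context.Context, context.CancelFunc) {
	merged, cancel := context.WithCancel(ctx)
	go func() {
		select {
		case <-c.cacheCtx.Done():
			cancel()
		case <-merged.Done():
		}
	}()
	return merged, cancel
}

func (c *cacheScopedClient) Get(ctx context.Context, key client.ObjectKey, obj client.Object, opts ...client.GetOption) error {
	ctx, cancel := c.boundCtx(ctx)
	defer cancel()
	return c.Client.Get(ctx, key, obj, opts...)
}

List and the other verbs would be wrapped the same way.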

@Levi080513 (Contributor Author):

Can I try to fix this? @sbueringer

@sbueringer (Member) commented Jul 21, 2023:

Oh sorry. Already working on a fix. PR should be up in a bit

(want to get this up / reviewed / merged ASAP to get it into the release & patch releases on Tuesday)

@sbueringer (Member):

@jessehu @Levi080513 It would be super helpful if you can take a look at #9027 and potentially verify if the fix works for you (it works for me)

@jessehu (Contributor) commented Jul 21, 2023:

Thanks a lot @sbueringer. We verified the PR works as expected.

@sbueringer (Member) commented Jul 21, 2023:

We'll cherry-pick both fixes. So with the next series of releases (next Tuesday) all should be fixed
