calico-node container reboot missed the serviceaccount token update event #7208

AndyEWang · 2023-01-18T06:09:28Z

When the calico-node container restarts, it misses the serviceaccount token update event and does not update /host/etc/cni/net.d/calico-kubeconfig. This leads to the following Calico CNI errors:
2023-01-16 05:06:47.002 [ERROR][48795] plugin.go 121: Final result of CNI ADD was an error. error=error getting ClusterInformation: connection is unauthorized: Unauthorized
2023-01-16 05:06:47.111 [ERROR][48860] plugin.go 518: Final result of CNI DEL was an error. error=error getting ClusterInformation: connection is unauthorized: Unauthorized

Expected Behavior

Calico CNI should be able to access the apiserver as soon as the calico-node container has restarted, without waiting for the next token refresh event.

Current Behavior

Calico CNI cannot access the apiserver until the next token refresh event, which can take about 1 hour by default.

Possible Solution

The calico-node container boot process should update /host/etc/cni/net.d/calico-kubeconfig as soon as possible.
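As a rough illustration of the idea, the sketch below rewrites the token in a kubeconfig file from the currently mounted serviceaccount token at startup, without waiting for a refresh event. This is not how calico-node is implemented; `refresh_kubeconfig_token` is a hypothetical helper, and it assumes the kubeconfig stores the credential on a single `token:` line. Both file paths are passed in by the caller.

```python
import os
import re
import tempfile

# Assumption: the kubeconfig holds the credential on one "token: <value>" line.
TOKEN_LINE = re.compile(r"^([ \t]*token:[ \t]*).*$", re.MULTILINE)


def refresh_kubeconfig_token(kubeconfig_path: str, token_path: str) -> bool:
    """Rewrite the kubeconfig's token if it differs from the mounted one.

    Returns True when the file was rewritten, False when it was already current.
    """
    with open(token_path, encoding="utf-8") as f:
        current = f.read().strip()
    with open(kubeconfig_path, encoding="utf-8") as f:
        config = f.read()

    # Replace only the first token line; a function replacement avoids any
    # escape handling of the token text.
    updated, count = TOKEN_LINE.subn(lambda m: m.group(1) + current, config, count=1)
    if count == 0:
        raise ValueError("no 'token:' line found in kubeconfig")
    if updated == config:
        return False

    # Write to a temp file in the same directory, then rename it into place so
    # a concurrent reader never sees a half-written kubeconfig.
    directory = os.path.dirname(os.path.abspath(kubeconfig_path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(updated)
        os.chmod(tmp, 0o600)
        os.replace(tmp, kubeconfig_path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True
```

Calling this once during container startup, before the regular refresh loop begins, would close the window described in this issue under the assumptions above.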

Steps to Reproduce (for bugs)

1. Scale the Typha replicas to zero.
2. The calico-node container begins to restart because of the Typha access error.
3. kubelet keeps restarting the calico-node container with an exponential back-off delay.
4. Wait until the serviceaccount token is updated on the host ("inspect" the container to find the mounted host directory).
5. Restore Typha and let calico-node restart successfully.
6. Compare /etc/cni/net.d/calico-kubeconfig with the serviceaccount token in the host directory.
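The comparison in the last step can be scripted. The sketch below extracts the token from the kubeconfig text, compares it with the mounted token, and reads the `exp` claim of the kubeconfig token to see whether it has already expired. It assumes the kubeconfig carries the token on a single `token:` line and that the token is a JWT; the function names are illustrative, and the signature is not verified.

```python
import base64
import json
import re
import time
from typing import Optional


def token_from_kubeconfig(text: str) -> Optional[str]:
    """Return the first `token:` value found in a kubeconfig, or None."""
    match = re.search(r"^[ \t]*token:[ \t]*(\S+)[ \t]*$", text, re.MULTILINE)
    if not match:
        return None
    return match.group(1).strip("'\"")


def jwt_expiry(token: str) -> Optional[int]:
    """Return the `exp` claim (Unix seconds) of a JWT without verifying it."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload = parts[1] + "=" * (-len(parts[1]) % 4)  # restore base64 padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:
        return None
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    return int(exp) if isinstance(exp, (int, float)) else None


def compare_tokens(kubeconfig_text: str, mounted_token: str,
                   now: Optional[float] = None) -> dict:
    """Report whether the kubeconfig token matches the mounted token and
    whether the kubeconfig token has expired (None if that is unknown)."""
    now = time.time() if now is None else now
    kube_token = token_from_kubeconfig(kubeconfig_text)
    exp = jwt_expiry(kube_token) if kube_token else None
    return {
        "same": kube_token is not None and kube_token == mounted_token.strip(),
        "kubeconfig_expired": None if exp is None else exp <= now,
    }
```

Feeding it the contents of /etc/cni/net.d/calico-kubeconfig and the token file from the mounted host directory should report `same: False` when the bug described here has occurred.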

Context

Our cluster depends on OpenYurt to implement node autonomy. When node autonomy is enabled, the Pods on the node are not rescheduled and keep running. If the node's network goes down, the calico-node container restarts because of the Typha access error. After the node's network recovers, calico-node is running again, but the CNI plugin is not allowed to access the apiserver. This leads to the CNI ADD and DEL errors.

Your Environment

Calico version: v3.21.4
Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.22
Operating System and version: CentOS 8.4

Josh-Tigera · 2023-01-23T17:59:05Z

The good news is that this looks very much like another issue already on our radar which is currently being worked on: #7171

The bad news is that the fix will not be ported into the next maintenance release of v3.21 since that's outside of our support window (current - 3).

Going to close this issue but please do take a look at the issue I linked and if you think what you're seeing is sufficiently different and warrants a separate look we can re-open and evaluate from there.

Josh-Tigera added kind/bug impact/high likelihood/low labels Jan 23, 2023

Josh-Tigera closed this as completed Jan 23, 2023
