calico-node container reboot missed the serviceaccount token update event #7208

Closed
AndyEWang opened this issue Jan 18, 2023 · 1 comment

Comments

@AndyEWang

When the calico-node container restarts, it misses the serviceaccount token update event and does not update /host/etc/cni/net.d/calico-kubeconfig. This leads to the following Calico CNI errors:
2023-01-16 05:06:47.002 [ERROR][48795] plugin.go 121: Final result of CNI ADD was an error. error=error getting ClusterInformation: connection is unauthorized: Unauthorized
2023-01-16 05:06:47.111 [ERROR][48860] plugin.go 518: Final result of CNI DEL was an error. error=error getting ClusterInformation: connection is unauthorized: Unauthorized

Expected Behavior

Calico CNI should be able to access the apiserver as soon as the calico-node container has restarted, without waiting for the next token refresh event.

Current Behavior

Calico CNI cannot access the apiserver until the next token refresh event, which takes about 1 hour by default.

Possible Solution

The calico-node container boot process should update /host/etc/cni/net.d/calico-kubeconfig as early as possible, rather than waiting for the next token refresh event.
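
A minimal sketch of that idea, written in Python purely for illustration (it is not Calico's code). It assumes the kubeconfig keeps the credential on a single `token:` line and that the mounted token should be copied into it verbatim. The function name `sync_kubeconfig_token` and both path parameters are hypothetical.

```python
import os
import re
import tempfile

# Assumption: the credential sits on a single "token: <value>" line of the kubeconfig.
_TOKEN_LINE = re.compile(r"^(\s*token:\s*)(\S+)[ \t]*$", re.MULTILINE)


def sync_kubeconfig_token(token_path: str, kubeconfig_path: str) -> bool:
    """Rewrite the kubeconfig if its token differs from the mounted one.

    Returns True when the file was rewritten. Meant to be called once at
    startup, before any wait for token update events begins.
    """
    with open(token_path, encoding="utf-8") as f:
        current = f.read().strip()
    with open(kubeconfig_path, encoding="utf-8") as f:
        config = f.read()

    match = _TOKEN_LINE.search(config)
    if match is None or match.group(2) == current:
        return False  # no token line found, or already up to date

    updated = _TOKEN_LINE.sub(lambda m: m.group(1) + current, config, count=1)

    # Write to a temp file in the same directory, then rename it into place,
    # so a reader never sees a half-written kubeconfig.
    directory = os.path.dirname(os.path.abspath(kubeconfig_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(updated)
        os.replace(tmp_path, kubeconfig_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True
```

Calling this once during startup, with the mounted serviceaccount token path and the calico-kubeconfig path, would close the gap described above under these assumptions. The rename step matters because the CNI plugin may read the file at any moment.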

Steps to Reproduce (for bugs)

1. Scale the Typha replicas down to zero.
2. The calico-node container begins to restart because of the Typha access error.
3. kubelet keeps restarting the calico-node container with an exponential back-off delay.
4. Wait until the serviceaccount token is updated on the host ("inspect" the container to find the mounted host directory).
5. Restore Typha and let calico-node start successfully.
6. Compare /etc/cni/net.d/calico-kubeconfig with the serviceaccount token in the host directory.
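
The comparison in the last step can be scripted. The following is a rough Python sketch, assuming the kubeconfig stores the credential on a single `token:` line and that the token is a JWT whose payload carries an `exp` claim. The helper names (`extract_token`, `jwt_expiry`, `kubeconfig_is_stale`) are hypothetical, and the signature is not verified.

```python
import base64
import json
import re
from typing import Optional

# Assumption: the kubeconfig stores the credential on a single "token: <value>" line.
_TOKEN_LINE = re.compile(r"^\s*token:\s*(\S+)\s*$", re.MULTILINE)


def extract_token(kubeconfig_text: str) -> Optional[str]:
    """Return the bearer token found in the kubeconfig text, or None."""
    match = _TOKEN_LINE.search(kubeconfig_text)
    return match.group(1).strip("\"'") if match else None


def jwt_expiry(token: str) -> Optional[int]:
    """Return the 'exp' claim (Unix seconds) of a JWT without verifying it."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    payload = parts[1] + "=" * (-len(parts[1]) % 4)  # restore base64 padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(payload))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return None
    exp = claims.get("exp") if isinstance(claims, dict) else None
    return exp if isinstance(exp, int) else None


def kubeconfig_is_stale(kubeconfig_text: str, mounted_token: str) -> bool:
    """True when the kubeconfig token differs from the mounted token."""
    return extract_token(kubeconfig_text) != mounted_token.strip()
```

Reading both files on the node and passing their contents to `kubeconfig_is_stale` shows whether the kubeconfig still carries the old token. `jwt_expiry` on each token shows how far apart their expiry times are.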

Context

Our cluster relies on OpenYurt for node autonomy. When node autonomy is enabled, the Pods on the node are not rescheduled and keep running. If the node's network goes down, the calico-node container restarts because of the Typha access error. After the network recovers, calico-node returns to the Running state, but the CNI plugin is not authorized to access the apiserver. This causes the CNI ADD and DEL errors.

Your Environment

  • Calico version: v3.21.4
  • Orchestrator version (e.g. kubernetes, mesos, rkt): k8s 1.22
  • Operating System and version: CentOS 8.4
@Josh-Tigera
Contributor

The good news is that this looks very much like another issue already on our radar which is currently being worked on: #7171

The bad news is that the fix will not be ported into the next maintenance release of v3.21 since that's outside of our support window (current - 3).

Going to close this issue, but please do take a look at the issue I linked. If you think what you're seeing is sufficiently different and warrants a separate look, we can re-open and evaluate from there.
