calico pods not ready/up after kubernetes node reboot #6687
Comments
Looks like it is attempting to get a token, but is being refused. Something to check here would be your apiserver logs to see why it is rejecting the TokenRequest.
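For example, something along these lines (a rough sketch; the apiserver pod name is only a guess and will differ per control-plane node):

```sh
# Assumes a kubeadm-style static-pod apiserver; grep recent logs for
# token/authentication failures around the time of the reboot.
kubectl -n kube-system logs kube-apiserver-master-1 --since=1h \
  | grep -iE 'token|unauthorized' | tail -n 50
```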
Here are the apiserver logs. kube-proxy basically has only unauthorized messages; one of them is here --> https://justpaste.it/9p381. Not sure if this helps, but this is how the cluster is currently created...
which I guess should mean the same signing for all apiservers, right?
I see that the apiserver logs are complaining about the token:
which looks like two different tokens are being used:
https://github.com/kubernetes/kubernetes/blob/03c76decb2c2e74b2bdd9e46f15c84235f6c6cd3/pkg/serviceaccount/claims.go#L124-L129
I'm not exactly sure what that means. WDYT @caseydavenport?
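One rough way to see what a given pod is actually presenting (a sketch; the pod name is an example, and the JWT payload is simply the second dot-separated field of the mounted token):

```sh
# Dump the claims (issuer, audience, expiry, bound pod/node) of the projected
# service account token mounted in an affected pod, so they can be compared
# with what the apiserver complains about. Trailing padding warnings from
# base64 can be ignored.
kubectl -n kube-system exec calico-node-6msxr -c calico-node -- \
  cat /var/run/secrets/kubernetes.io/serviceaccount/token \
  | cut -d. -f2 | tr '_-' '/+' | base64 -d 2>/dev/null
echo
```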
@vinayus did you try the reboot on a clean v3.24.x cluster?
hi @lmm, if you are referring to a VM reboot then yes, and that is how it is reproduced. The upgrade was done using ...
If kube-proxy is experiencing the same issue then this is definitely not a Calico problem, but a problem with the service account token signing or distribution infrastructure for the cluster. Somehow the tokens being provided to clients on that node (kube-proxy and Calico) are outdated. A VM reboot causing the issue points to perhaps something invalidating the node's tokens on shutdown? @lmm I'm guessing the two tokens are from different components, since it seems like more than just Calico is hitting this issue.
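One rough way to test the stale-token theory (a sketch, not a definitive procedure; the pod name is an example and it assumes the calico-node image ships basic coreutils): the kubelet refreshes projected service account tokens, so the mounted file should have a recent modification time.

```sh
# Compare the token file's modification time with the container's clock;
# a token last written before the reboot would support the theory.
kubectl -n kube-system exec calico-node-6msxr -c calico-node -- \
  ls -lL /var/run/secrets/kubernetes.io/serviceaccount/token
kubectl -n kube-system exec calico-node-6msxr -c calico-node -- date
```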
I don't see any errors in the pod description or the logs of kube-proxy. They look like the below, at least for now:
Just a couple of times out of 20 trials, we had a proper recovery without any failures. We tried to narrow down the difference, but so far we have been unlucky. One observation is that a few calico-node pods are ready (1/1) but still post unauthorized messages in their logs, like the ones below.
Under the apiserver (with verbose logging) we do see unauthorized errors for GET calls, which seems related, maybe?
If there is any other info I can capture, please do ask; I will try to gather as much as possible.
Looks like we figured out the issue. The main root cause seems to be NTP sync between the hypervisor (ESXi in our case) and these VMs. The journalctl logs of ntpd showed that the slew was -28000s, which maps to the hypervisor's time being way off. Sadly, the hypervisor time was set manually and is not in sync with any public time servers, so all the underlying VMs were affected. We overlooked it because we always ended up checking the time a couple of minutes later, by which point it would already have been back in sync. As a simple precaution, I added dependencies to the docker and kubelet systemd unit files to wait for the NTP service to be properly up before starting, and inside the ntp systemd service we added a forced immediate sync (...). This was not an issue with Calico or Kubernetes, sadly. Apologies for the incorrect issue; please close it if nobody else is hitting this case. I have cross-checked this on both Kubernetes 1.22.12 and 1.23.10 with Calico 3.24.1.
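For anyone else hitting the same thing, a sketch of the kind of ordering drop-in described above (unit and file names are assumptions; adjust ntpd.service to chronyd.service or systemd-timesyncd.service as appropriate, and repeat for docker.service):

```sh
# Make kubelet start only after the NTP daemon is up.
sudo mkdir -p /etc/systemd/system/kubelet.service.d
sudo tee /etc/systemd/system/kubelet.service.d/10-wait-for-ntp.conf <<'EOF'
[Unit]
Wants=ntpd.service
After=ntpd.service
EOF
sudo systemctl daemon-reload
# A forced step on boot (e.g. running `ntpd -gq` once before the daemon stays
# resident) is one common way to get the immediate sync mentioned above; the
# exact mechanism depends on which time daemon the distro uses.
```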
We have a Kubernetes cluster (3 masters and 9 worker nodes), and during a maintenance cycle all VMs of the cluster are powered off. When the VMs are powered back on, calico-node and other pods are not healthy. I have updated Calico from the earlier 3.22 to 3.24.1 and have also tried on a clean cluster.
Pods are in CreateContainerError and some log unauthorized messages in the container console.
As a workaround, I am restarting all Calico pods, which returns them to a healthy state after a while.
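Roughly, that workaround amounts to something like the following (a sketch; it assumes the manifest-based install where Calico runs in the kube-system namespace):

```sh
# Restart the Calico daemonset and controller deployment so the pods are
# recreated and mount fresh service account tokens.
kubectl -n kube-system rollout restart daemonset/calico-node
kubectl -n kube-system rollout restart deployment/calico-kube-controllers
```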
I have been going through multiple issues such as #5910 but haven't had any luck. I added a couple of extra ClusterRoles, which didn't help. The calico.yaml manifest is from the official site, with only the image repository changed since the cluster has no internet access.
Kindly help
Expected Behavior
Pods should recover or restart themselves.
Current Behavior
Unhealthy pods observed in multiple namespaces
kubectl get pods -n kube-system -> https://justpaste.it/950st
kubectl logs -n kube-system calico-kube-controllers-74574fb497-f97c2 --tail=100 -> https://justpaste.it/9cplv
kubectl describe pod -n kube-system calico-kube-controllers-74574fb497-f97c2 -> https://justpaste.it/8h8zl
kubectl logs -n kube-system calico-node-6msxr -c install-cni --tail=100 -> https://justpaste.it/2sq4h
kubectl describe pod -n kube-system calico-node-6msxr -> https://justpaste.it/7qg8c
kubectl logs -n kube-system coredns-6c99bc4cf8-qdn6x -> https://justpaste.it/4kwd2
Some direct logs captured:
Logs from another pod's console, where piraeus-op-cs is a service name:
Some logs from /var/log/messages:
sysctl -p output:
Possible Solution
As a workaround, I'm currently using
kubectl delete pod --all --all-namespaces --force
or restarting the Calico pods alone. Looking for suggestions on this.
Steps to Reproduce (for bugs)
Context
This has added extra steps to recover the setup by force-killing all pods.
Your Environment