Token refresh miss causes rpc - error getting ClusterInformation connection is unauthorized: Unauthorized #8777

abasitt · 2024-04-30T16:51:16Z

We are facing below issue..... It happens very randomly, let's say one node in hundreds and the restart of the calico pod on that node will fix the problem.
Failed to create pod sandbox: rpc - error getting ClusterInformation connection is unauthorized: Unauthorized
#5712

Expected Behavior

Token_watch.go shouldn't miss the token refresh in the specified interval and if it misses it, it should at least generate some more errors.

Current Behavior

Rare but it keep occurring that on some worker node the token_watch.go will stop.

Possible Solution

fix the possible bug or add a liveness probe if the token refresh fails, it restart the calico-node pod.

Steps to Reproduce (for bugs)

Randomly happening and not able to reproduce the error so far.

Context

The difference that we noticed between working and non-working worker node is below. The logs are few days old but what we noticed, the watcher stop and it doesn't even share more details about failure.

docker logs ef870a87d78d | grep  "Update of CNI kubeconfig triggered based on elapsed time"
2024-04-18 19:16:06.166 [INFO][71] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-19 04:59:40.964 [INFO][71] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-19 15:50:02.894 [INFO][71] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-19 23:31:36.702 [INFO][71] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.

This token watch message stopped a few days back

On a working calico
2024-04-19 14:30:53.137 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-19 22:40:13.894 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-20 09:04:36.635 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-20 19:15:50.917 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-21 04:44:23.317 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-21 11:32:55.537 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-21 21:52:51.894 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-22 08:18:24.179 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-22 15:16:36.326 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-23 02:36:42.771 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-23 11:55:53.946 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.
2024-04-23 23:45:50.323 [INFO][99] cni-config-monitor/token_watch.go 225: Update of CNI kubeconfig triggered based on elapsed time.

Your Environment

Calico version: 3.24.3
Orchestrator version: kubernetes v1.26.0
Operating System and version: Rockey 8.6

The text was updated successfully, but these errors were encountered:

caseydavenport · 2024-05-01T22:51:35Z

@abasitt thanks for raising. I think the interesting thing to dig into here would be why the periodic update seems to just stop.

On the nodes encountering this issue, do you see the following log immediately after the "Update of CNI kubeconfig" log?

Wrote updated CNI kubeconfig file.

This would indicate that it is at least writing the file, and not getting stuck.

Are there any other logs being emitted from token_watch.go?

abasitt · 2024-05-02T06:02:54Z

@caseydavenport thank you for looking in to the issue, it seems to be getting stuck. See below. Let me know if you need any more details.

docker logs ef870a87d78d | grep -i Wrote
2024-04-18 19:16:04.964 [INFO][24] tunnel-ip-allocator/ipam.go 167: Wrote affinity as pending cidr=172.16.39.192/26 host="hypervvm-69-188.robinsystems.com"
2024-04-18 19:16:06.167 [INFO][71] cni-config-monitor/token_watch.go 279: Wrote updated CNI kubeconfig file. path="/host/etc/cni/net.d/calico-kubeconfig"
2024-04-19 04:59:40.964 [INFO][71] cni-config-monitor/token_watch.go 279: Wrote updated CNI kubeconfig file. path="/host/etc/cni/net.d/calico-kubeconfig"
2024-04-19 15:50:02.895 [INFO][71] cni-config-monitor/token_watch.go 279: Wrote updated CNI kubeconfig file. path="/host/etc/cni/net.d/calico-kubeconfig"
2024-04-19 23:31:36.703 [INFO][71] cni-config-monitor/token_watch.go 279: Wrote updated CNI kubeconfig file. path="/host/etc/cni/net.d/calico-kubeconfig"
[root@hypervvm-69-188 tmp]# date
Wed May  1 22:57:46 PDT 2024

slight correction about k8s version, its v1.26.0

caseydavenport · 2024-05-02T17:21:40Z

@abasitt if possible, a complete log as opposed to a filtered one might be helpful here to see what else is going on on that node.

abasitt · 2024-05-10T07:41:53Z

@caseydavenport sorry for late response. The pod issue was resolved and will share the full logs once I find another pod in this state.

caseydavenport · 2024-06-04T16:19:52Z

@abasitt any news on this issue?

abasitt · 2024-06-06T08:06:38Z

@caseydavenport still waiting to hear from field about the issue.

caseydavenport · 2024-07-24T17:27:14Z

@abasitt just checking in on this one to see if there's been anything new

abasitt · 2024-07-25T13:17:22Z

@caseydavenport my apologies for not providing any status update. We didn't see this issue in a while and also planning to upgrade all clusters to Calico 3.28. I will close this for now and if I see the issue again, I know what to get and will share the details. Thank you for the help.

MaiSao · 2024-11-14T06:46:11Z

@caseydavenport my apologies for not providing any status update. We didn't see this issue in a while and also planning to upgrade all clusters to Calico 3.28. I will close this for now and if I see the issue again, I know what to get and will share the details. Thank you for the help.

Any update in this issue. Did you see the issue again ?

abasitt · 2024-11-14T07:18:49Z

@MaiSao none of folks managing ours clusters reported it and I couldn't recreate it either. We have seen it twice but I couldn't get a hold on the broken node. Are you seeing it also ?

MaiSao · 2024-11-14T07:27:28Z

@MaiSao none of folks managing ours clusters reported it and I couldn't recreate it either. We have seen it twice but I couldn't get a hold on the broken node. Are you seeing it also ?

Our old k8s cluster, version 1.21, calico v3.18 sometimes happened. I see this issue resolved in version calico > v3.23 #6218

matthewdupre assigned caseydavenport Jun 18, 2024

matthewdupre added the kind/bug label Jun 18, 2024

caseydavenport added likelihood/low impact/high labels Jun 18, 2024

abasitt closed this as completed Jul 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Token refresh miss causes rpc - error getting ClusterInformation connection is unauthorized: Unauthorized #8777

Token refresh miss causes rpc - error getting ClusterInformation connection is unauthorized: Unauthorized #8777

abasitt commented Apr 30, 2024 •

edited

Loading

caseydavenport commented May 1, 2024

abasitt commented May 2, 2024 •

edited

Loading

caseydavenport commented May 2, 2024

abasitt commented May 10, 2024

caseydavenport commented Jun 4, 2024

abasitt commented Jun 6, 2024

caseydavenport commented Jul 24, 2024

abasitt commented Jul 25, 2024

MaiSao commented Nov 14, 2024

abasitt commented Nov 14, 2024

MaiSao commented Nov 14, 2024

Token refresh miss causes rpc - error getting ClusterInformation connection is unauthorized: Unauthorized #8777

Token refresh miss causes rpc - error getting ClusterInformation connection is unauthorized: Unauthorized #8777

Comments

abasitt commented Apr 30, 2024 • edited Loading

Expected Behavior

Current Behavior

Possible Solution

Steps to Reproduce (for bugs)

Context

Your Environment

caseydavenport commented May 1, 2024

abasitt commented May 2, 2024 • edited Loading

caseydavenport commented May 2, 2024

abasitt commented May 10, 2024

caseydavenport commented Jun 4, 2024

abasitt commented Jun 6, 2024

caseydavenport commented Jul 24, 2024

abasitt commented Jul 25, 2024

MaiSao commented Nov 14, 2024

abasitt commented Nov 14, 2024

MaiSao commented Nov 14, 2024

abasitt commented Apr 30, 2024 •

edited

Loading

abasitt commented May 2, 2024 •

edited

Loading