Expected Behavior
calico-node is able to refresh the serviceaccount token used by the Calico CNI plugin.
Current Behavior
AKS had a customer report the following errors from Calico CNI: "error getting ClusterInformation: connection is unauthorized: Unauthorized". The errors occurred consistently across multiple nodes for about 30 hours, then resolved without intervention. The cluster had 8 nodes at the time of the incident. Meanwhile, calico-node was logging failures to refresh the CNI token:
2023-05-06 05:25:29.092 [ERROR][87] cni-config-monitor/token_watch.go 131: Failed to update CNI token, retrying... error=the server was unable to return a response in the time allotted, but may still be processing the request (post serviceaccounts calico-node)
Upon investigation, we discovered that:
- all of the 429 responses were coming from a single apiserver pod (out of the six running)
- the apiserver was using API Priority and Fairness and classifying the requests from calico-node as workload-low
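For what it's worth, in a default API Priority and Fairness configuration the catch-all service-accounts FlowSchema routes requests from serviceaccounts (any not matched by an earlier, more specific schema) to the workload-low priority level, which is presumably what caught calico-node here. This is easy to confirm with kubectl:

```sh
# List all FlowSchemas together with the priority level each one maps to
kubectl get flowschemas

# Inspect the default catch-all that routes serviceaccount traffic to workload-low
kubectl get flowschema service-accounts -o yaml
```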
Possible Solution
I suspect this might be a new failure mode introduced by #5910. In particular, when calico-node instances happen to connect to an overloaded apiserver replica and the CNI token expires, the apiserver may throttle the requests to create a new serviceaccount token. This prevents calico-node from refreshing the token, causing CNI failures.
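For illustration, here's a minimal sketch of the kind of TokenRequest call involved (this is not Calico's actual token_watch.go code; the namespace, serviceaccount name, and TTL are assumptions). It's this POST to the serviceaccounts/token subresource that the apiserver was answering with 429s:

```go
package main

import (
	"context"
	"fmt"
	"time"

	authv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// refreshToken asks the apiserver for a fresh serviceaccount token via the
// TokenRequest API. If APF throttles this request (HTTP 429), the caller is
// left holding whatever token it already has until a retry succeeds.
func refreshToken(ctx context.Context, cs kubernetes.Interface) (string, error) {
	ttl := int64((24 * time.Hour).Seconds()) // assumed validity period
	tr, err := cs.CoreV1().ServiceAccounts("kube-system").CreateToken( // assumed namespace
		ctx,
		"calico-node", // assumed serviceaccount name
		&authv1.TokenRequest{
			Spec: authv1.TokenRequestSpec{ExpirationSeconds: &ttl},
		},
		metav1.CreateOptions{},
	)
	if err != nil {
		return "", fmt.Errorf("failed to create serviceaccount token: %w", err)
	}
	return tr.Status.Token, nil
}

func main() {
	cfg, err := rest.InClusterConfig() // running inside the cluster
	if err != nil {
		panic(err)
	}
	token, err := refreshToken(context.Background(), kubernetes.NewForConfigOrDie(cfg))
	if err != nil {
		panic(err)
	}
	fmt.Printf("refreshed token (%d bytes)\n", len(token))
}
```

If the old token expires before a retry succeeds, every CNI call fails with the Unauthorized errors shown above.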
Steps to Reproduce (for bugs)
I'm not sure how to reproduce this issue. We had a customer report the problem, and it resolved after ~30 hours without intervention.
Context
We had one customer report this issue running AKS-managed Calico.
Your Environment
Calico version: v3.24.0
Orchestrator version: Kubernetes 1.24.9
Operating System and version: Linux (Ubuntu 18.04)
Sounds like one option here might be to configure FlowSchemas for Calico so that it's not lumped into the "workload-low" category, which is obviously not quite correct for a critical infrastructure component.
I'm not sure we can ship one of those by default in Calico, as it probably will vary by cluster configuration, but perhaps it's something we should add to our documentation.
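For the docs, it could look something like the following (a sketch only, not a shipped default; the priority level, precedence, and namespace would need validation per cluster). It routes calico-node's serviceaccount traffic to the built-in system priority level, ahead of the catch-all service-accounts FlowSchema:

```yaml
# Sketch: give calico-node's requests a higher priority level than the
# default service-accounts FlowSchema (workload-low) would assign them.
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: calico-node
spec:
  priorityLevelConfiguration:
    name: system            # reuse the built-in "system" priority level
  matchingPrecedence: 500   # evaluated well before the catch-all (9000)
  distinguisherMethod:
    type: ByUser            # all calico-node pods share one SA user, so one flow
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: calico-node
            namespace: kube-system   # adjust to where calico-node runs
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]
```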
As for code changes we might make, those are a bit less obvious. Maybe one of these?
- Increase the validity period of our tokens so that we're less susceptible to transient errors like this.
- Perhaps there is some timeout we can increase so that if the apiserver is burdened we wait a bit longer? (Rough sketch below.)
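On the second option, a hedged sketch of client-side backoff around the token refresh, building on the hypothetical refreshToken function above (uses wait from k8s.io/apimachinery/pkg/util/wait and apierrors from k8s.io/apimachinery/pkg/api/errors; all parameters are illustrative):

```go
// refreshTokenWithBackoff retries the refresh while the apiserver is
// throttling (429) or timing out, instead of failing immediately.
func refreshTokenWithBackoff(cs kubernetes.Interface) (string, error) {
	backoff := wait.Backoff{
		Duration: 1 * time.Second, // initial retry delay (illustrative)
		Factor:   2.0,             // double the delay each attempt
		Jitter:   0.1,             // avoid retry storms across nodes
		Steps:    6,               // give up after roughly a minute
	}
	var token string
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		t, err := refreshToken(context.Background(), cs)
		if err != nil {
			if apierrors.IsTooManyRequests(err) || apierrors.IsTimeout(err) {
				return false, nil // retryable: back off and try again
			}
			return false, err // anything else fails immediately
		}
		token = t
		return true, nil
	})
	return token, err
}
```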
> Sounds like one option here might be to configure FlowSchemas for Calico so that it's not lumped into the "workload-low" category, which is obviously not quite correct for a critical infrastructure component.
>
> I'm not sure we can ship one of those by default in Calico, as it probably will vary by cluster configuration, but perhaps it's something we should add to our documentation.

That sounds like a reasonable solution. Agree that the apiserver shouldn't be classifying requests from Calico as workload-low.

> As for code changes we might make, those are a bit less obvious. Maybe one of these?
>
> - Increase the validity period of our tokens so that we're less susceptible to transient errors like this.
> - Perhaps there is some timeout we can increase so that if the apiserver is burdened we wait a bit longer?