
calico-node cannot refresh expired serviceaccount token due to apiserver throttling #7694

Open
wedaly opened this issue May 24, 2023 · 2 comments

Comments

@wedaly
Contributor

wedaly commented May 24, 2023

Expected Behavior

calico-node is able to refresh the serviceaccount token used by calico CNI

Current Behavior

An AKS customer reported the following errors from Calico CNI: "error getting ClusterInformation: connection is unauthorized: Unauthorized". The errors occurred consistently across multiple nodes for about 30 hours, then resolved without intervention. The cluster had 8 nodes when the incident occurred.

Upon investigation, we discovered that:

  1. Requests from calico-node to the apiserver to create the serviceaccount token were failing with a 429 response from the apiserver. The errors in the calico-node logs looked like:
2023-05-06 05:25:29.092 [ERROR][87] cni-config-monitor/token_watch.go 131: Failed to update CNI token, retrying... error=the server was unable to return a response in the time allotted, but may still be processing the request (post serviceaccounts calico-node)

  2. All of the 429 responses were coming from a single apiserver pod (out of the six running).

  3. The apiserver was using API Priority and Fairness and classifying the requests from calico-node as workload-low.

Possible Solution

I suspect this might be a new failure mode introduced by #5910. In particular, when calico-node instances happen to connect to an overloaded apiserver replica and the CNI token expires, apiserver may throttle the requests to create a new serviceaccount token. This prevents the calico-node from refreshing the token, causing CNI failures.
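
To make the failure mode concrete, here's a simplified sketch of the kind of call that's failing (not Calico's actual token_watch.go code; the namespace, token lifetime, and retry values are placeholders). The POST to the serviceaccounts token subresource is the request that APF was classifying as workload-low:

```go
// Simplified sketch only -- not Calico's token_watch.go implementation.
// calico-node periodically mints a fresh serviceaccount token for the CNI
// plugin via the TokenRequest subresource; if the apiserver throttles that
// request for long enough, the CNI plugin is left with an expired token.
package main

import (
	"context"
	"log"
	"time"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func refreshCNIToken(ctx context.Context, cs *kubernetes.Clientset) (string, error) {
	expiration := int64(24 * 60 * 60) // requested token lifetime in seconds (placeholder)
	req := &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &expiration},
	}

	backoff := time.Second
	for {
		// "post serviceaccounts calico-node" in the log above corresponds to this
		// call; the namespace here is a placeholder and varies by install.
		tr, err := cs.CoreV1().ServiceAccounts("calico-system").
			CreateToken(ctx, "calico-node", req, metav1.CreateOptions{})
		if err == nil {
			return tr.Status.Token, nil
		}
		log.Printf("Failed to update CNI token, retrying... error=%v", err)

		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(backoff):
			if backoff < 64*time.Second {
				backoff *= 2 // back off further between attempts
			}
		}
	}
}
```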

Steps to Reproduce (for bugs)

I'm not sure how to reproduce this issue. We had a customer report the problem, and it resolved after ~30 hours without intervention.

Context

We had one customer report this issue running AKS-managed Calico.

Your Environment

  • Calico version v3.24.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.24.9
  • Operating System and version: Linux (Ubuntu 18.04)
@caseydavenport
Member

Oooh, fun.

Sounds like one option here might be to configure FlowSchemas for Calico so that it's not lumped into the "workload-low" category, which is obviously not quite correct for a critical infrastructure component.

I'm not sure we can ship one of those by default in Calico, as it probably will vary by cluster configuration, but perhaps it's something we should add to our documentation.
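
For illustration, something along these lines could route calico-node's requests to a higher built-in priority level. This is only a sketch: the service account namespace, the chosen priority level, and the matching precedence are assumptions that would need adjusting per cluster, not a manifest we ship today.

```go
// Sketch of a FlowSchema that matches requests from the calico-node service
// account and routes them to the built-in "workload-high" priority level
// instead of "workload-low". Names, namespace, and precedence are
// assumptions, not a shipped Calico default.
package main

import (
	"context"

	flowcontrolv1beta2 "k8s.io/api/flowcontrol/v1beta2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createCalicoFlowSchema(ctx context.Context, cs *kubernetes.Clientset) error {
	fs := &flowcontrolv1beta2.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "calico-node"},
		Spec: flowcontrolv1beta2.FlowSchemaSpec{
			PriorityLevelConfiguration: flowcontrolv1beta2.PriorityLevelConfigurationReference{
				Name: "workload-high", // or a dedicated PriorityLevelConfiguration
			},
			// Must sort before the built-in schema that maps service accounts to workload-low.
			MatchingPrecedence: 500,
			DistinguisherMethod: &flowcontrolv1beta2.FlowDistinguisherMethod{
				Type: flowcontrolv1beta2.FlowDistinguisherMethodByUserType,
			},
			Rules: []flowcontrolv1beta2.PolicyRulesWithSubjects{{
				Subjects: []flowcontrolv1beta2.Subject{{
					Kind: flowcontrolv1beta2.SubjectKindServiceAccount,
					ServiceAccount: &flowcontrolv1beta2.ServiceAccountSubject{
						Namespace: "calico-system", // varies by install (e.g. kube-system)
						Name:      "calico-node",
					},
				}},
				ResourceRules: []flowcontrolv1beta2.ResourcePolicyRule{{
					Verbs:        []string{flowcontrolv1beta2.VerbAll},
					APIGroups:    []string{flowcontrolv1beta2.APIGroupAll},
					Resources:    []string{flowcontrolv1beta2.ResourceAll},
					ClusterScope: true,
					Namespaces:   []string{flowcontrolv1beta2.NamespaceEvery},
				}},
			}},
		},
	}
	_, err := cs.FlowcontrolV1beta2().FlowSchemas().Create(ctx, fs, metav1.CreateOptions{})
	return err
}
```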

As for code changes we might make, those are a bit less obvious. Maybe one of these?

  • Increase the validity period of our tokens so that we're less susceptible to transient errors like this.
  • Perhaps there is some timeout we can increase so that if the apiserver is burdened we wait a bit longer?

@wedaly
Contributor Author

wedaly commented May 25, 2023

> Sounds like one option here might be to configure FlowSchemas for Calico so that it's not lumped into the "workload-low" category, which is obviously not quite correct for a critical infrastructure component.
>
> I'm not sure we can ship one of those by default in Calico, as it probably will vary by cluster configuration, but perhaps it's something we should add to our documentation.

That sounds like a reasonable solution. Agree that apiserver shouldn't be classifying requests from Calico as workload-low.

> As for code changes we might make, those are a bit less obvious. Maybe one of these?
>
>   • Increase the validity period of our tokens so that we're less susceptible to transient errors like this.
>   • Perhaps there is some timeout we can increase so that if the apiserver is burdened we wait a bit longer?

There are some env vars in client-go that enable exponential backoff: https://github.com/kubernetes/kubernetes/blob/3d27dee047a87527735bf74cfcc6b8ff8875f66c/staging/src/k8s.io/client-go/rest/client.go#L36-L37. I'm not completely sure they would have helped in this case, but they might be worth exploring.
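
For reference, a minimal sketch of how those variables could be wired in from Go before the clientset is built (in practice they would more likely be set in the calico-node container env; the values are the ones suggested in the client-go comments, and I haven't confirmed this backoff would have kicked in for the 429s we saw):

```go
// Sketch: client-go's REST client reads these two environment variables and,
// when both are set, applies exponential backoff to hosts that have recently
// returned errors. Values below are illustrative only.
package main

import (
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newClientsetWithBackoff() (*kubernetes.Clientset, error) {
	os.Setenv("KUBE_CLIENT_BACKOFF_BASE", "1")        // base backoff, in seconds
	os.Setenv("KUBE_CLIENT_BACKOFF_DURATION", "120")  // maximum backoff, in seconds

	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}
```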
