
calico-node cannot refresh expired serviceaccount token due to apiserver throttling #7694

Open
wedaly opened this issue May 24, 2023 · 2 comments

Comments

@wedaly
Contributor

wedaly commented May 24, 2023

Expected Behavior

calico-node is able to refresh the serviceaccount token used by calico CNI

Current Behavior

An AKS customer reported the following errors from Calico CNI: "error getting ClusterInformation: connection is unauthorized: Unauthorized". The errors occurred consistently across multiple nodes for about 30 hours, then resolved without intervention. The cluster had 8 nodes when the incident occurred.

Upon investigation, we discovered that:

  1. Requests from calico-node to the apiserver to create the serviceaccount token were failing with a 429 response from the apiserver. The errors in the calico-node logs looked like:
2023-05-06 05:25:29.092 [ERROR][87] cni-config-monitor/token_watch.go 131: Failed to update CNI token, retrying... error=the server was unable to return a response in the time allotted, but may still be processing the request (post serviceaccounts calico-node)

  2. All of the 429 responses were coming from a single apiserver pod (out of the six running).

  3. The apiserver was using API Priority and Fairness and classifying the requests from calico-node as workload-low.

Possible Solution

I suspect this might be a new failure mode introduced by #5910. In particular, when calico-node instances happen to connect to an overloaded apiserver replica and the CNI token expires, apiserver may throttle the requests to create a new serviceaccount token. This prevents the calico-node from refreshing the token, causing CNI failures.
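
To make the failure mode concrete, here's a simplified sketch of the kind of call that's failing (not Calico's actual token_watch.go code; the namespace, token lifetime, and retry values are placeholders). The POST to the serviceaccounts token subresource is the request that APF was classifying as workload-low:

```go
// Simplified sketch only -- not Calico's token_watch.go implementation.
// calico-node periodically mints a fresh serviceaccount token for the CNI
// plugin via the TokenRequest subresource; if the apiserver throttles that
// request for long enough, the CNI plugin is left with an expired token.
package main

import (
	"context"
	"log"
	"time"

	authenticationv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func refreshCNIToken(ctx context.Context, cs *kubernetes.Clientset) (string, error) {
	expiration := int64(24 * 60 * 60) // requested token lifetime in seconds (placeholder)
	req := &authenticationv1.TokenRequest{
		Spec: authenticationv1.TokenRequestSpec{ExpirationSeconds: &expiration},
	}

	backoff := time.Second
	for {
		// "post serviceaccounts calico-node" in the log above corresponds to this
		// call; the namespace here is a placeholder and varies by install.
		tr, err := cs.CoreV1().ServiceAccounts("calico-system").
			CreateToken(ctx, "calico-node", req, metav1.CreateOptions{})
		if err == nil {
			return tr.Status.Token, nil
		}
		log.Printf("Failed to update CNI token, retrying... error=%v", err)

		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(backoff):
			if backoff < 64*time.Second {
				backoff *= 2 // back off further between attempts
			}
		}
	}
}
```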

Steps to Reproduce (for bugs)

I'm not sure how to reproduce this issue. We had a customer report the problem, and it resolved after ~30 hours without intervention.

Context

We had one customer report this issue running AKS-managed Calico.

Your Environment

  • Calico version v3.24.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.24.9
  • Operating System and version: Linux (Ubuntu 18.04)
@caseydavenport
Member

Oooh, fun.

Sounds like one option here might be to configure FlowSchemas for Calico so that it's not lumped into the "workload-low" category, which is obviously not quite correct for a critical infrastructure component.

I'm not sure we can ship one of those by default in Calico, as it probably will vary by cluster configuration, but perhaps it's something we should add to our documentation.
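
For illustration, something along these lines could route calico-node's requests to a higher built-in priority level. This is only a sketch: the service account namespace, the chosen priority level, and the matching precedence are assumptions that would need adjusting per cluster, not a manifest we ship today.

```go
// Sketch of a FlowSchema that matches requests from the calico-node service
// account and routes them to the built-in "workload-high" priority level
// instead of "workload-low". Names, namespace, and precedence are
// assumptions, not a shipped Calico default.
package main

import (
	"context"

	flowcontrolv1beta2 "k8s.io/api/flowcontrol/v1beta2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createCalicoFlowSchema(ctx context.Context, cs *kubernetes.Clientset) error {
	fs := &flowcontrolv1beta2.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "calico-node"},
		Spec: flowcontrolv1beta2.FlowSchemaSpec{
			PriorityLevelConfiguration: flowcontrolv1beta2.PriorityLevelConfigurationReference{
				Name: "workload-high", // or a dedicated PriorityLevelConfiguration
			},
			// Must sort before the built-in schema that maps service accounts to workload-low.
			MatchingPrecedence: 500,
			DistinguisherMethod: &flowcontrolv1beta2.FlowDistinguisherMethod{
				Type: flowcontrolv1beta2.FlowDistinguisherMethodByUserType,
			},
			Rules: []flowcontrolv1beta2.PolicyRulesWithSubjects{{
				Subjects: []flowcontrolv1beta2.Subject{{
					Kind: flowcontrolv1beta2.SubjectKindServiceAccount,
					ServiceAccount: &flowcontrolv1beta2.ServiceAccountSubject{
						Namespace: "calico-system", // varies by install (e.g. kube-system)
						Name:      "calico-node",
					},
				}},
				ResourceRules: []flowcontrolv1beta2.ResourcePolicyRule{{
					Verbs:        []string{flowcontrolv1beta2.VerbAll},
					APIGroups:    []string{flowcontrolv1beta2.APIGroupAll},
					Resources:    []string{flowcontrolv1beta2.ResourceAll},
					ClusterScope: true,
					Namespaces:   []string{flowcontrolv1beta2.NamespaceEvery},
				}},
			}},
		},
	}
	_, err := cs.FlowcontrolV1beta2().FlowSchemas().Create(ctx, fs, metav1.CreateOptions{})
	return err
}
```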

As for code changes we might make, those are a bit less obvious. Maybe one of these?

  • Increase the validity period of our tokens so that we're less susceptible to transient errors like this.
  • Perhaps there is some timeout we can increase so that if the apiserver is burdened we wait a bit longer?

@wedaly
Contributor Author

wedaly commented May 25, 2023

> Sounds like one option here might be to configure FlowSchemas for Calico so that it's not lumped into the "workload-low" category, which is obviously not quite correct for a critical infrastructure component.
>
> I'm not sure we can ship one of those by default in Calico, as it probably will vary by cluster configuration, but perhaps it's something we should add to our documentation.

That sounds like a reasonable solution. Agree that apiserver shouldn't be classifying requests from Calico as workload-low.

> As for code changes we might make, those are a bit less obvious. Maybe one of these?
>
>   • Increase the validity period of our tokens so that we're less susceptible to transient errors like this.
>   • Perhaps there is some timeout we can increase so that if the apiserver is burdened we wait a bit longer?

There are some env vars in client-go that enable exponential backoff: https://github.com/kubernetes/kubernetes/blob/3d27dee047a87527735bf74cfcc6b8ff8875f66c/staging/src/k8s.io/client-go/rest/client.go#L36-L37. I'm not completely sure they would have helped in this case, but they might be worth exploring.
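
For reference, a minimal sketch of how those variables could be wired in from Go before the clientset is built (in practice they would more likely be set in the calico-node container env; the values are the ones suggested in the client-go comments, and I haven't confirmed this backoff would have kicked in for the 429s we saw):

```go
// Sketch: client-go's REST client reads these two environment variables and,
// when both are set, applies exponential backoff to hosts that have recently
// returned errors. Values below are illustrative only.
package main

import (
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newClientsetWithBackoff() (*kubernetes.Clientset, error) {
	os.Setenv("KUBE_CLIENT_BACKOFF_BASE", "1")        // base backoff, in seconds
	os.Setenv("KUBE_CLIENT_BACKOFF_DURATION", "120")  // maximum backoff, in seconds

	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}
	return kubernetes.NewForConfig(cfg)
}
```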
