
Failed to create pod sandbox: rpc error: error getting ClusterInformation: connection is unauthorized: Unauthorized #8379

Closed
eliassal opened this issue Dec 28, 2023 · 31 comments · Fixed by #8563

Comments

@eliassal

I have K8S up and running and am able to deploy and run different pods/containers. Today, I tried to deploy MySQL to it with a PVC and PV.
After deploying, the container gets stuck in "ContainerCreating" status, then gets terminated and recreated.

When I describe the pod, I see this:

Events:
  Type     Reason                  Age               From               Message
  ----     ------                  ----              ----               -------
  Normal   Scheduled               61s               default-scheduler  Successfully assigned default/mysql-74799d694c-j4mcr to chef-u16desk
  Warning  FailedCreatePodSandBox  60s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d9f4c91394d548cca1b189e665fd0532f158bb5bb4407153aa48e1af40afe2f0": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
  Normal   SandboxChanged          5s (x5 over 60s)  kubelet            Pod sandbox changed, it will be killed and re-created.

Expected Behavior

Pod should run with persistent volume

Current Behavior

Pod gets stuck in "ContainerCreating" status

Context

Enclosed are the YAML files for the PV, PVC, and MySQL deployment:
mysql-storage.txt
mysql-deployment.txt

Your Environment

Calico version: as indicated above, I used https://raw.githubusercontent.com/projectcalico/calico/master/manifests/calico.yaml
Orchestrator version: Kubernetes 1.26
Operating System and version: Ubuntu 22.04

@caseydavenport
Member

I'd recommend using manifests from an official release, as it's possible that master is unstable for some reason. v3.27.0 is the most recent release right now.

error getting ClusterInformation: connection is unauthorized: Unauthorized

This error suggests that the calico-cni-plugin serviceaccount doesn't have permission to get ClusterInformations. Can you share the output of this command?

kubectl get clusterrole calico-cni-plugin -o yaml
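A quick way to sanity-check output like that mechanically is a plain-text scan for the resource and verb. This is only a sketch, not a real RBAC evaluation; the authoritative check would be `kubectl auth can-i get clusterinformations.crd.projectcalico.org --as=system:serviceaccount:<ns>:calico-cni-plugin`, where `<ns>` is the namespace of the ServiceAccount for your install method.

```shell
# Sketch: scan a ClusterRole YAML dump (on stdin) for "get" on
# clusterinformations. Plain text matching only, not an RBAC evaluation;
# feed it e.g.: kubectl get clusterrole calico-cni-plugin -o yaml
role_grants_get_clusterinformations() {
  yaml=$(cat)
  printf '%s\n' "$yaml" | grep -q -- '- clusterinformations' &&
    printf '%s\n' "$yaml" | grep -q -- '- get'
}
```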

@eliassal
Author

Here is the output of the command:
~/Projects/DeployMySQL-OnKubernetes$ kubectl get clusterrole calico-cni-plugin -o yaml

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"annotations":{},"name":"calico-cni-plugin"},"rules":[{"apiGroups":[""],"resources":["pods","nodes","namespaces"],"verbs":["get"]},{"apiGroups":[""],"resources":["pods/status"],"verbs":["patch"]},{"apiGroups":["crd.projectcalico.org"],"resources":["blockaffinities","ipamblocks","ipamhandles","clusterinformations","ippools","ipreservations","ipamconfigs"],"verbs":["get","list","create","update","delete"]}]}
  creationTimestamp: "2023-04-14T14:53:00Z"
  name: calico-cni-plugin
  resourceVersion: "766"
  uid: e328e973-e60a-4d13-96b0-901df58d1ccc
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - nodes
  - namespaces
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - patch
- apiGroups:
  - crd.projectcalico.org
  resources:
  - blockaffinities
  - ipamblocks
  - ipamhandles
  - clusterinformations
  - ippools
  - ipreservations
  - ipamconfigs
  verbs:
  - get
  - list
  - create
  - update
  - delete

@caseydavenport
Member

Seems like the CNI plugin has permissions to get ClusterInformation, which suggests this isn't an RBAC issue as much as a more general authorization issue.

This thread includes a number of potential reasons why this might happen: #5712

Including:

  • NTP synchronization issues
  • Expired certificates being given to the CNI plugin

How old is this cluster by the way?

@caseydavenport
Member

One thing that might be useful to check is if restarting calico/node on the affected node improves things at all.

@eliassal
Author

@caseydavenport, if you mean NTP synchronization issues between the master and the node, they show exactly the same time, no difference.
The cluster is 1 year old; it was set up in January of this year.

Regarding "Expired certificates being given to the CNI plugin": how can I check for this and renew them?

@eliassal
Author

@caseydavenport YES, I rebooted the VM and the pod was created successfully, and I was able to access the MySQL container. Can you please tell me what the root cause of this issue could be that kept Calico from functioning correctly?
Thanks again for your help

@caseydavenport
Member

Restarting the node suggests there was some temporary state in place that had expired and was refreshed on reboot. The most likely thing would be the CNI plugin's bearer token.

What version of Calico do you have installed?

e.g.,

kubectl get clusterinformations -o yaml

Newer versions of Calico should automatically update the token to prevent cases like this, starting in v3.24 it seems: #5910

However, you would need to be on v3.24 or greater and also have properly updated manifests that volume mount the necessary CNI configuration directory into calico/node so that it can provide refreshed tokens to the CNI plugin. Otherwise, I think the tokens expire after about a year.
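One way to verify that theory on an affected node: the CNI plugin's token is a JWT whose payload carries an `exp` claim (seconds since the epoch). The sketch below decodes it with standard tools; the kubeconfig path and the `token:` extraction in the usage comment assume a default manifest install and are illustrative.

```shell
# Sketch: print the expiry (epoch seconds) of a service-account JWT.
# A JWT is header.payload.signature; the payload is base64url-encoded JSON.
decode_jwt_exp() {
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops '=' padding; restore it to a multiple of 4 chars
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d | sed -n 's/.*"exp":\([0-9]*\).*/\1/p'
}

# Usage on the node (path assumed from a manifest install):
#   token=$(sed -n 's/^ *token: *//p' /etc/cni/net.d/calico-kubeconfig)
#   date -d @"$(decode_jwt_exp "$token")"
```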

@eliassal
Author

Hi, here is the output of the command

kubectl get clusterinformations -o yaml

Calico is v3.26

apiVersion: v1
items:
- apiVersion: crd.projectcalico.org/v1
  kind: ClusterInformation
  metadata:
    annotations:
      projectcalico.org/metadata: '{"uid":"fc5f4dde-71ef-4937-9430-46a80ed38299","creationTimestamp":"2023-04-14T15:04:55Z"}'
    creationTimestamp: "2023-04-14T15:04:55Z"
    generation: 1
    name: default
    resourceVersion: "1443"
    uid: 51fdcd2c-eb0b-40c8-834c-8986d0bdd420
  spec:
    calicoVersion: v3.26.0-0.dev-403-gf8c46d4273ba
    clusterGUID: 6ddd81728f22472096a3e0c64e1ba716
    clusterType: k8s,bgp,kubeadm,kdd
    datastoreReady: true
kind: List
metadata:
  resourceVersion: ""

@caseydavenport
Member

3.26.0-0.dev-403-gf8c46d4273ba

Interesting, looks like a dev build is being run rather than a production release?

@eliassal
Author

OK, so what should I do? Should I switch to the production stable release? If yes, how? Thanks

@matthewdupre
Member

@eliassal I'm curious how you ended up installing the newest code from GitHub (back in ~April) rather than a production release - do you remember where you started installing from?

Generally speaking, everyone should stick with a stable release unless you're testing something that hasn't been released yet. https://docs.tigera.io/calico/latest/about/ has the docs for the current release (v3.27.0)

@eliassal
Author

eliassal commented Jan 2, 2024

Thanks @matthewdupre, but the link you provided does not indicate how to upgrade, or whether there is any chance of breaking the current config.
I don't remember exactly how, but I followed instructions from one of the courses on Cloud Guru or Pluralsight.

@caseydavenport
Member

There are upgrade docs in the side bar: https://docs.tigera.io/calico/latest/operations/upgrading/kubernetes-upgrade

@eliassal I'm afraid I can't guarantee you won't break your config - you're running an unreleased / unsupported version of Calico.

@eliassal
Author

eliassal commented Jan 3, 2024

@caseydavenport OK, I will go through the upgrade doc, but tell me, what is calicoctl? I don't have this tool on my cluster.

@caseydavenport
Member

@eliassal you can ignore the section about Host Endpoints - that's only for upgrades from versions older than v3.14.

You can read about calicoctl in the documentation; it's a CLI tool.

@davhdavh
Contributor

I had the same problem after upgrading to 3.27.0, but a complete restart of Calico solved it:

kubectl delete pods --all -n calico-system --force

@davhdavh
Contributor

Nope, it seems 3.27 is totally fubar. I've had to restart Calico 4 times already.

@mazdakn
Member

mazdakn commented Jan 30, 2024

@davhdavh what's the error you get in your cluster?
If it's the same access error mentioned in the description of this issue, then what's the output of this command:

kubectl get clusterrole calico-cni-plugin -o yaml

@caseydavenport
Member

I believe the "Unauthorized" error message to be distinct from the typical RBAC error. IIUC, if this was an RBAC issue, we'd see additional context along the lines of this:

system:serviceaccount:calico-system:calico-cni-plugin is unable to get clusterinformations at cluster scope

(or similar, writing it out from memory)

I believe the simple "Unauthorized" means that there is something more fundamental going on - i.e., the certificates in-use have expired or perhaps the token itself has expired.
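The two failure modes can be told apart mechanically from the message text alone. A small sketch using the messages quoted in this thread (the categories and the helper name are mine, not Calico's):

```shell
# Sketch: classify a CNI error message. RBAC denials name the subject and say
# it is forbidden / unable to act; a bare "Unauthorized" means authentication
# itself failed (e.g. an expired token or certificate).
classify_auth_error() {
  case "$1" in
    *"is unable to"*|*"is forbidden"*) echo rbac ;;
    *Unauthorized*) echo authn ;;
    *) echo other ;;
  esac
}
```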

@caseydavenport
Member

Another issue with the same symptom: #7171

Relevant bit:

When the API server token/certificate gets rotated, Calico keeps trying to authenticate using its current token, which is invalid since the API server token was rotated. Because of this, Calico fails to authenticate with the API server, which results in failing to add networking to the pod.

One thing to check here would be the calico/node pod logs from the affected node - does it contain any logs indicating that it has successfully (or unsuccessfully) refreshed the CNI plugin token? You'll want to look for logs from token_watch.go

@davhdavh
Contributor

davhdavh commented Feb 1, 2024

It is installed via Helm with pretty basic settings, except for using the new Windows setup that 3.27 brings.
And kubectl get clusterrole calico-cni-plugin -o yaml returns the same as the above.
Every single time, it is the Windows nodes that break, i.e., if I restart 'calico-node-windows-xxxx' it will work again.
(screenshot of the error events attached)
The last log entry for calico-node-windows is from yesterday evening, so nothing.
Killed calico-node-windows, and 1 min later:
(screenshot attached)

@caseydavenport
Member

Aha, yes that's important context if it's only happening on Windows nodes. Likely a bug in how the token refresh works on Windows nodes (or perhaps isn't being enabled on Windows nodes?). CC @coutinhop

@davhdavh
Contributor

davhdavh commented Feb 5, 2024

Any workaround? I'm pretty tired of the clusters being half broken every morning.

@coutinhop
Member

@davhdavh if I understood you correctly, you're now using the Windows operator install that came out in v3.27.0, right? Could you set LogSeverityScreen to debug in the default FelixConfiguration and provide logs for the Windows pods (ideally all of them: uninstall-calico, install-cni, node, felix)?

  • Anything in particular to look out for when trying to reproduce your issue?
  • If there is something broken with the token refresh mechanism, I'd assume this happens after a set period of time that the cluster is running, is that correct?
  • How many Linux and Windows nodes do you have in your cluster?
  • Are you using VXLAN?
  • What version of kubernetes are you using?
  • What version of containerd in the Windows nodes?

@davhdavh
Contributor

davhdavh commented Feb 6, 2024

if I understood you correctly, you're now using the Windows operator install that came out in v3.27.0, right?

Yes. We were using the manual host-process setup in 3.26, so it really shouldn't be a very big change.

Could you set LogSeverityScreen to debug in the default FelixConfiguration and provide logs for the Windows pods (ideally all of them: uninstall-calico, install-cni, node, felix).

sure, will send next time it is stuck.

Anything in particular to look out for when trying to reproduce your issue?

No, we should be running with the most basic setup there is that includes windows.

If there is something broken with the token refresh mechanism, I'd assume this happens after a set period of time that the cluster is running, is that correct?

Yes, but it is long enough that I haven't figured out the timing yet.

How many Linux and Windows nodes do you have in your cluster?

Happens on both our dev cluster (1 main Linux worker + control-plane, 2 micro control-planes, and 1 Windows node)
and our preprod cluster (3 main Linux workers + control-plane and 2 Windows nodes). Those are the only clusters we have upgraded to 3.27 so far.

Are you using VXLAN?

Yes.

tigera-operator:
  enabled: true
  installation:
    serviceCIDRs:
    - 10.96.0.0/12
    calicoNetwork:
      windowsDataplane: HNS
      # enable iptable port forwarding
      containerIPForwarding: Enabled
      bgp: Disabled
      # Note: The ipPools section cannot be modified post-install.
      ipPools:
      - blockSize: 26
        cidr: 10.168.0.0/16
        disableBGPExport: false
        encapsulation: VXLAN
        natOutgoing: Enabled
        nodeSelector: all()

That, plus kubernetes-services-endpoint and the kube-proxy-windows daemonset, is our entire config.

What version of kubernetes are you using?

v1.29.0

What version of containerd in the Windows nodes?

1.7.2 for dev and 1.7.11 for preprod.

@davhdavh
Contributor

davhdavh commented Feb 7, 2024

@coutinhop
Detected the problem at 01:39:32 (log time).
I have very few pods starting on Windows around that time, and it is only a problem on start and terminate; no problem for pods that keep running.
So it was probably the update at 01:33:39.297 or 01:33:43.858 that was the cause.

Here are the logs...
calico-node-windows.zip

@davhdavh
Contributor

@coutinhop any workarounds? It is getting quite annoying to have to fix this manually every single day.

@davhdavh
Contributor

Here is a small workaround script to monitor the problem and kill the pods:

# Watch cluster events for the ClusterInformation auth error; whenever it
# appears, force-delete the calico-node-windows pods so they restart.
while true; do
  kubectl get events --all-namespaces -o json --watch --watch-only | \
  jq --unbuffered 'select(.message | test(".*error getting ClusterInformation.*")) | .reportingInstance' | \
  while read line; do
    kubectl -n calico-system get pods --selector app.kubernetes.io/name=calico-node-windows
    kubectl -n calico-system delete pod --selector app.kubernetes.io/name=calico-node-windows --force
    date
  done
done

@coutinhop
Member

@davhdavh sorry for the delay! While I could not find anything relevant in the logs you provided, that led me to look into the exact reason why I couldn't find any token refresher messages in the logs, and it turns out it doesn't run on Windows 😢
It is currently only invoked in the runit service scripts, which are not used by Calico for Windows:

exec calico-node -monitor-token

I'll get started right away on working that into the Windows scripts...

In the meantime, I'm glad you found a workaround. I'll keep you posted on a fix...

@eliassal
Author

@caseydavenport @matthewdupre Hi, I decided to reinstall K8s on a fresh Ubuntu machine. I am a little bit confused about the instructions at https://docs.tigera.io/calico/latest/getting-started/kubernetes/quickstart
I need the pod network to be 192.168.200.0; should I run step 1 and step 2?
Second, it is indicated in the beginning that we should run
sudo kubeadm init --pod-network-cidr=192.168......
I have already run kubeadm as follows
sudo kubeadm init --control-plane-endpoint=kubernetes --upload-certs
and it was successful. So should I run the init again with the pod network 192.168.200.0, or download the manifests in steps 1 and 2, update them, and then apply them?
Thanks for your help

@caseydavenport
Member

@eliassal please open a new issue - sounds unrelated to the original problem here and best to keep separate concerns separated for anyone looking in the future.
