CNI plugin: error getting ClusterInformation: connection is unauthorized: Unauthorized #5712

Closed
earthlovepython opened this issue Mar 5, 2022 · 37 comments · Fixed by #5910

Comments

@earthlovepython

earthlovepython commented Mar 5, 2022

K8S & Calico information

HostOS: RHEL 8.2
K8S: on-premise cluster; version v1.21.1; "IPVS" mode; IPv4/IPv6 dual stack; installed using kubespray
Calico: version v3.18.4; non-BGP mode; "IPv6" DNAT enabled.
Our Docker image is built on top of "RHEL ubi:8".
We did not set up an external ETCD cluster.

"kubectl describe" output

[support@node-cont-1-qa conf]$ kubectl describe pod export-job-job-dp8hb
Name:           export-job-job-dp8hb
Namespace:      pio
Priority:       0
Node:           node-df1-1/10.0.156.180
Start Time:     Wed, 23 Feb 2022 05:57:18 -0800
Labels:         app.kubernetes.io/instance=export-job-job
                controller-uid=5d9f3e4b-e74c-4280-a3be-e31d37e92b84
                job-name=export-job-job
Annotations:    cni.projectcalico.org/podIP:
                cni.projectcalico.org/podIPs:
Status:         Pending
IP:
IPs:            <none>
Controlled By:  Job/export-job-job
Containers:
  export-job-job:
    Container ID:
    Image:         10.0.156.250:5000/img-admf:9.3.0.0B038
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      csh
    Args:
      -c
      source /TT9/configXcp.sh; lis_conf; python2 /etc/pio/APPL/XcdbBackup.py --exportdb --dir /var/tmp; sleep 300
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  512Mi
    Requests:
      cpu:        200m
      memory:     256Mi
    Environment:  <none>
    Mounts:
      /TT9/PIO/9.0.0/RUN/config/APPL/DBConMgr.cnfg from db-conf (rw,path="DBConMgr.cnfg")
      /TT9/PIO/9.0.0/RUN/config/feature_conf.json from feature-conf (rw,path="feature_conf.json")
      /TT9/PIO/9.0.0/RUN/license/license.json from license-conf (rw,path="license.json")
      /etc/pio/APPL/XcdbBackup.py from job-script (rw,path="XcdbBackup.py")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jh7lg (ro)
      /var/tmp from external-pv (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  job-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      export-job-script
    Optional:  false
  db-conf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  db-secret
    Optional:    false
  feature-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      feature
    Optional:  false
  license-conf:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      license
    Optional:  false
  external-pv:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  backup-pvc
    ReadOnly:   false
  kube-api-access-jh7lg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                 From               Message
  ----     ------                  ----                ----               -------
  Normal   Scheduled               52m                 default-scheduler  Successfully assigned pio/export-job-job-dp8hb to node-df1-1
  Warning  FailedCreatePodSandBox  52m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "e46d8d9df11ef97e7e1d8b38ced7efef32e1cb4bfb0aa85809cb3198464b6167" network for pod "export-job-job-dp8hb": networkPlugin cni failed to set up pod "export-job-job-dp8hb_pio" network: connection is unauthorized: Unauthorized, failed to clean up sandbox container "e46d8d9df11ef97e7e1d8b38ced7efef32e1cb4bfb0aa85809cb3198464b6167" network for pod "export-job-job-dp8hb": networkPlugin cni failed to teardown pod "export-job-job-dp8hb_pio" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]
  Normal   SandboxChanged          50m (x10 over 52m)  kubelet            Pod sandbox changed, it will be killed and re-created.

Expected Behavior

The POD should start successfully.

Steps to Reproduce

Sorry, the issue happened twice, on different K8S clusters in our lab, and I did not keep any logs.
I would like to know how to reproduce it too.

My initial thoughts (maybe wrong)

Since the "kubectl describe" output contains "connection is unauthorized", I searched the source code of K8S v1.21.1, and the K8S code does NOT contain that string. I then searched Calico v3.22 (I am using v3.18.4, but there should not be a big difference) and found that "connection is unauthorized" exists in "libcalico-go/lib/errors/errors.go". So it looks like the issue is caused by Calico. Next, I used "error getting ClusterInformation" as a keyword: it cannot be found in the K8S code, but it can be found in the Calico code. So I am confident that the issue is 100% related to Calico.

Because the "connection is unauthorized" error message comes from "type ErrorConnectionUnauthorized struct", and "ErrorConnectionUnauthorized" is related to interaction with ETCD, it looks like a communication issue between Calico and ETCD.

By the way, /var/log/calico/cni/ does NOT contain anything related to "etcd" from POD start/destroy during normal operation.

What I expect:

If possible, can you please tell me:
1). Which webpage describes the control/data flow between Calico and ETCD
2). Which log files Calico as a whole uses, and where they are located
3). Whether I missed any debug information

Thanks

@lmm
Contributor

lmm commented Mar 8, 2022

This looks similar to: #4857

@lmm
Contributor

lmm commented Mar 8, 2022

In that linked issue, one of the users reported that they did not see the issue with k8s 1.20. That might be worth trying if that k8s version is an option for you.

@earthlovepython
Author

Thanks to you all for the information.

We cannot replace K8S v1.21 with v1.20.

I will continue to debug and share my progress with you.

Thanks

@caseydavenport changed the title on Mar 22, 2022:
from: After K8S cluster run stable for weeks, K8S cannot start POD (randomly) because "error getting ClusterInformation: connection is unauthorized: Unauthorized"
to: CNI plugin: error getting ClusterInformation: connection is unauthorized: Unauthorized
@lwr20
Member

lwr20 commented Mar 22, 2022

k8s 1.21 added token cycling and pro-active token withdrawal when pods are deleted. As far as we can tell, there's something weird about how this interacts with Calico on some systems.

I'm attempting to reproduce, but no luck so far. Do you have any clues about how to trigger this behaviour?
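
One way to look for clues is to check how long a given service account token still has to live. The following is only a minimal sketch: it decodes the claims of a JWT without verifying the signature and reports the seconds left until the `exp` claim. The helper names (`decode_jwt_payload`, `seconds_until_expiry`, `make_dummy_token`) are hypothetical, and the dummy token is built purely for illustration. On a real pod the token would presumably be read from the `token` file under the `/var/run/secrets/kubernetes.io/serviceaccount` mount shown in the describe output above.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_payload(token: str) -> dict:
    """Return the claims of a JWT without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    # JWTs use unpadded base64url, so restore the padding before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> float:
    """Seconds until the `exp` claim; negative if the token has expired."""
    claims = decode_jwt_payload(token)
    current = time.time() if now is None else now
    return claims["exp"] - current


def make_dummy_token(claims: dict) -> str:
    """Build an unsigned demo token (hypothetical helper, not a real token)."""
    def b64(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return f"{b64({'alg': 'none'})}.{b64(claims)}.fake-signature"


if __name__ == "__main__":
    # Demo: a token that expires 3607 seconds from now.
    demo = make_dummy_token({"exp": int(time.time()) + 3607})
    print(f"token expires in {seconds_until_expiry(demo):.0f}s")
```

A negative result on a token that a component is still presenting would be consistent with the Unauthorized responses, although this thread does not confirm that as the only cause.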

@Chainsaw-does-brr

Chainsaw-does-brr commented May 11, 2022

clusterVersion: v1.23

Not sure how useful my comment will be, but I encountered this error when I accidentally rebooted one of the nodes in the cluster.
The full error is as follows:
error killing pod: failed to "KillPodSandbox" for "%some-guid%" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"%some-pod-id%\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"

The killing was triggered by a disk pressure event on the node, the reasons for which I'm not entirely sure of. I had lowered the imageGC thresholds a bit before, but from my understanding they shouldn't trigger disk pressure. Maybe I'm wrong.

PS: I also recall a similar situation with an API that constantly got evicted every couple of days (disk pressure), and its evicted pods were never cleaned up. I didn't really look into why the pods remained, but maybe they were also supposed to be cleaned up and never were because of this error.

@nebulakid

@earthlovepython: this may sound silly, but could you try restarting kubelet?

I had a little bit of fun with it today. Spending hours upgrading between different k8s and Calico versions, in order to reproduce the issue on a second cluster, led me nowhere. I then tried to activate debug logging and restarted kubelet, and suddenly all pods (in my case calico-kube-controller and coredns) just became ready...

Luckily I had a snapshot of all VMs of the "broken" cluster, so I could verify that, and I can confirm it works like a charm.

I guess networkPlugin cni failed to teardown pod, combined with a kubelet log message like "Error deleting pod from network", means that some kind of cleanup/garbage collection needs to run.
Restarting kubelet seems to trigger this.

@nebulakid

Found a way to reproduce this issue, at least on a "Calico the hard way" setup (haven't tested it with regular deployments):
kill all calico (kube-controller, typha, node) and coredns pods.

After this the cni logs also print these messages:

2022-05-16 13:05:37.905 [DEBUG][104919] client.go 30: Using datastore type 'kubernetes'
2022-05-16 13:05:37.906 [DEBUG][104919] k8s.go 210: Calico is configured to use calico-ipam
2022-05-16 13:05:37.906 [DEBUG][104919] k8s.go 628: Performing 'Get' for ClusterInformation(default)
2022-05-16 13:05:37.906 [DEBUG][104919] customresource.go 205: Get custom Kubernetes resource Key=ClusterInformation(default) Resource="ClusterInformations" Revision=""
2022-05-16 13:05:37.906 [DEBUG][104919] customresource.go 216: Get custom Kubernetes resource by name Key=ClusterInformation(default) Name="default" Namespace="" Resource="ClusterInformations" Revision=""
2022-05-16 13:05:37.912 [DEBUG][104919] customresource.go 224: Error getting resource Key=ClusterInformation(default) Name="default" Namespace="" Resource="ClusterInformations" Revision="" error=Unauthorized
2022-05-16 13:05:37.912 [ERROR][104919] plugin.go 518: Final result of CNI DEL was an error. error=error getting ClusterInformation: connection is unauthorized: Unauthorized

Restarting kubelet afterwards solves the issue again.

Running: k8s 1.21.15 / 1.22.9 and calico 3.21.5
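
The CNI log lines quoted above follow a regular layout (timestamp, level, pid, source file and line, message), so they can be filtered mechanically. Below is a small sketch that picks out the entries mentioning "Unauthorized". The regular expression is inferred only from the lines in this comment and may not cover every Calico log format; `find_unauthorized` is a hypothetical helper name. On a node, the input would presumably come from the files under /var/log/calico/cni/ mentioned earlier in the thread.

```python
import re
from typing import Dict, Iterable, List

# Layout assumed from the quoted lines:
#   <date> <time> [LEVEL][pid] <file> <line>: <message>
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+ \S+) \[(?P<level>\w+)\]\[(?P<pid>\d+)\] "
    r"(?P<source>\S+ \d+): (?P<message>.*)$"
)


def find_unauthorized(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return the parsed log entries whose message mentions 'Unauthorized'."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and "Unauthorized" in match.group("message"):
            hits.append(match.groupdict())
    return hits


# Sample input copied from the log excerpt above.
sample = [
    "2022-05-16 13:05:37.905 [DEBUG][104919] client.go 30: "
    "Using datastore type 'kubernetes'",
    "2022-05-16 13:05:37.912 [DEBUG][104919] customresource.go 224: "
    "Error getting resource Key=ClusterInformation(default) error=Unauthorized",
    "2022-05-16 13:05:37.912 [ERROR][104919] plugin.go 518: "
    "Final result of CNI DEL was an error. error=error getting "
    "ClusterInformation: connection is unauthorized: Unauthorized",
]

if __name__ == "__main__":
    for entry in find_unauthorized(sample):
        print(entry["timestamp"], entry["level"], entry["source"])
```

Grouping the hits by timestamp can help line them up with kubelet restarts or calico-node pod restarts when trying to reproduce the problem.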

@aberenshtein

When do you plan on releasing 3.23.2?
We have this error, and it is preventing us from performing a rolling update of the cluster.

@caseydavenport
Member

Aiming to cut that release this week.

@winkee01

winkee01 commented Jul 3, 2022

I'm encountering the same issue. How can I solve it?

k get pod --all-namespaces
NAMESPACE          NAME                                       READY   STATUS             RESTARTS        AGE
calico-apiserver   calico-apiserver-645c75cf84-ffrk9          1/1     Running            0               8m27s
calico-apiserver   calico-apiserver-645c75cf84-qs4vq          1/1     Running            0               8m27s
calico-system      calico-kube-controllers-59b7bbd897-d59ff   1/1     Running            0               14m
calico-system      calico-node-ngsh8                          1/1     Running            0               21m
calico-system      calico-typha-54b78d9586-4xf2v              1/1     Running            0               21m
kube-system        coredns-6d4b75cb6d-cxgj8                   0/1     CrashLoopBackOff   8 (4m37s ago)   45m
kube-system        coredns-6d4b75cb6d-nnmtb                   0/1     CrashLoopBackOff   8 (4m45s ago)   45m
kube-system        etcd-k8s-master                            1/1     Running            1               45m
kube-system        kube-apiserver-k8s-master                  1/1     Running            1               45m
kube-system        kube-controller-manager-k8s-master         1/1     Running            1               45m
kube-system        kube-proxy-8k6fp                           1/1     Running            0               45m
kube-system        kube-scheduler-k8s-master                  1/1     Running            1               45m
tigera-operator    tigera-operator-5dc8b759d9-dsxcf           1/1     Running            0               21m

Examining coredns-6d4b75cb6d-cxgj8 gives me this error:

Warning  FailedCreatePodSandBox  19m (x17 over 23m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a2dac06f53fa4d3b7b425592ce34cb21af5cd082edada3c1b77e56aefa2f7fa1": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
  Warning  BackOff                 7s (x93 over 17m)   kubelet            Back-off restarting failed container

Please help

@caseydavenport
Member

@winkee01 what version of Calico are you using?

@winkee01

winkee01 commented Jul 6, 2022

I installed the latest version of Calico, and Kubernetes is 1.24.2.

@caseydavenport
Member

@winkee01 I'd recommend opening a new issue and filling it out with the exact version of Calico as well as the other platform and environment information requested by the issue template.

@lbogdan
Contributor

lbogdan commented Jul 19, 2022

This just happened to me in an older v1.22.3 cluster, and I've noticed that the calico-node pods had an age of 365d. The problem self-resolved after I deleted all calico-node pods and they were recreated. Is there a certificate / token that has a TTL of 1 year and doesn't get automatically renewed?

@colliwhopper

Hit the same issue, and lbogdan's workaround fixed it for me.

@linehrr

linehrr commented Aug 30, 2022

In our case, the cause was an expired cert.
We renewed the cluster certs with kubeadm and then restarted all the Calico nodes, and the problem was solved.

@nareshmaharaj-consultant

Same issue in 1.22 with Calico

Events:
  Type     Reason                  Age    From               Message
  ----     ------                  ----   ----               -------
  Normal   Scheduled               3m47s  default-scheduler  Successfully assigned default/pod-with-cm to worker-node01
  Warning  FailedCreatePodSandBox  3m46s  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "3dcfdb21462e255a8f4059ca8540c8df05863bd6444cb22290133f894840845e" network for pod "pod-with-cm": networkPlugin cni failed to set up pod "pod-with-cm_default" network: error getting ClusterInformation: connection is unauthorized: Unauthorized, failed to clean up sandbox container "3dcfdb21462e255a8f4059ca8540c8df05863bd6444cb22290133f894840845e" network for pod "pod-with-cm": networkPlugin cni failed to teardown pod "pod-with-cm_default" network: error getting ClusterInformation: connection is unauthorized: Unauthorized]

@BurlyLuo

v3.23.2

root@bpf1:~/wspace/wcni/calico/2-calico-vxlan# kk describe pods calico-kube-controllers-c55c48989-z4kbt
Name:                      calico-kube-controllers-c55c48989-z4kbt
Namespace:                 kube-system
Priority:                  2000000000
Priority Class Name:       system-cluster-critical
Node:                      bpf1/192.168.2.71
Start Time:                Sun, 13 Nov 2022 13:14:26 +0800
Labels:                    k8s-app=calico-kube-controllers
                           pod-template-hash=c55c48989
Annotations:               cni.projectcalico.org/containerID: 0020fe737e31e0082d371b9707d9831cbe9ad72f00c0d9b33fb0e745a8d5b439
                           cni.projectcalico.org/podIP: 10.244.11.66/32
                           cni.projectcalico.org/podIPs: 10.244.11.66/32
Status:                    Terminating (lasts 3m51s)
Termination Grace Period:  30s
IP:                        10.244.11.66
IPs:
  IP:           10.244.11.66
Controlled By:  ReplicaSet/calico-kube-controllers-c55c48989
Containers:
  calico-kube-controllers:
    Container ID:   docker://e9770e244d60bbc2301e2dc5da478c85b80273f6cbe100b5ee4522b8b738ac48
    Image:          192.168.2.100:5000/calico/kube-controllers:v3.23.2
    Image ID:       docker-pullable://192.168.2.100:5000/calico/kube-controllers@sha256:57c40fdfb86dce269a8f93b4f5545b23b7ee9ba36d62e67e7ce367df8d753887
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Sun, 13 Nov 2022 13:14:31 +0800
      Finished:     Sun, 13 Nov 2022 14:53:05 +0800
    Ready:          False
    Restart Count:  0
    Liveness:       exec [/usr/bin/check-status -l] delay=10s timeout=10s period=10s #success=1 #failure=6
    Readiness:      exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ENABLED_CONTROLLERS:  node
      DATASTORE_TYPE:       kubernetes
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dsmrc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-dsmrc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason         Age                  From     Message
  ----     ------         ----                 ----     -------
  Normal   Killing        4m21s                kubelet  Stopping container calico-kube-controllers
  Warning  FailedKillPod  4s (x24 over 4m20s)  kubelet  error killing pod: failed to "KillPodSandbox" for "02a4f270-45e4-4bd7-805d-dfe99763ea66" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"calico-kube-controllers-c55c48989-z4kbt_kube-system\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"
root@bpf1:~/wspace/wcni/calico/2-calico-vxlan# 

@Venture200

@nareshmaharaj-consultant, regarding your report above ("Same issue in 1.22 with Calico"): did you fix this?

@snowsky

snowsky commented Dec 15, 2022

Ran into a similar issue and worked around it with NTP synchronization :)
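
One plausible reason, not confirmed anywhere in this thread, why clock synchronization could matter: a token is only accepted inside its validity window, and the verifier judges that window by its own clock. The sketch below is a deliberately simplified model and not how the API server is implemented. `token_status` and `status_with_skew` are hypothetical names, the issue time is made up, and the 3607 second lifetime is the TokenExpirationSeconds value from the describe output earlier in the thread.

```python
def token_status(not_before: float, expires_at: float, now: float) -> str:
    """Classify a token against a validity window, as seen by one clock."""
    if now < not_before:
        return "not yet valid"
    if now >= expires_at:
        return "expired"
    return "valid"


# Assumed example values: issue time is arbitrary, lifetime is the
# TokenExpirationSeconds shown in the pod description.
ISSUED_AT = 1_000_000
LIFETIME_SECONDS = 3607
EXPIRES_AT = ISSUED_AT + LIFETIME_SECONDS


def status_with_skew(skew_seconds: float, elapsed: float = 60) -> str:
    """Status seen by a verifier whose clock is off by `skew_seconds`,
    checked `elapsed` real seconds after the token was issued."""
    verifier_now = ISSUED_AT + elapsed + skew_seconds
    return token_status(ISSUED_AT, EXPIRES_AT, verifier_now)


if __name__ == "__main__":
    for skew in (0, 2 * 3600, -600):
        print(f"skew {skew:>6}s -> {status_with_skew(skew)}")
```

In this model a freshly issued token is rejected as soon as the two clocks disagree by more than the window allows, which would match the reports here of the error appearing after VM pauses or manual time changes.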

@marpiech

marpiech commented Jan 12, 2023

Replying to @lbogdan's comment above (deleting all calico-node pods so that they are recreated):

I had a slightly different issue, but restarting the calico pod on the node with the failed pod, and then the failed pod itself, helped. The pod moved to another node after the restart. MicroK8s v1.26.0 revision 4390, Calico v3.23.5

@vyom-soft

Hi,
Restarting the calico pods is not working for me. I am on v1.24.6 kubespray

@vyom-soft

networkPlugin cni failed to teardown po

I restarted NetworkManager, containerd, and kubelet. The problem still remains.

@microyahoo

Hi @vyom-soft, is the interface autodetection in your Calico configured correctly? If Calico is not able to identify the Ethernet card (for example, it is configured to detect eth but on the machine the interface is named ens), then setting a matching regex helps it identify the Ethernet card and the associated IP properly.
https://www.unixcloudfusion.in/2022/02/solved-caliconode-is-not-ready-bird-is.html

@usersina

usersina commented Apr 4, 2023

I just had to kubectl delete pod calico-node-xxxx on the node where the issue was happening. A new Pod was created and the problem is solved.

@32328254

I ran kubectl delete pod calico-node-xxxx -n kube-system. A new Pod was created and the problem is solved.

@davidassigbi

kubectl delete pods --all --all-namespaces fixed my issue

I had a similar issue today and all the pods on my cluster were stuck in Unknown or Terminating status, including the calico-node-xxxx.
I ran kubectl delete pod calico-node-xxxx which fixed the calico-node pod, but the other pods were still not ok, so I ran kubectl delete pods --all --all-namespaces to delete ALL the pods and a couple of minutes after the command everything was back up and running well!

@robswc

robswc commented Aug 29, 2023

@davidassigbi's solution worked for me. I was using MicroK8s; I guess something was weird with calico-node-xxxx. I haven't noticed the issue return.

@HeavenElite

Sorry, I'm still a learner, but kubectl delete -f calico.yaml seems better.
I learned it from Stack Overflow.

@fasaxc
Member

fasaxc commented Jan 2, 2024

@HeavenElite NO! Do not run kubectl delete -f calico.yaml unless you want to completely remove Calico and trash your cluster; it will delete Calico's IP address database.

@lvtujingji

I changed the time on the master and the node, and then the same error appeared:
Jan 25 11:34:46 test-node01 kubelet: E0125 11:34:46.938130 56745 remote_runtime.go:269] "StopPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to destroy network for sandbox "969d9ca8a8fef61b41ed6810db45e3d5765e85a62f080ae9135a1d61cf508417": plugin type="calico" failed (delete): error getting ClusterInformation: connection is unauthorized: Unauthorized" podSandboxID="969d9ca8a8fef61b41ed6810db45e3d5765e85a62f080ae9135a1d61cf508417"
After I deleted Calico and reinstalled it, everything returned to normal.

@maciejewskikamil

Replying to @lvtujingji's comment above: I have the same problem on a Windows node.

@imtzer

imtzer commented Mar 3, 2024

Replying to @lbogdan's comment above (deleting all calico-node pods so that they are recreated):

It works! I had synced the time in my VM before; maybe that's the reason.

@sherlock-wong

Replying to @Chainsaw-does-brr's comment above (the error appearing after a node was accidentally rebooted):

It's useful for me!!! I got the same error, and after I rebooted the machine it recovered.

@h888866j

h888866j commented Sep 24, 2024

I met this issue again. It was first seen earlier this month and was resolved by rebooting the nodes, since they are just my test environment on VMware (I pause the VMs at night).
Unfortunately, the issue occurred again today when I added a volume mount. The pod got stuck in the Terminating state. Errors with 10.96.0.1 were seen: net/http: TLS handshake timeout. I dumped some logs in case anyone is interested.

  Warning  FailedKillPod  2m16s (x25 over 7m10s)  kubelet  error killing pod: failed to "KillPodSandbox" for "a431db4f-976a-468c-9b30-083cae3f4a1a" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"aad27fa53c53747a4565e60dab22133157199656d297699d297c86dc50f85560\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: connection is unauthorized: Unauthorized"
[root@master1 prom]$ k get pods | grep prom
prometheus-adapter-6c4cc5465b-q6mnd    1/1     Running       0          32h
prometheus-adapter-6c4cc5465b-z76x4    1/1     Running       0          32h
prometheus-k8s-0                       2/2     Running       0          31h
prometheus-k8s-1                       0/2     Terminating   0          31h
prometheus-operator-57cf88fbcb-wt8x9   2/2     Running       0          32h
[root@master1 prom]$ k describe pod prometheus-k8s-1
Name:                      prometheus-k8s-1
Namespace:                 monitoring
Priority:                  0
Service Account:           prometheus-k8s
Node:                      node1/192.168.234.17
Start Time:                Mon, 23 Sep 2024 15:44:09 +0800
Labels:                    app.kubernetes.io/component=prometheus
                           app.kubernetes.io/instance=k8s
                           app.kubernetes.io/managed-by=prometheus-operator
                           app.kubernetes.io/name=prometheus
                           app.kubernetes.io/part-of=kube-prometheus
                           app.kubernetes.io/version=2.46.0
                           controller-revision-hash=prometheus-k8s-7c7bdb6c6d
                           operator.prometheus.io/name=k8s
                           operator.prometheus.io/shard=0
                           prometheus=k8s
                           statefulset.kubernetes.io/pod-name=prometheus-k8s-1
Annotations:               cni.projectcalico.org/containerID: aad27fa53c53747a4565e60dab22133157199656d297699d297c86dc50f85560
                           cni.projectcalico.org/podIP: 10.244.166.179/32
                           cni.projectcalico.org/podIPs: 10.244.166.179/32
                           kubectl.kubernetes.io/default-container: prometheus
Status:                    Terminating (lasts 42m)
Termination Grace Period:  600s
IP:                        10.244.166.179
IPs:
  IP:           10.244.166.179
Controlled By:  StatefulSet/prometheus-k8s
Init Containers:
  init-config-reloader:
    Container ID:  containerd://270f4290bb379004cf2760024c939d29884265a8fcf1a0de9e5d6dab5aaa2553
    Image:         quay.io/prometheus-operator/prometheus-config-reloader:v0.67.1
    Image ID:      quay.io/prometheus-operator/prometheus-config-reloader@sha256:0fe3cf36985e0e524801a0393f88fa4b5dd5ffdf0f091ff78ee02f2d281631b5
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/prometheus-config-reloader
    Args:
      --watch-interval=0
      --listen-address=:8080
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 23 Sep 2024 15:44:14 +0800
      Finished:     Mon, 23 Sep 2024 15:44:39 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     10m
      memory:  50Mi
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      POD_NAME:  prometheus-k8s-1 (v1:metadata.name)
      SHARD:     0
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5fpsk (ro)
Containers:
  prometheus:
    Container ID:  containerd://c7c473c643ba6694af8f025e916893ba6be98c90bb5177e3d0a87f454bdf8eff
    Image:         quay.io/prometheus/prometheus:v2.46.0
    Image ID:      quay.io/prometheus/prometheus@sha256:d6ead9daf2355b9923479e24d7e93f246253ee6a5eb18a61b0f607219f341a80
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --web.console.templates=/etc/prometheus/consoles
      --web.console.libraries=/etc/prometheus/console_libraries
      --config.file=/etc/prometheus/config_out/prometheus.env.yaml
      --web.enable-lifecycle
      --web.route-prefix=/
      --storage.tsdb.retention.time=24h
      --storage.tsdb.path=/prometheus
      --web.config.file=/etc/prometheus/web_config/web-config.yaml
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 23 Sep 2024 15:44:41 +0800
      Finished:     Tue, 24 Sep 2024 21:55:16 +0800
    Ready:          False
    Restart Count:  0
    Requests:
      memory:     400Mi
    Liveness:     http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6
    Readiness:    http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=3
    Startup:      http-get http://:web/-/ready delay=0s timeout=3s period=15s #success=1 #failure=60
    Environment:  <none>
    Mounts:
      /etc/prometheus/certs from tls-assets (ro)
      /etc/prometheus/config_out from config-out (ro)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /etc/prometheus/secrets/etcd-healthcheck-certs from etcd-healthcheck-certs (rw)
      /etc/prometheus/web_config/web-config.yaml from web-config (ro,path="web-config.yaml")
      /prometheus from prometheus-k8s-db (rw,path="prometheus-db")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5fpsk (ro)
  config-reloader:
    Container ID:  containerd://12228eb60a98c74333db746c34649cbc6e48de7a8555284305cc83ad5ab150e2
    Image:         quay.io/prometheus-operator/prometheus-config-reloader:v0.67.1
    Image ID:      quay.io/prometheus-operator/prometheus-config-reloader@sha256:0fe3cf36985e0e524801a0393f88fa4b5dd5ffdf0f091ff78ee02f2d281631b5
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/prometheus-config-reloader
    Args:
      --listen-address=:8080
      --reload-url=http://localhost:9090/-/reload
      --config-file=/etc/prometheus/config/prometheus.yaml.gz
      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
      --watched-dir=/etc/prometheus/rules/prometheus-k8s-rulefiles-0
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 23 Sep 2024 15:44:43 +0800
      Finished:     Tue, 24 Sep 2024 21:55:20 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     10m
      memory:  50Mi
    Requests:
      cpu:     10m
      memory:  50Mi
    Environment:
      POD_NAME:  prometheus-k8s-1 (v1:metadata.name)
      SHARD:     0
    Mounts:
      /etc/prometheus/config from config (rw)
      /etc/prometheus/config_out from config-out (rw)
      /etc/prometheus/rules/prometheus-k8s-rulefiles-0 from prometheus-k8s-rulefiles-0 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5fpsk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  prometheus-k8s-db:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  prometheus-k8s-db-prometheus-k8s-1
    ReadOnly:   false
  config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s
    Optional:    false
  tls-assets:
    Type:                Projected (a volume that contains injected data from multiple sources)
    SecretName:          prometheus-k8s-tls-assets-0
    SecretOptionalName:  <nil>
  config-out:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  prometheus-k8s-rulefiles-0:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      prometheus-k8s-rulefiles-0
    Optional:  false
  web-config:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-k8s-web-config
    Optional:    false
  etcd-healthcheck-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  etcd-healthcheck-certs
    Optional:    false
  kube-api-access-5fpsk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason         Age                   From     Message
  ----     ------         ----                  ----     -------
  Normal   Killing        52m                   kubelet  Stopping container prometheus
  Normal   Killing        52m                   kubelet  Stopping container config-reloader
  Warning  Unhealthy      51m (x2 over 52m)     kubelet  Readiness probe failed: Get "http://10.244.166.179:9090/-/ready": dial tcp 10.244.166.179:9090: connect: connection refused
  Warning  FailedKillPod  115s (x231 over 51m)  kubelet  error killing pod: failed to "KillPodSandbox" for "a431db4f-976a-468c-9b30-083cae3f4a1a" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for sandbox \"aad27fa53c53747a4565e60dab22133157199656d297699d297c86dc50f85560\": plugin type=\"calico\" failed (delete): error getting ClusterInformation: connection is unauthorized: Unauthorized"


Your Environment
Calico version: 3.28.0
Orchestrator version: kubernetes v1.26.15
Operating System and version: Rocky Linux 9.4

[root@master3 ~]$ k get pods -n kube-system -owide | grep calico
calico-kube-controllers-599ff45f46-htxqq           1/1     Running   12 (2d4h ago)    61d    10.244.136.12    master3   <none>           <none>
calico-node-4694r                                  1/1     Running   11 (2d5h ago)    61d    192.168.234.13   master3   <none>           <none>
calico-node-58wxz                                  1/1     Running   17 (2d5h ago)    61d    192.168.234.17   node1     <none>           <none>
calico-node-k7vh6                                  1/1     Running   35 (2d5h ago)    61d    192.168.234.18   node2     <none>           <none>
calico-node-nfzc9                                  1/1     Running   11 (2d5h ago)    61d    192.168.234.11   master1   <none>           <none>
calico-node-rtj6c                                  1/1     Running   14 (2d3h ago)    61d    192.168.234.12   master2   <none>           <none>

calico node yaml:
[root@master2 ~]$ k -n kube-system get pods calico-node-58wxz  -oyaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-07-25T11:09:44Z"
  generateName: calico-node-
  labels:
    controller-revision-hash: dd6b874b5
    k8s-app: calico-node
    pod-template-generation: "1"
  name: calico-node-58wxz
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: DaemonSet
    name: calico-node
    uid: 7071a2a6-f396-4b8f-ac47-44d94c8887ad
  resourceVersion: "3471983"
  uid: 6c9030e0-2700-4ee4-8653-abff4bd180b6
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - node1
  containers:
  - env:
    - name: DATASTORE_TYPE
      value: kubernetes
    - name: WAIT_FOR_DATASTORE
      value: "true"
    - name: NODENAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CALICO_NETWORKING_BACKEND
      valueFrom:
        configMapKeyRef:
          key: calico_backend
          name: calico-config
    - name: CLUSTER_TYPE
      value: k8s,bgp
    - name: IP
      value: autodetect
    - name: CALICO_IPV4POOL_IPIP
      value: Always
    - name: CALICO_IPV4POOL_VXLAN
      value: Never
    - name: CALICO_IPV6POOL_VXLAN
      value: Never
    - name: FELIX_IPINIPMTU
      valueFrom:
        configMapKeyRef:
          key: veth_mtu
          name: calico-config
    - name: FELIX_VXLANMTU
      valueFrom:
        configMapKeyRef:
          key: veth_mtu
          name: calico-config
    - name: FELIX_WIREGUARDMTU
      valueFrom:
        configMapKeyRef:
          key: veth_mtu
          name: calico-config
    - name: CALICO_IPV4POOL_CIDR
      value: 10.244.0.0/16
    - name: CALICO_DISABLE_FILE_LOGGING
      value: "true"
    - name: FELIX_DEFAULTENDPOINTTOHOSTACTION
      value: ACCEPT
    - name: FELIX_IPV6SUPPORT
      value: "false"
    - name: FELIX_HEALTHENABLED
      value: "true"
    envFrom:
    - configMapRef:
        name: kubernetes-services-endpoint
        optional: true
    image: docker.io/calico/node:v3.28.0
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/calico-node
          - -shutdown
    livenessProbe:
      exec:
        command:
        - /bin/calico-node
        - -felix-live
        - -bird-live
      failureThreshold: 6
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    name: calico-node
    readinessProbe:
      exec:
        command:
        - /bin/calico-node
        - -felix-ready
        - -bird-ready
      failureThreshold: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 10
    resources:
      requests:
        cpu: 250m
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host/etc/cni/net.d
      name: cni-net-dir
    - mountPath: /lib/modules
      name: lib-modules
      readOnly: true
    - mountPath: /run/xtables.lock
      name: xtables-lock
    - mountPath: /var/run/calico
      name: var-run-calico
    - mountPath: /var/lib/calico
      name: var-lib-calico
    - mountPath: /var/run/nodeagent
      name: policysync
    - mountPath: /sys/fs/bpf
      name: bpffs
    - mountPath: /var/log/calico/cni
      name: cni-log-dir
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-w28qx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  initContainers:
  - command:
    - /opt/cni/bin/calico-ipam
    - -upgrade
    env:
    - name: KUBERNETES_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CALICO_NETWORKING_BACKEND
      valueFrom:
        configMapKeyRef:
          key: calico_backend
          name: calico-config
    envFrom:
    - configMapRef:
        name: kubernetes-services-endpoint
        optional: true
    image: docker.io/calico/cni:v3.28.0
    imagePullPolicy: IfNotPresent
    name: upgrade-ipam
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/cni/networks
      name: host-local-net-dir
    - mountPath: /host/opt/cni/bin
      name: cni-bin-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-w28qx
      readOnly: true
  - command:
    - /opt/cni/bin/install
    env:
    - name: CNI_CONF_NAME
      value: 10-calico.conflist
    - name: CNI_NETWORK_CONFIG
      valueFrom:
        configMapKeyRef:
          key: cni_network_config
          name: calico-config
    - name: KUBERNETES_NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CNI_MTU
      valueFrom:
        configMapKeyRef:
          key: veth_mtu
          name: calico-config
    - name: SLEEP
      value: "false"
    envFrom:
    - configMapRef:
        name: kubernetes-services-endpoint
        optional: true
    image: docker.io/calico/cni:v3.28.0
    imagePullPolicy: IfNotPresent
    name: install-cni
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host/opt/cni/bin
      name: cni-bin-dir
    - mountPath: /host/etc/cni/net.d
      name: cni-net-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-w28qx
      readOnly: true
  - command:
    - calico-node
    - -init
    - -best-effort
    image: docker.io/calico/node:v3.28.0
    imagePullPolicy: IfNotPresent
    name: mount-bpffs
    resources: {}
    securityContext:
      privileged: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /sys/fs
      mountPropagation: Bidirectional
      name: sys-fs
    - mountPath: /var/run/calico
      mountPropagation: Bidirectional
      name: var-run-calico
    - mountPath: /nodeproc
      name: nodeproc
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-w28qx
      readOnly: true
  nodeName: node1
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: 2000001000
  priorityClassName: system-node-critical
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: calico-node
  serviceAccountName: calico-node
  terminationGracePeriodSeconds: 0
  tolerations:
  - effect: NoSchedule
    operator: Exists
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoExecute
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/network-unavailable
    operator: Exists
  volumes:
  - hostPath:
      path: /lib/modules
      type: ""
    name: lib-modules
  - hostPath:
      path: /var/run/calico
      type: ""
    name: var-run-calico
  - hostPath:
      path: /var/lib/calico
      type: ""
    name: var-lib-calico
  - hostPath:
      path: /run/xtables.lock
      type: FileOrCreate
    name: xtables-lock
  - hostPath:
      path: /sys/fs/
      type: DirectoryOrCreate
    name: sys-fs
  - hostPath:
      path: /sys/fs/bpf
      type: Directory
    name: bpffs
  - hostPath:
      path: /proc
      type: ""
    name: nodeproc
  - hostPath:
      path: /opt/cni/bin
      type: ""
    name: cni-bin-dir
  - hostPath:
      path: /etc/cni/net.d
      type: ""
    name: cni-net-dir
  - hostPath:
      path: /var/log/calico/cni
      type: ""
    name: cni-log-dir
  - hostPath:
      path: /var/lib/cni/networks
      type: ""
    name: host-local-net-dir
  - hostPath:
      path: /var/run/nodeagent
      type: DirectoryOrCreate
    name: policysync
  - name: kube-api-access-w28qx
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-09-22T09:21:10Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-09-22T09:23:01Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-09-22T09:23:01Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-07-25T11:09:44Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://e1fe61157237d46f9004e57ae53a97b2801ce1d421121fa15d67e1b2e1a94f93
    image: docker.io/calico/node:v3.28.0
    imageID: docker.io/calico/node@sha256:385bf6391fea031649b8575799248762a2caece86e6e3f33ffee19c0c096e6a8
    lastState:
      terminated:
        containerID: containerd://b3519095fa4d62c6c29b3309228a9b23e051a6357151815e72c5744600de0450
        exitCode: 0
        finishedAt: "2024-09-22T09:22:19Z"
        reason: Completed
        startedAt: "2024-09-22T09:21:12Z"
    name: calico-node
    ready: true
    restartCount: 17
    started: true
    state:
      running:
        startedAt: "2024-09-22T09:22:21Z"
  hostIP: 192.168.234.17
  initContainerStatuses:
  - containerID: containerd://be9ffd822b13ae25f060c7364d2f6694aaee1e517801573692a655e377fa3bc6
    image: docker.io/calico/cni:v3.28.0
    imageID: docker.io/calico/cni@sha256:cef0c907b8f4cadc63701d371e6f24d325795bcf0be84d6a517e33000ff35f70
    lastState: {}
    name: upgrade-ipam
    ready: true
    restartCount: 2
    state:
      terminated:
        containerID: containerd://be9ffd822b13ae25f060c7364d2f6694aaee1e517801573692a655e377fa3bc6
        exitCode: 0
        finishedAt: "2024-09-22T09:20:36Z"
        reason: Completed
        startedAt: "2024-09-22T09:20:33Z"
  - containerID: containerd://142f1a60440b3e9d1ef85ac2d6645b8386c04039323197ca17885edf77323989
    image: docker.io/calico/cni:v3.28.0
    imageID: docker.io/calico/cni@sha256:cef0c907b8f4cadc63701d371e6f24d325795bcf0be84d6a517e33000ff35f70
    lastState: {}
    name: install-cni
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://142f1a60440b3e9d1ef85ac2d6645b8386c04039323197ca17885edf77323989
        exitCode: 0
        finishedAt: "2024-09-22T09:20:53Z"
        reason: Completed
        startedAt: "2024-09-22T09:20:39Z"
  - containerID: containerd://39f0adc5c3f9c99866311258226905cffc9c6e8990ab21cfbe4c9cb7e74bcc72
    image: docker.io/calico/node:v3.28.0
    imageID: docker.io/calico/node@sha256:385bf6391fea031649b8575799248762a2caece86e6e3f33ffee19c0c096e6a8
    lastState: {}
    name: mount-bpffs
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://39f0adc5c3f9c99866311258226905cffc9c6e8990ab21cfbe4c9cb7e74bcc72
        exitCode: 0
        finishedAt: "2024-09-22T09:21:05Z"
        reason: Completed
        startedAt: "2024-09-22T09:20:59Z"
  phase: Running
  podIP: 192.168.234.17
  podIPs:
  - ip: 192.168.234.17
  qosClass: Burstable
  startTime: "2024-07-25T11:09:45Z"


journalctl --unit containerd --no-pager >containerd-unauthorized.log
containerd-unauthorized.log

k -n kube-system logs calico-node-58wxz > calico-node-node1-logs.log
calico-node-node1-logs.log

@utamas

utamas commented Dec 30, 2024

Rebooting solved my issue (my nodes are for testing, and the control plane was shut down for some days around the holidays).

@jtackaberry

I run into this on a fairly regular cadence, maybe 3-4 times per year on a small cluster of 4 nodes. Nuking the calico-node pod on the affected node and waiting for it to restart reliably allows the terminating pod to make progress, but is anyone aware of a permanent fix that doesn't require operator intervention?
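
For reference, the manual workaround described above could be scripted roughly as follows. This is only a sketch, not a permanent fix: it assumes calico-node runs in the `kube-system` namespace with the label `k8s-app=calico-node` (as in the pod YAML posted earlier in this thread), and `restart_calico_node` and the `KUBECTL` override are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the manual workaround: delete the calico-node pod on one node so
# that the DaemonSet controller recreates it.
# Assumptions: namespace kube-system and label k8s-app=calico-node, matching
# the manifest shown in this thread. restart_calico_node and KUBECTL are
# hypothetical names used only for this example.
restart_calico_node() {
  local node="${1:-}"
  if [ -z "$node" ]; then
    echo "usage: restart_calico_node <node-name>" >&2
    return 1
  fi
  # KUBECTL can be overridden (for example KUBECTL=echo) to preview the
  # command instead of running it against the cluster.
  "${KUBECTL:-kubectl}" delete pod -n kube-system \
    -l k8s-app=calico-node \
    --field-selector "spec.nodeName=${node}"
}

# Example: restart_calico_node node1
```

Running it with `KUBECTL=echo restart_calico_node node1` prints the arguments that would be passed to kubectl without touching the cluster, which may be useful before wiring it into any automation.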
