calico pods not ready/up after kubernetes node reboot #6687

Closed
vinayus opened this issue Sep 11, 2022 · 7 comments

vinayus commented Sep 11, 2022

We have a Kubernetes cluster (3 masters and 9 worker nodes), and during a maintenance cycle all VMs in the cluster are powered off. When the VMs are powered back on, the Calico pods and other pods are not healthy. I have updated Calico from the earlier 3.22 to 3.24.1, and I have also tried this on a clean cluster.

Pods are in CreateContainerError, and some log Unauthorized messages in the container console.
As a workaround, I am restarting all Calico pods, which returns them to a healthy state after a while.

I have been going through multiple issues such as #5910 but haven't had any luck. I added a couple of extra ClusterRoles, which didn't help. The calico.yaml manifest is from the official site with only the image repository changed, since the cluster has no internet access.
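
For reference, the repository change was roughly along these lines (the sed pattern and internal registry path below are illustrative, not the literal commands used):

    # point the Calico images at the internal registry mirror before applying the manifest
    sed -i 's#docker.io/calico/#10.106.35.180/atom/calico/#g' calico.yaml
    kubectl apply -f calico.yaml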

Kindly help

Expected Behavior

Pods should recover or restart themselves.

Current Behavior

Unhealthy pods observed in multiple namespaces

kubectl get pods -n kube-system -> https://justpaste.it/950st

kubectl logs -n kube-system calico-kube-controllers-74574fb497-f97c2 --tail=100 -> https://justpaste.it/9cplv

kubectl describe pod -n kube-system calico-kube-controllers-74574fb497-f97c2 -> https://justpaste.it/8h8zl

kubectl logs -n kube-system calico-node-6msxr -c install-cni --tail=100 -> https://justpaste.it/2sq4h

kubectl describe pod -n kube-system calico-node-6msxr -> https://justpaste.it/7qg8c

kubectl logs -n kube-system coredns-6c99bc4cf8-qdn6x -> https://justpaste.it/4kwd2

Some direct logs captured:

[atom@atom-cht-master1 ~]$ kubectl logs -n kube-system calico-node-g9jms --tail=100
2022-09-11 19:52:13.954 [INFO][9] startup/startup.go 425: Early log level set to info
2022-09-11 19:52:13.954 [INFO][9] startup/utils.go 127: Using NODENAME environment for node name atom-cht-master2
2022-09-11 19:52:13.954 [INFO][9] startup/utils.go 139: Determined node name: atom-cht-master2
2022-09-11 19:52:13.954 [INFO][9] startup/startup.go 94: Starting node atom-cht-master2 with version v3.24.1
2022-09-11 19:52:13.955 [INFO][9] startup/startup.go 430: Checking datastore connection
Calico node failed to start
2022-09-11 19:52:13.965 [WARNING][9] startup/startup.go 442: Connection to the datastore is unauthorized error=connection is unauthorized: Unauthorized
2022-09-11 19:52:13.966 [WARNING][9] startup/utils.go 49: Terminating

Logs from another pod's console, where piraeus-op-cs is a service name:

19:50:50.073 [Main] INFO  LINSTOR/Controller - SYSTEM - Initializing the etcd database
19:50:50.073 [Main] INFO  LINSTOR/Controller - SYSTEM - etcd connection URL is "etcd://piraeus-op-etcd:2379"
E0911 11:49:43.887576       1 leaderelection.go:330] error retrieving resource lock linstor/piraeus-op-cs: Unauthorized
E0911 11:49:45.890869       1 leaderelection.go:330] error retrieving resource lock linstor/piraeus-op-cs: Unauthorized
E0911 11:49:47.892569       1 leaderelection.go:330] error retrieving resource lock linstor/piraeus-op-cs: Unauthorized
E0911 11:49:49.890010       1 leaderelection.go:330] error retrieving resource lock linstor/piraeus-op-cs: Unauthorized
E0911 11:49:51.891151       1 leaderelection.go:330] error retrieving resource lock linstor/piraeus-op-cs: Unauthorized
I0911 11:49:53.885201       1 leaderelection.go:283] failed to renew lease linstor/piraeus-op-cs: timed out waiting for the condition
E0911 11:49:53.885406       1 leaderelection.go:306] Failed to release lock: resource name may not be empty
time="2022-09-11T11:49:53Z" level=info msg="lost leader status"
time="2022-09-11T11:49:53Z" level=fatal msg="failed to run" err="signal: killed"

Some logs from /var/log/messages:

Sep 11 08:55:49 atom-cht-master1 kubelet: I0911 08:55:49.706528     930 scope.go:110] "RemoveContainer" containerID="f4a876f1375ed07a05a5f40f6742841eafbc079392b28c1d2e595c97a629ef78"
Sep 11 08:55:49 atom-cht-master1 kubelet: I0911 08:55:49.707029     930 scope.go:110] "RemoveContainer" containerID="44a22b7169ef97cb4783b9053a58a3735e6ddb2649734ad5936a4392fded16bf"
Sep 11 08:55:49 atom-cht-master1 kubelet: I0911 08:55:49.716221     930 helpers.go:292] "Unable to create pod sandbox due to conflict. Attempting to remove sandbox" containerID="f4a876f1375ed07a05a5f40f6742841eafbc079392b28c1d2e595c97a629ef78"
Sep 11 08:55:49 atom-cht-master1 kubelet: E0911 08:55:49.717108     930 helpers.go:298] "Failed to remove the conflicting container" err="Error response from daemon: removal of container f4a876f1375ed07a05a5f40f6742841eafbc079392b28c1d2e595c97a629ef78 is already in progress" containerID="f4a876f1375ed07a05a5f40f6742841eafbc079392b28c1d2e595c97a629ef78"
Sep 11 08:55:49 atom-cht-master1 kubelet: E0911 08:55:49.717198     930 remote_runtime.go:394] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = Error response from daemon: Conflict. The container name \"/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5\" is already in use by container \"f4a876f1375ed07a05a5f40f6742841eafbc079392b28c1d2e595c97a629ef78\". You have to remove (or rename) that container to be able to reuse that name." podSandboxID="6c0d77e91e02f43daaf4b8cddcf0a20acfbf76deecfe73df2d04791d75627d23"
Sep 11 08:55:49 atom-cht-master1 kubelet: E0911 08:55:49.717292     930 kuberuntime_manager.go:919] container &Container{Name:piraeus-operator,Image:10.106.35.180/atom/piraeus-operator:v1.9.1,Command:[],Args:[--create-backups=true --create-monitoring=true],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:WATCH_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:POD_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:OPERATOR_NAME,Value:piraeus-op-operator,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{150 -3} {<nil>} 150m DecimalSI},memory: {{268435456 0} {<nil>}  BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:backup-dir,ReadOnly:false,MountPath:/run/linstor-backups,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:kube-api-access-gcf5b,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:{0 8081 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/readyz,Port:{0 8081 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod piraeus-op-operator-7c9bcdcf85-jw7rc_linstor(55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59): CreateContainerError: Error response from daemon: Conflict. The container name "/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5" is already in use by container "f4a876f1375ed07a05a5f40f6742841eafbc079392b28c1d2e595c97a629ef78". You have to remove (or rename) that container to be able to reuse that name.
Sep 11 08:55:49 atom-cht-master1 kubelet: E0911 08:55:49.717328     930 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"piraeus-operator\" with CreateContainerError: \"Error response from daemon: Conflict. The container name \\\"/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5\\\" is already in use by container \\\"f4a876f1375ed07a05a5f40f6742841eafbc079392b28c1d2e595c97a629ef78\\\". You have to remove (or rename) that container to be able to reuse that name.\"" pod="linstor/piraeus-op-operator-7c9bcdcf85-jw7rc" podUID=55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59
Sep 11 08:55:57 atom-cht-master1 systemd: [email protected] holdoff time over, scheduling restart.
Sep 11 08:55:57 atom-cht-master1 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Sep 11 08:55:57 atom-cht-master1 systemd: Stopped Serial Getty on ttyS0.
Sep 11 08:55:57 atom-cht-master1 systemd: Started Serial Getty on ttyS0.
Sep 11 08:56:04 atom-cht-master1 kubelet: I0911 08:56:04.920458     930 scope.go:110] "RemoveContainer" containerID="44a22b7169ef97cb4783b9053a58a3735e6ddb2649734ad5936a4392fded16bf"
Sep 11 08:56:04 atom-cht-master1 containerd: time="2022-09-11T08:56:04.979206307-04:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Sep 11 08:56:04 atom-cht-master1 containerd: time="2022-09-11T08:56:04.979303759-04:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Sep 11 08:56:04 atom-cht-master1 containerd: time="2022-09-11T08:56:04.979321920-04:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Sep 11 08:56:04 atom-cht-master1 containerd: time="2022-09-11T08:56:04.979493834-04:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814 pid=78875 runtime=io.containerd.runc.v2
Sep 11 08:56:05 atom-cht-master1 systemd: Started libcontainer container 11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814.
Sep 11 08:56:05 atom-cht-master1 kubelet: I0911 08:56:05.115968     930 scope.go:110] "RemoveContainer" containerID="44a22b7169ef97cb4783b9053a58a3735e6ddb2649734ad5936a4392fded16bf"
Sep 11 08:56:05 atom-cht-master1 kubelet: I0911 08:56:05.124451     930 helpers.go:292] "Unable to create pod sandbox due to conflict. Attempting to remove sandbox" containerID="11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814"
Sep 11 08:56:05 atom-cht-master1 dockerd: time="2022-09-11T08:56:05.134148777-04:00" level=info msg="ignoring event" container=11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 11 08:56:05 atom-cht-master1 containerd: time="2022-09-11T08:56:05.134268531-04:00" level=info msg="shim disconnected" id=11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814
Sep 11 08:56:05 atom-cht-master1 containerd: time="2022-09-11T08:56:05.134321058-04:00" level=warning msg="cleaning up after shim disconnected" id=11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814 namespace=moby
Sep 11 08:56:05 atom-cht-master1 containerd: time="2022-09-11T08:56:05.134330576-04:00" level=info msg="cleaning up dead shim"
Sep 11 08:56:05 atom-cht-master1 containerd: time="2022-09-11T08:56:05.145792700-04:00" level=warning msg="cleanup warnings time=\"2022-09-11T08:56:05-04:00\" level=info msg=\"starting signal loop\" namespace=moby pid=78921 runtime=io.containerd.runc.v2\n"
Sep 11 08:56:05 atom-cht-master1 kubelet: E0911 08:56:05.161266     930 remote_runtime.go:394] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = Error response from daemon: Conflict. The container name \"/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5\" is already in use by container \"11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814\". You have to remove (or rename) that container to be able to reuse that name." podSandboxID="6c0d77e91e02f43daaf4b8cddcf0a20acfbf76deecfe73df2d04791d75627d23"
Sep 11 08:56:05 atom-cht-master1 kubelet: E0911 08:56:05.161379     930 kuberuntime_manager.go:919] container &Container{Name:piraeus-operator,Image:10.106.35.180/atom/piraeus-operator:v1.9.1,Command:[],Args:[--create-backups=true --create-monitoring=true],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:WATCH_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:POD_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:OPERATOR_NAME,Value:piraeus-op-operator,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{150 -3} {<nil>} 150m DecimalSI},memory: {{268435456 0} {<nil>}  BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:backup-dir,ReadOnly:false,MountPath:/run/linstor-backups,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:kube-api-access-gcf5b,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:{0 8081 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/readyz,Port:{0 8081 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod piraeus-op-operator-7c9bcdcf85-jw7rc_linstor(55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59): CreateContainerError: Error response from daemon: Conflict. The container name "/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5" is already in use by container "11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814". You have to remove (or rename) that container to be able to reuse that name.
Sep 11 08:56:05 atom-cht-master1 kubelet: E0911 08:56:05.161419     930 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"piraeus-operator\" with CreateContainerError: \"Error response from daemon: Conflict. The container name \\\"/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5\\\" is already in use by container \\\"11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814\\\". You have to remove (or rename) that container to be able to reuse that name.\"" pod="linstor/piraeus-op-operator-7c9bcdcf85-jw7rc" podUID=55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59
Sep 11 08:56:06 atom-cht-master1 kubelet: I0911 08:56:06.157943     930 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814"
Sep 11 08:56:06 atom-cht-master1 kubelet: I0911 08:56:06.158636     930 scope.go:110] "RemoveContainer" containerID="44a22b7169ef97cb4783b9053a58a3735e6ddb2649734ad5936a4392fded16bf"
Sep 11 08:56:06 atom-cht-master1 containerd: time="2022-09-11T08:56:06.217327964-04:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Sep 11 08:56:06 atom-cht-master1 containerd: time="2022-09-11T08:56:06.217417564-04:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Sep 11 08:56:06 atom-cht-master1 containerd: time="2022-09-11T08:56:06.217436523-04:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Sep 11 08:56:06 atom-cht-master1 containerd: time="2022-09-11T08:56:06.217631348-04:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943 pid=78977 runtime=io.containerd.runc.v2
Sep 11 08:56:06 atom-cht-master1 systemd: Started libcontainer container 39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943.
Sep 11 08:56:06 atom-cht-master1 kubelet: I0911 08:56:06.332742     930 docker_container.go:247] "Deleted previously existing symlink file" path="/var/log/pods/linstor_piraeus-op-operator-7c9bcdcf85-jw7rc_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59/piraeus-operator/5.log"
Sep 11 08:56:06 atom-cht-master1 dockerd: time="2022-09-11T08:56:06.397817331-04:00" level=info msg="ignoring event" container=39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 11 08:56:06 atom-cht-master1 containerd: time="2022-09-11T08:56:06.397993342-04:00" level=info msg="shim disconnected" id=39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943
Sep 11 08:56:06 atom-cht-master1 containerd: time="2022-09-11T08:56:06.398042617-04:00" level=warning msg="cleaning up after shim disconnected" id=39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943 namespace=moby
Sep 11 08:56:06 atom-cht-master1 containerd: time="2022-09-11T08:56:06.398055173-04:00" level=info msg="cleaning up dead shim"
Sep 11 08:56:06 atom-cht-master1 containerd: time="2022-09-11T08:56:06.417121375-04:00" level=warning msg="cleanup warnings time=\"2022-09-11T08:56:06-04:00\" level=info msg=\"starting signal loop\" namespace=moby pid=79023 runtime=io.containerd.runc.v2\n"
Sep 11 08:56:07 atom-cht-master1 kubelet: I0911 08:56:07.197723     930 scope.go:110] "RemoveContainer" containerID="39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943"
Sep 11 08:56:07 atom-cht-master1 kubelet: I0911 08:56:07.198030     930 scope.go:110] "RemoveContainer" containerID="44a22b7169ef97cb4783b9053a58a3735e6ddb2649734ad5936a4392fded16bf"
Sep 11 08:56:07 atom-cht-master1 kubelet: I0911 08:56:07.202611     930 helpers.go:292] "Unable to create pod sandbox due to conflict. Attempting to remove sandbox" containerID="39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943"
Sep 11 08:56:07 atom-cht-master1 kubelet: E0911 08:56:07.202935     930 helpers.go:298] "Failed to remove the conflicting container" err="Error response from daemon: removal of container 39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943 is already in progress" containerID="39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943"
Sep 11 08:56:07 atom-cht-master1 kubelet: E0911 08:56:07.203057     930 remote_runtime.go:394] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = Error response from daemon: Conflict. The container name \"/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5\" is already in use by container \"39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943\". You have to remove (or rename) that container to be able to reuse that name." podSandboxID="6c0d77e91e02f43daaf4b8cddcf0a20acfbf76deecfe73df2d04791d75627d23"
Sep 11 08:56:07 atom-cht-master1 kubelet: E0911 08:56:07.203161     930 kuberuntime_manager.go:919] container &Container{Name:piraeus-operator,Image:10.106.35.180/atom/piraeus-operator:v1.9.1,Command:[],Args:[--create-backups=true --create-monitoring=true],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:WATCH_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:POD_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:OPERATOR_NAME,Value:piraeus-op-operator,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{150 -3} {<nil>} 150m DecimalSI},memory: {{268435456 0} {<nil>}  BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:backup-dir,ReadOnly:false,MountPath:/run/linstor-backups,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:kube-api-access-gcf5b,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:{0 8081 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/readyz,Port:{0 8081 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod piraeus-op-operator-7c9bcdcf85-jw7rc_linstor(55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59): CreateContainerError: Error response from daemon: Conflict. The container name "/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5" is already in use by container "39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943". You have to remove (or rename) that container to be able to reuse that name.
Sep 11 08:56:07 atom-cht-master1 kubelet: E0911 08:56:07.203200     930 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"piraeus-operator\" with CreateContainerError: \"Error response from daemon: Conflict. The container name \\\"/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5\\\" is already in use by container \\\"39e9a07da3e2b2c25c68cc0b9fea54b4683af55ecb01d817026f995298e98943\\\". You have to remove (or rename) that container to be able to reuse that name.\"" pod="linstor/piraeus-op-operator-7c9bcdcf85-jw7rc" podUID=55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59
Sep 11 08:56:07 atom-cht-master1 systemd: [email protected] holdoff time over, scheduling restart.
Sep 11 08:56:07 atom-cht-master1 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Sep 11 08:56:07 atom-cht-master1 systemd: Stopped Serial Getty on ttyS0.
Sep 11 08:56:07 atom-cht-master1 systemd: Started Serial Getty on ttyS0.
Sep 11 08:56:09 atom-cht-master1 kubelet: E0911 08:56:09.027608     930 remote_runtime.go:572] "ContainerStatus from runtime service failed" err="rpc error: code = Unknown desc = Error: No such container: 11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814" containerID="11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814"
Sep 11 08:56:09 atom-cht-master1 kubelet: I0911 08:56:09.027680     930 kuberuntime_gc.go:361] "Error getting ContainerStatus for containerID" containerID="11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814" err="rpc error: code = Unknown desc = Error: No such container: 11482b9af97944398f3446f86b8e85c5e7acaa96e33b8e637a543a074a4bf814"
Sep 11 08:56:17 atom-cht-master1 systemd: [email protected] holdoff time over, scheduling restart.
Sep 11 08:56:17 atom-cht-master1 systemd: Cannot add dependency job for unit lvm2-lvmetad.socket, ignoring: Unit is masked.
Sep 11 08:56:17 atom-cht-master1 systemd: Stopped Serial Getty on ttyS0.
Sep 11 08:56:17 atom-cht-master1 systemd: Started Serial Getty on ttyS0.
Sep 11 08:56:19 atom-cht-master1 kubelet: E0911 08:56:19.187938     930 server.go:274] "Unable to authenticate the request due to an error" err="[invalid bearer token, service account token is not valid yet]"
Sep 11 08:56:21 atom-cht-master1 kubelet: I0911 08:56:21.920078     930 scope.go:110] "RemoveContainer" containerID="44a22b7169ef97cb4783b9053a58a3735e6ddb2649734ad5936a4392fded16bf"
Sep 11 08:56:21 atom-cht-master1 containerd: time="2022-09-11T08:56:21.962649630-04:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Sep 11 08:56:21 atom-cht-master1 containerd: time="2022-09-11T08:56:21.962800509-04:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Sep 11 08:56:21 atom-cht-master1 containerd: time="2022-09-11T08:56:21.962817699-04:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Sep 11 08:56:21 atom-cht-master1 containerd: time="2022-09-11T08:56:21.963051107-04:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/b19908888c494027a1427f9ef222629123c54df6f027d820cccabcf347134699 pid=79241 runtime=io.containerd.runc.v2
Sep 11 08:56:21 atom-cht-master1 systemd: Started libcontainer container b19908888c494027a1427f9ef222629123c54df6f027d820cccabcf347134699.
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.099523805-04:00" level=info msg="shim disconnected" id=b19908888c494027a1427f9ef222629123c54df6f027d820cccabcf347134699
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.099677070-04:00" level=warning msg="cleaning up after shim disconnected" id=b19908888c494027a1427f9ef222629123c54df6f027d820cccabcf347134699 namespace=moby
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.100392880-04:00" level=info msg="cleaning up dead shim"
Sep 11 08:56:22 atom-cht-master1 dockerd: time="2022-09-11T08:56:22.100507435-04:00" level=info msg="ignoring event" container=b19908888c494027a1427f9ef222629123c54df6f027d820cccabcf347134699 module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.113372218-04:00" level=warning msg="cleanup warnings time=\"2022-09-11T08:56:22-04:00\" level=info msg=\"starting signal loop\" namespace=moby pid=79287 runtime=io.containerd.runc.v2\n"
Sep 11 08:56:22 atom-cht-master1 kubelet: I0911 08:56:22.581021     930 scope.go:110] "RemoveContainer" containerID="b19908888c494027a1427f9ef222629123c54df6f027d820cccabcf347134699"
Sep 11 08:56:22 atom-cht-master1 kubelet: I0911 08:56:22.581586     930 scope.go:110] "RemoveContainer" containerID="44a22b7169ef97cb4783b9053a58a3735e6ddb2649734ad5936a4392fded16bf"
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.637499554-04:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.637575832-04:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.637589834-04:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.637765706-04:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd pid=79315 runtime=io.containerd.runc.v2
Sep 11 08:56:22 atom-cht-master1 systemd: Started libcontainer container 3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd.
Sep 11 08:56:22 atom-cht-master1 dockerd: time="2022-09-11T08:56:22.823959884-04:00" level=info msg="ignoring event" container=3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.824104805-04:00" level=info msg="shim disconnected" id=3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.824153879-04:00" level=warning msg="cleaning up after shim disconnected" id=3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd namespace=moby
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.824165054-04:00" level=info msg="cleaning up dead shim"
Sep 11 08:56:22 atom-cht-master1 containerd: time="2022-09-11T08:56:22.843151517-04:00" level=warning msg="cleanup warnings time=\"2022-09-11T08:56:22-04:00\" level=info msg=\"starting signal loop\" namespace=moby pid=79361 runtime=io.containerd.runc.v2\n"
Sep 11 08:56:23 atom-cht-master1 kubelet: I0911 08:56:23.628113     930 scope.go:110] "RemoveContainer" containerID="3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd"
Sep 11 08:56:23 atom-cht-master1 kubelet: I0911 08:56:23.628831     930 scope.go:110] "RemoveContainer" containerID="44a22b7169ef97cb4783b9053a58a3735e6ddb2649734ad5936a4392fded16bf"
Sep 11 08:56:23 atom-cht-master1 kubelet: I0911 08:56:23.637960     930 helpers.go:292] "Unable to create pod sandbox due to conflict. Attempting to remove sandbox" containerID="3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd"
Sep 11 08:56:23 atom-cht-master1 kubelet: E0911 08:56:23.641977     930 remote_runtime.go:505] "RemoveContainer from runtime service failed" err="rpc error: code = Unknown desc = failed to remove container \"3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd\": Error response from daemon: removal of container 3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd is already in progress" containerID="3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd"
Sep 11 08:56:23 atom-cht-master1 kubelet: I0911 08:56:23.642030     930 pod_container_deletor.go:52] "DeleteContainer returned error" containerID={Type:docker ID:3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd} err="rpc error: code = Unknown desc = failed to remove container \"3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd\": Error response from daemon: removal of container 3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd is already in progress"
Sep 11 08:56:23 atom-cht-master1 kubelet: E0911 08:56:23.642875     930 remote_runtime.go:394] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = Error response from daemon: Conflict. The container name \"/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5\" is already in use by container \"3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd\". You have to remove (or rename) that container to be able to reuse that name." podSandboxID="6c0d77e91e02f43daaf4b8cddcf0a20acfbf76deecfe73df2d04791d75627d23"
Sep 11 08:56:23 atom-cht-master1 kubelet: E0911 08:56:23.643006     930 kuberuntime_manager.go:919] container &Container{Name:piraeus-operator,Image:10.106.35.180/atom/piraeus-operator:v1.9.1,Command:[],Args:[--create-backups=true --create-monitoring=true],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:WATCH_NAMESPACE,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:POD_NAME,Value:,ValueFrom:&EnvVarSource{FieldRef:&ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.name,},ResourceFieldRef:nil,ConfigMapKeyRef:nil,SecretKeyRef:nil,},},EnvVar{Name:OPERATOR_NAME,Value:piraeus-op-operator,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{cpu: {{150 -3} {<nil>} 150m DecimalSI},memory: {{268435456 0} {<nil>}  BinarySI},},},VolumeMounts:[]VolumeMount{VolumeMount{Name:backup-dir,ReadOnly:false,MountPath:/run/linstor-backups,SubPath:,MountPropagation:nil,SubPathExpr:,},VolumeMount{Name:kube-api-access-gcf5b,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/healthz,Port:{0 8081 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},ReadinessProbe:&Probe{ProbeHandler:ProbeHandler{Exec:nil,HTTPGet:&HTTPGetAction{Path:/readyz,Port:{0 8081 },Host:,Scheme:HTTP,HTTPHeaders:[]HTTPHeader{},},TCPSocket:nil,GRPC:nil,},InitialDelaySeconds:0,TimeoutSeconds:1,PeriodSeconds:10,SuccessThreshold:1,FailureThreshold:3,TerminationGracePeriodSeconds:nil,},Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,} start failed in pod piraeus-op-operator-7c9bcdcf85-jw7rc_linstor(55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59): CreateContainerError: Error response from daemon: Conflict. The container name "/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5" is already in use by container "3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd". You have to remove (or rename) that container to be able to reuse that name.
Sep 11 08:56:23 atom-cht-master1 kubelet: E0911 08:56:23.643058     930 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"piraeus-operator\" with CreateContainerError: \"Error response from daemon: Conflict. The container name \\\"/k8s_piraeus-operator_piraeus-op-operator-7c9bcdcf85-jw7rc_linstor_55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59_5\\\" is already in use by container \\\"3ab81c3e10bcdc49914dcd29507b50bdec7243e05303b8ed9052efabf6b460dd\\\". You have to remove (or rename) that container to be able to reuse that name.\"" pod="linstor/piraeus-op-operator-7c9bcdcf85-jw7rc" podUID=55f1efa5-ee79-4ca9-9aa6-27a8a1b79d59

sysctl -p output:

[atom@atom-cht-master1 log]$ sudo sysctl -p
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-arptables = 1
net.ipv4.ip_forward = 1
kernel.pid_max = 100000
user.max_user_namespaces = 15000
vm.max_map_count = 262144
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.conf.all.rp_filter = 0

Possible Solution

As a workaround, I'm currently running kubectl delete pod --all --all-namespaces --force, or restarting the Calico pods alone. Looking for suggestions on this.
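
A gentler variant I may try instead of force-deleting everything (a sketch only, not yet validated here) is to restart just the Calico workloads:

    # roll the Calico DaemonSet and controller Deployment instead of force-deleting every pod
    kubectl -n kube-system rollout restart daemonset/calico-node
    kubectl -n kube-system rollout restart deployment/calico-kube-controllers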

Steps to Reproduce (for bugs)

  1. Reboot all nodes of the Kubernetes cluster

Context

This has added extra steps to recover the setup, namely force-killing all pods.

Your Environment

  • Calico version: 3.24.1, without any changes to the official manifest
  • Orchestrator version: Kubernetes 1.22.12 (and also Kubernetes 1.23.10), created via kubeadm
  • Operating System and version: CentOS Linux release 7.9.2009
  • Nodes: single interface with static IPs and local NTP in sync, with no firewall running. Very limited/no internet; all images are pulled from a neighboring Docker registry
caseydavenport (Member) commented:

> 2022-09-11 19:52:00.916 [ERROR][1] cni-installer/ : Unable to create token for CNI kubeconfig error=Unauthorized

Looks like it is attempting to get a token, but is being refused. Something to check here would be your apiserver logs to see why it is rejecting the TokenRequest.
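
For example, something along these lines (assuming a kubeadm control plane, where the apiserver runs as a static pod named kube-apiserver-<node>):

    # pull the apiserver logs from a control-plane node and look for token rejections
    kubectl -n kube-system logs kube-apiserver-atom-cht-master1 | grep -iE 'authenticate|token'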

vinayus (Author) commented Sep 15, 2022

Here are the apiserver logs:
master1 --> https://justpaste.it/82l2p
master2 --> https://justpaste.it/1zix4
master3 --> https://justpaste.it/5vrdc

kube-proxy basically has only Unauthorized messages; one of them is here --> https://justpaste.it/9p381

Not sure if this helps, but this is how the cluster is currently created:

    KUBEADM_CERT_KEY=$(kubeadm certs certificate-key)
    sudo kubeadm init phase upload-certs --upload-certs --certificate-key=$KUBEADM_CERT_KEY
    JOIN_CMD=$(sudo kubeadm token create --print-join-command --certificate-key $KUBEADM_CERT_KEY)

which I guess should mean the same signing key is used for all apiservers, right?
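
To double-check that, something like the following should confirm the signing key pair is identical on every master (a sketch, assuming the default kubeadm PKI layout):

    # sa.key/sa.pub sign and verify service account tokens; checksums should match on all masters
    for m in atom-cht-master1 atom-cht-master2 atom-cht-master3; do
      ssh "$m" 'sudo sha256sum /etc/kubernetes/pki/sa.key /etc/kubernetes/pki/sa.pub'
    done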

lmm (Contributor) commented Sep 20, 2022

I see that the apiserver logs are complaining about the token:

E0915 12:46:35.231754       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has expired]"
E0915 12:46:35.668872       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token is not valid yet]"

which looks like two different tokens are being used: https://github.com/kubernetes/kubernetes/blob/03c76decb2c2e74b2bdd9e46f15c84235f6c6cd3/pkg/serviceaccount/claims.go#L124-L129

I'm not exactly sure what that means. WDYT @caseydavenport?
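
One way to see which case a given pod is hitting might be to decode the claims of the token it actually mounts (a rough sketch; the pod name is just one taken from the logs above):

    # dump the mounted service account token and decode its payload (iat/nbf/exp claims)
    TOKEN=$(kubectl -n kube-system exec calico-node-6msxr -c calico-node -- \
      cat /var/run/secrets/kubernetes.io/serviceaccount/token)
    echo "$TOKEN" | cut -d. -f2 | python3 -c 'import sys,base64,json; p=sys.stdin.read().strip(); p += "=" * (-len(p) % 4); print(json.dumps(json.loads(base64.urlsafe_b64decode(p)), indent=2))'
    date +%s   # compare the node clock against the claims printed above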

> also have tried on a clean cluster as well.

@vinayus did you try the reboot on a clean v3.24.x cluster?

vinayus (Author) commented Sep 20, 2022

Hi @lmm, if you are referring to the VM reboot, then yes, and that is how it is reproduced.
If you mean a fresh install of 3.24 on a fresh cluster, then no. Sadly the cluster was created with 3.22.1 and updated to 3.24.1.

The upgrade was done using kubectl apply -f calico-3.24.yaml.
There was also one occasion where we deleted the manifest and re-applied it.

caseydavenport (Member) commented:

If kube-proxy is experiencing the same issue then this is definitely not a Calico problem, but a problem with the serviceaccount token signing or distribution infrastructure for the cluster.

Somehow the tokens being provided to clients on that node (kube-proxy and Calico) are outdated.

The fact that a VM reboot triggers the issue points to something that is perhaps invalidating the node's tokens on shutdown?

@lmm I'm guessing the two tokens are from different components, since it seems like more than just Calico is hitting this issue.

vinayus (Author) commented Sep 20, 2022

I don't see any errors in the pod description or logs of kube-proxy. At least for now, they look like the below:

I0921 02:10:07.427366 1 node.go:172] Successfully retrieved node IP: 10.106.35.189
I0921 02:10:07.427506 1 server_others.go:140] Detected node IP 10.106.35.189
I0921 02:10:07.497304 1 server_others.go:206] kube-proxy running in dual-stack mode, IPv4-primary
I0921 02:10:07.497376 1 server_others.go:274] Using ipvs Proxier.
I0921 02:10:07.497397 1 server_others.go:276] creating dualStackProxier for ipvs.
W0921 02:10:07.497429 1 server_others.go:495] detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6
E0921 02:10:07.497889 1 proxier.go:379] "can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1"
I0921 02:10:07.498093 1 proxier.go:438] "IPVS scheduler not specified, use rr by default"
E0921 02:10:07.498262 1 proxier.go:379] "can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1"
I0921 02:10:07.498424 1 proxier.go:438] "IPVS scheduler not specified, use rr by default"
W0921 02:10:07.498459 1 ipset.go:113] ipset name truncated; [KUBE-6-LOAD-BALANCER-SOURCE-CIDR] -> [KUBE-6-LOAD-BALANCER-SOURCE-CID]
W0921 02:10:07.498495 1 ipset.go:113] ipset name truncated; [KUBE-6-NODE-PORT-LOCAL-SCTP-HASH] -> [KUBE-6-NODE-PORT-LOCAL-SCTP-HAS]
I0921 02:10:07.498774 1 server.go:649] Version: v1.22.12
I0921 02:10:07.504573 1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0921 02:10:07.504615 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0921 02:10:07.504682 1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0921 02:10:07.504747 1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0921 02:10:07.505950 1 config.go:315] Starting service config controller
I0921 02:10:07.505969 1 shared_informer.go:240] Waiting for caches to sync for service config
I0921 02:10:07.506040 1 config.go:224] Starting endpoint slice config controller
I0921 02:10:07.506043 1 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
I0921 02:10:07.606262 1 shared_informer.go:247] Caches are synced for endpoint slice config
I0921 02:10:07.606415 1 shared_informer.go:247] Caches are synced for service config

Just a couple of times out of 20 trials, we had a proper recovery without any failures. We tried to narrow down the difference, but so far no luck.

One observation is that a few calico-node pods are ready (1/1) but still post Unauthorized messages in their logs, like below:

calico-node 2022-09-20 18:13:22.921 [INFO][70] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=connection is unauthorized: Unauthorized
calico-node 2022-09-20 18:13:23.921 [INFO][70] status-reporter/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses"
calico-node 2022-09-20 18:13:23.924 [INFO][70] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=connection is unauthorized: Unauthorized
calico-node 2022-09-20 18:13:24.924 [INFO][70] status-reporter/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses"
calico-node 2022-09-20 18:13:24.928 [INFO][70] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=connection is unauthorized: Unauthorized
calico-node 2022-09-20 18:13:25.928 [INFO][70] status-reporter/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses"
calico-node 2022-09-20 18:13:25.932 [INFO][70] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=connection is unauthorized: Unauthorized
calico-node 2022-09-20 18:13:26.932 [INFO][70] status-reporter/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses"
calico-node 2022-09-20 18:13:26.936 [INFO][70] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=connection is unauthorized: Unauthorized
calico-node 2022-09-20 18:13:27.936 [INFO][70] status-reporter/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses"
calico-node 2022-09-20 18:13:27.939 [INFO][70] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=connection is unauthorized: Unauthorized

Under the apiserver (with verbose logging) we do see Unauthorized errors for GET calls, which seems related, maybe?

E0920 18:20:17.445808 1 claims.go:126] unexpected validation error: *errors.errorString
E0920 18:20:17.446015 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token could not be validated.]"
I0920 18:20:17.446168 1 httplog.go:109] "HTTP" verb="GET" URI="/apis/crd.projectcalico.org/v1/clusterinformations/default" latency="1.298888ms" userAgent="Go-http-client/2.0" audit-ID="dc460718-1009-4582-b01b-3d9e4635e74c" srcIP="10.106.35.185:7144" resp=401

E0920 18:20:13.574303 1 claims.go:126] unexpected validation error: *errors.errorString
E0920 18:20:13.574439 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, Token could not be validated.]"
I0920 18:20:13.574557 1 httplog.go:109] "HTTP" verb="LIST" URI="/apis/discovery.k8s.io/v1beta1/endpointslices?resourceVersion=152753944" latency="936.348µs" userAgent="speaker/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="c54970d3-11e4-4521-8d66-a2481fe3857f" srcIP="10.106.35.185:64094" resp=401

If there is any other info I can capture, please do ask; I will try to gather as much as possible.

vinayus (Author) commented Sep 28, 2022

Looks like we figured out the issue. The root cause seems to be NTP sync between the hypervisor (ESXi in our case) and these VMs.
The hypervisor was force-setting an incorrect time during boot-up, and since our NTP clients were taking a bit longer to sync than the docker and kubelet services took to start, we ended up with invalid timestamps for the tokens.

The journalctl logs of ntpd showed that the slew was -28000s, which maps to the hypervisor's time and is way off. Sadly, the hypervisor time was set manually and was not in sync with any public time servers, so all underlying VMs were affected. We overlooked it because we always ended up checking the time a couple of minutes later, by which point it would already have synced.
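
For anyone hitting something similar, these are the kinds of checks that exposed it for us (commands are a sketch for our CentOS 7 / ntpd nodes; adjust for chrony):

    journalctl -u ntpd --since today | grep -i slew   # this is where the -28000s slew showed up
    ntpq -p                                           # peer and offset status
    timedatectl status                                # overall clock/sync state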

As a simple precaution, I added dependencies to the docker and kubelet systemd units so that they wait for the NTP service to be properly up before starting, and inside the ntpd systemd service we added a forced immediate sync (ntpd -u ntp:ntp -g -x -q) as an ExecStartPre condition.
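
Roughly what the drop-ins look like (a sketch of what we did; paths and unit names may differ per distro):

    # /etc/systemd/system/kubelet.service.d/10-wait-for-ntp.conf
    # /etc/systemd/system/docker.service.d/10-wait-for-ntp.conf
    [Unit]
    After=ntpd.service
    Requires=ntpd.service

    # drop-in for ntpd.service itself: force one immediate sync before the daemon starts
    [Service]
    ExecStartPre=/usr/sbin/ntpd -u ntp:ntp -g -x -q

A systemctl daemon-reload plus a reboot test picks the drop-ins up.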

This was not an issue with Calico or Kubernetes, sadly. Apologies for the incorrect issue report. Please close the issue if nobody else is hitting this case.

I have cross-checked this on both Kubernetes 1.22.12 and 1.23.10 with Calico 3.24.1.
Thanks for helping with the pointers. Much appreciated 🥇
