'failed to reserve sandbox name' error after hard reboot #1014
The problem is that you have 2 instances of kube-scheduler which have "Attempt == 2". Do you have the kubelet and containerd logs from before the reboot? We need to figure out why that happened. There are 2 possibilities:
|
We need to figure out why it happened, and how to recover from it, given that instance 0 of the scheduler is also sitting there in the list of sandbox containers. We may need to add the same kind of aggressive recovery logic we have for regular containers to the sandbox containers, to address orphaned metadata entries and/or orphaned sandbox containers. |
@mikebrow Kubelet should handle everything, and we should only report the truth of the current state. However, in this case something must be wrong: either we are not reporting the truth, or kubelet is not behaving correctly. |
Thanks for the responses. Unfortunately, I didn't have the kubelet or containerd log level set to debug at the time the issue happened. As you can see, the VM was in such a bad state (I think due to high CPU and memory utilization) that kubelet doesn't even appear to have printed logs for a while before the reboot. Here are the kubelet logs at the time of reboot:
Jan 02 05:59:45 node1 kubelet[250135]: W0102 05:44:26.511266 250135 container.go:507] Failed to update stats for container "/system.slice/systemd-journald.service": failed to parse memory.usage_in_bytes - read /sys/fs/cgroup/memory/system.slice/systemd-journald.service/memory.usage_in_bytes: no such device, continuing to push stats
Jan 02 06:02:50 node1 kubelet[250135]: I0102 05:58:15.625938 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 65h28m34.260464164s ago; threshold is 3m0s]
Jan 02 06:03:35 node1 kubelet[250135]: E0102 05:52:05.467883 250135 kuberuntime_sandbox.go:198] ListPodSandbox failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 06:05:34 node1 kubelet[250135]: E0102 06:03:59.681519 250135 container_log_manager.go:174] Failed to rotate container logs: failed to list containers: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 06:06:09 node1 kubelet[250135]: E0102 06:03:13.975741 250135 remote_image.go:67] ListImages with filter nil from image service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:06:59 node1 kubelet[250135]: E0102 06:02:00.945335 250135 kubelet.go:2093] Container runtime sanity check failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 06:07:12 node1 kubelet[250135]: E0102 06:04:39.453068 250135 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:08:18 node1 kubelet[250135]: E0102 06:01:24.763178 250135 remote_image.go:83] ImageStatus "k8s.gcr.io/pause:3.1" from image service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:08:54 node1 kubelet[250135]: E0102 06:04:16.730867 250135 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 06:09:27 node1 kubelet[250135]: I0102 06:06:09.937213 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 65h52m25.104780642s ago; threshold is 3m0s]
Jan 02 06:09:51 node1 kubelet[250135]: E0102 06:07:00.448259 250135 remote_image.go:146] ImageFsInfo from image service failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 06:13:47 node1 kubelet[250135]: E0102 06:08:25.495647 250135 kubelet.go:1224] Container garbage collection failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:16:45 node1 kubelet[250135]: E0102 06:11:04.949897 250135 kubelet.go:1248] Image garbage collection failed multiple times in a row: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 06:18:31 node1 kubelet[250135]: E0102 06:10:47.436995 250135 remote_runtime.go:332] ExecSync eded36f67fb9fb5d4b53c2b13031b005488e4299b4525f19f067d5e4a28ec15b '/bin/sh -ec ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key get foo' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:19:27 node1 kubelet[250135]: I0102 06:17:01.793096 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 65h59m22.909790739s ago; threshold is 3m0s]
Jan 02 06:19:43 node1 kubelet[250135]: W0102 06:19:43.265573 250135 container.go:507] Failed to update stats for container "/system.slice/rsyslog.service": read /sys/fs/cgroup/cpu,cpuacct/system.slice/rsyslog.service/cpuacct.usage_percpu: no such device, continuing to push stats
Jan 02 06:20:35 node1 kubelet[250135]: E0102 06:19:53.091800 250135 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://10.0.22.191:6443/api/v1/nodes?fieldSelector=metadata.name%3Dnode1&limit=500&resourceVersion=0: dial tcp 10.0.22.191:6443: i/o timeout
Jan 02 06:24:18 node1 kubelet[250135]: E0102 06:12:45.820773 250135 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 06:26:04 node1 kubelet[250135]: E0102 06:19:43.769109 250135 kuberuntime_image.go:102] ListImages failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:49:56 node1 kubelet[250135]: I0102 06:25:12.597611 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 66h11m46.185165456s ago; threshold is 3m0s]
Jan 02 06:49:56 node1 kubelet[250135]: E0102 06:26:48.997372 250135 container_log_manager.go:174] Failed to rotate container logs: failed to list containers: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 06:49:56 node1 kubelet[250135]: E0102 06:24:31.441784 250135 kuberuntime_image.go:87] ImageStatus for image {"k8s.gcr.io/pause:3.1"} failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:49:56 node1 kubelet[250135]: W0102 06:28:13.485985 250135 image_gc_manager.go:192] [imageGCManager] Failed to update image list: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:49:56 node1 kubelet[250135]: E0102 06:32:23.862407 250135 remote_runtime.go:434] Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 06:50:10 node1 kubelet[250135]: I0102 06:42:00.569834 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 66h20m49.039936615s ago; threshold is 3m0s]
Jan 02 06:51:09 node1 kubelet[250135]: E0102 06:43:59.285805 250135 kubelet_network.go:106] Failed to ensure marking rule for KUBE-MARK-DROP: timed out while checking rules
Jan 02 06:51:37 node1 kubelet[250135]: W0102 06:48:09.771347 250135 container.go:507] Failed to update stats for container "/system.slice/systemd-journald.service": failed to parse memory.usage_in_bytes - read /sys/fs/cgroup/memory/system.slice/systemd-journald.service/memory.usage_in_bytes: no such device, continuing to push stats
Jan 02 06:51:43 node1 kubelet[250135]: E0102 06:43:18.110305 250135 remote_runtime.go:169] ListPodSandbox with filter nil from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:41:35 node1 kubelet[250135]: I0102 06:53:03.432801 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 66h40m8.970872531s ago; threshold is 3m0s]
Jan 02 07:49:13 node1 kubelet[250135]: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:49:13 node1 kubelet[250135]: E0102 07:14:31.703814 250135 remote_image.go:67] ListImages with filter nil from image service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:49:13 node1 kubelet[250135]: E0102 07:27:16.923135 250135 container_log_manager.go:174] Failed to rotate container logs: failed to list containers: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:49:13 node1 kubelet[250135]: E0102 07:33:29.187245 250135 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:49:13 node1 kubelet[250135]: E0102 07:33:00.805360 250135 kuberuntime_image.go:102] ListImages failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:49:13 node1 kubelet[250135]: I0102 07:34:07.293984 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 67h16m46.680280374s ago; threshold is 3m0s]
Jan 02 07:49:13 node1 kubelet[250135]: W0102 07:40:12.294121 250135 container.go:507] Failed to update stats for container "/system.slice/systemd-journald.service": read /sys/fs/cgroup/cpu,cpuacct/system.slice/systemd-journald.service/cpuacct.stat: no such device, continuing to push stats
Jan 02 07:49:13 node1 kubelet[250135]: E0102 07:36:32.089071 250135 event.go:212] Unable to write event: 'Patch https://10.0.22.191:6443/api/v1/namespaces/production/events/edge-rabbitmq-0.1574dcf4da048fa4: dial tcp 10.0.22.191:6443: i/o timeout' (may retry after sleeping)
Jan 02 07:52:12 node1 kubelet[250135]: E0102 07:46:56.824013 250135 remote_image.go:146] ImageFsInfo from image service failed: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 07:52:12 node1 kubelet[250135]: W0102 07:43:01.909734 250135 image_gc_manager.go:181] [imageGCManager] Failed to monitor images: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:54:22 node1 kubelet[250135]: E0102 07:44:32.112734 250135 remote_runtime.go:434] Status from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:54:22 node1 kubelet[250135]: E0102 07:44:12.374300 250135 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:54:46 node1 kubelet[250135]: E0102 07:48:54.837475 250135 kuberuntime_image.go:102] ListImages failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:55:52 node1 kubelet[250135]: I0102 07:49:18.584202 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 67h31m0.218586782s ago; threshold is 3m0s]
Jan 02 07:58:00 node1 kubelet[250135]: E0102 07:57:03.467992 250135 kubelet.go:1224] Container garbage collection failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 07:59:18 node1 kubelet[250135]: W0102 07:57:45.964650 250135 image_gc_manager.go:192] [imageGCManager] Failed to update image list: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Jan 02 08:02:52 node1 kubelet[250135]: E0102 07:54:29.637241 250135 kubelet.go:1248] Image garbage collection failed multiple times in a row: rpc error: code = Unavailable desc = grpc: the connection is unavailable
Jan 02 08:03:43 node1 kubelet[250135]: I0102 08:03:31.886928 250135 kubelet.go:1771] skipping pod synchronization - [container runtime is down PLEG is not healthy: pleg was last seen active 67h46m6.515419587s ago; threshold is 3m0s]
-- Reboot --
Jan 02 16:12:10 node1 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jan 02 16:12:52 node1 kubelet[1014]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jan 02 16:12:52 node1 kubelet[1014]: Flag --container-log-max-files has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jan 02 16:12:52 node1 kubelet[1014]: Flag --container-log-max-size has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jan 02 16:12:52 node1 kubelet[1014]: Flag --runtime-request-timeout has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jan 02 16:12:53 node1 kubelet[1014]: Flag --resolv-conf has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jan 02 16:12:53 node1 kubelet[1014]: Flag --container-log-max-files has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jan 02 16:12:53 node1 kubelet[1014]: Flag --container-log-max-size has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jan 02 16:12:53 node1 kubelet[1014]: Flag --runtime-request-timeout has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jan 02 16:12:53 node1 kubelet[1014]: I0102 16:12:53.516342 1014 server.go:408] Version: v1.11.6
Jan 02 16:12:53 node1 kubelet[1014]: I0102 16:12:53.618406 1014 plugins.go:97] No cloud provider specified.
Jan 02 16:12:53 node1 kubelet[1014]: I0102 16:12:53.669199 1014 certificate_store.go:131] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.477425 1014 server.go:648] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.505471 1014 container_manager_linux.go:243] container manager verified user specified cgroup-root exists: []
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.505562 1014 container_manager_linux.go:248] Creating Container Manager object based on Node Config: {RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsName: ContainerRuntime:remote CgroupsPerQOS:true CgroupRoot:/ CgroupDriver:cgroupfs KubeletRootDir:/var/lib/kubelet ProtectKernelDefaults:false NodeAllocatableConfig:{KubeReservedCgroupName: SystemReservedCgroupName: EnforceNodeAllocatable:map[pods:{}] KubeReserved:map[] SystemReserved:map[] HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:100Mi Percentage:0} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.1} GracePeriod:0s MinReclaim:<nil>} {Signal:nodefs.inodesFree Operator:LessThan Value:{Quantity:<nil> Percentage:0.05} GracePeriod:0s MinReclaim:<nil>} {Signal:imagefs.available Operator:LessThan Value:{Quantity:<nil> Percentage:0.15} GracePeriod:0s MinReclaim:<nil>}]} QOSReserved:map[] ExperimentalCPUManagerPolicy:none ExperimentalCPUManagerReconcilePeriod:10s ExperimentalPodPidsLimit:-1 EnforceCPULimits:true}
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.506035 1014 container_manager_linux.go:267] Creating device plugin manager: true
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.506177 1014 state_mem.go:36] [cpumanager] initializing new in-memory state store
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.663586 1014 state_mem.go:84] [cpumanager] updated default cpuset: ""
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.663655 1014 state_mem.go:92] [cpumanager] updated cpuset assignments: "map[]"
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.677989 1014 kubelet.go:274] Adding pod path: /etc/kubernetes/manifests
Jan 02 16:12:54 node1 kubelet[1014]: I0102 16:12:54.692507 1014 kubelet.go:299] Watching apiserver
Jan 02 16:12:55 node1 kubelet[1014]: E0102 16:12:55.245453 1014 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://10.0.22.191:6443/api/v1/nodes?fieldSelector=metadata.name%3Dnode1&limit=500&resourceVersion=0: dial tcp 10.0.22.191:6443: connect: connection refused
Jan 02 16:12:55 node1 kubelet[1014]: E0102 16:12:55.245533 1014 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://10.0.22.191:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.0.22.191:6443: connect: connection refused
Jan 02 16:12:55 node1 kubelet[1014]: E0102 16:12:55.245577 1014 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.0.22.191:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&limit=500&resourceVersion=0: dial tcp 10.0.22.191:6443: connect: connection refused
Jan 02 16:12:56 node1 kubelet[1014]: E0102 16:12:56.246381 1014 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:464: Failed to list *v1.Node: Get https://10.0.22.191:6443/api/v1/nodes?fieldSelector=metadata.name%3Dnode1&limit=500&resourceVersion=0: dial tcp 10.0.22.191:6443: connect: connection refused
Jan 02 16:12:56 node1 kubelet[1014]: E0102 16:12:56.247557 1014 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:455: Failed to list *v1.Service: Get https://10.0.22.191:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.0.22.191:6443: connect: connection refused
Jan 02 16:12:56 node1 kubelet[1014]: E0102 16:12:56.248821 1014 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://10.0.22.191:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dnode1&limit=500&resourceVersion=0: dial tcp 10.0.22.191:6443: connect: connection refused
Jan 02 16:12:56 node1 kubelet[1014]: E0102 16:12:56.680498 1014 remote_runtime.go:69] Version from runtime service failed: rpc error: code = Unknown desc = server is not initialized yet
Jan 02 16:12:56 node1 kubelet[1014]: E0102 16:12:56.680840 1014 kuberuntime_manager.go:172] Get runtime version failed: rpc error: code = Unknown desc = server is not initialized yet
Jan 02 16:12:56 node1 kubelet[1014]: F0102 16:12:56.680888 1014 server.go:262] failed to run Kubelet: failed to create kubelet: rpc error: code = Unknown desc = server is not initialized yet
Jan 02 16:12:56 node1 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jan 02 16:12:56 node1 systemd[1]: kubelet.service: Failed with result 'exit-code'.
Jan 02 16:13:06 node1 systemd[1]: kubelet.service: Service hold-off time over, scheduling restart.
Jan 02 16:13:06 node1 systemd[1]: kubelet.service: Scheduled restart job, restart counter is at 1.
Jan 02 16:13:06 node1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
And here are the containerd logs at the time of reboot:
Jan 02 04:41:45 node1 containerd[29802]: time="2019-01-02T04:41:39.673899977Z" level=info msg="starting containerd" revision=9b32062dc1f5a7c2564315c269b5059754f12b9d version=v1.2.1
Jan 02 04:41:45 node1 containerd[29802]: time="2019-01-02T04:41:46.054889862Z" level=info msg="loading plugin "io.containerd.content.v1.content"..." type=io.containerd.content.v1
Jan 02 04:42:06 node1 containerd[29802]: time="2019-01-02T04:41:56.968366655Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.btrfs"..." type=io.containerd.snapshotter.v1
Jan 02 04:43:11 node1 containerd[29802]: time="2019-01-02T04:42:54.968807606Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.btrfs" error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
Jan 02 04:43:17 node1 containerd[29802]: time="2019-01-02T04:43:16.423999810Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.aufs"..." type=io.containerd.snapshotter.v1
Jan 02 04:43:20 node1 containerd[29802]: time="2019-01-02T04:43:20.467404089Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.native"..." type=io.containerd.snapshotter.v1
Jan 02 04:43:20 node1 containerd[29802]: time="2019-01-02T04:43:20.481799661Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
Jan 02 04:43:20 node1 containerd[29802]: time="2019-01-02T04:43:20.560129408Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
Jan 02 04:43:22 node1 containerd[29802]: time="2019-01-02T04:43:20.930865651Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.zfs" error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the z
fs snapshotter"
Jan 02 04:43:22 node1 containerd[29802]: time="2019-01-02T04:43:20.931327785Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
Jan 02 04:43:22 node1 containerd[29802]: time="2019-01-02T04:43:20.932160282Z" level=warning msg="could not use snapshotter btrfs in metadata plugin" error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the b
trfs snapshotter"
Jan 02 04:43:22 node1 containerd[29802]: time="2019-01-02T04:43:20.932200870Z" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs sna
pshotter"
Jan 02 04:43:23 node1 containerd[29802]: time="2019-01-02T04:43:23.473684376Z" level=info msg="loading plugin "io.containerd.differ.v1.walking"..." type=io.containerd.differ.v1
Jan 02 04:43:23 node1 containerd[29802]: time="2019-01-02T04:43:23.474422488Z" level=info msg="loading plugin "io.containerd.gc.v1.scheduler"..." type=io.containerd.gc.v1
Jan 02 04:43:23 node1 containerd[29802]: time="2019-01-02T04:43:23.753873589Z" level=info msg="loading plugin "io.containerd.service.v1.containers-service"..." type=io.containerd.service.v1
Jan 02 04:43:30 node1 containerd[29802]: time="2019-01-02T04:43:23.920711116Z" level=info msg="loading plugin "io.containerd.service.v1.content-service"..." type=io.containerd.service.v1
Jan 02 04:43:30 node1 containerd[29802]: time="2019-01-02T04:43:23.920802019Z" level=info msg="loading plugin "io.containerd.service.v1.diff-service"..." type=io.containerd.service.v1
Jan 02 04:43:30 node1 containerd[29802]: time="2019-01-02T04:43:24.190301964Z" level=info msg="loading plugin "io.containerd.service.v1.images-service"..." type=io.containerd.service.v1
Jan 02 04:43:30 node1 containerd[29802]: time="2019-01-02T04:43:24.490346644Z" level=info msg="loading plugin "io.containerd.service.v1.leases-service"..." type=io.containerd.service.v1
Jan 02 04:43:30 node1 containerd[29802]: time="2019-01-02T04:43:24.490428291Z" level=info msg="loading plugin "io.containerd.service.v1.namespaces-service"..." type=io.containerd.service.v1
Jan 02 04:43:30 node1 containerd[29802]: time="2019-01-02T04:43:24.490459716Z" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
Jan 02 04:43:30 node1 containerd[29802]: time="2019-01-02T04:43:24.490492829Z" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
-- Reboot --
Jan 02 16:12:13 node1 systemd[1]: Starting containerd container runtime...
Jan 02 16:12:14 node1 systemd[1]: Started containerd container runtime.
Jan 02 16:12:40 node1 containerd[1119]: time="2019-01-02T16:12:40.703410994Z" level=info msg="starting containerd" revision=9b32062dc1f5a7c2564315c269b5059754f12b9d version=v1.2.1
Jan 02 16:12:40 node1 containerd[1119]: time="2019-01-02T16:12:40.704577289Z" level=info msg="loading plugin "io.containerd.content.v1.content"..." type=io.containerd.content.v1
Jan 02 16:12:40 node1 containerd[1119]: time="2019-01-02T16:12:40.717921482Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.btrfs"..." type=io.containerd.snapshotter.v1
Jan 02 16:12:40 node1 containerd[1119]: time="2019-01-02T16:12:40.718634192Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.btrfs" error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
Jan 02 16:12:40 node1 containerd[1119]: time="2019-01-02T16:12:40.718821163Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.aufs"..." type=io.containerd.snapshotter.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.311105860Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.native"..." type=io.containerd.snapshotter.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.311671703Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.overlayfs"..." type=io.containerd.snapshotter.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.323692741Z" level=info msg="loading plugin "io.containerd.snapshotter.v1.zfs"..." type=io.containerd.snapshotter.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.324241507Z" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.zfs" error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.324273154Z" level=info msg="loading plugin "io.containerd.metadata.v1.bolt"..." type=io.containerd.metadata.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.324322551Z" level=warning msg="could not use snapshotter btrfs in metadata plugin" error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs must be a btrfs filesystem to be used with the btrfs snapshotter"
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.324346307Z" level=warning msg="could not use snapshotter zfs in metadata plugin" error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter"
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.809974892Z" level=info msg="loading plugin "io.containerd.differ.v1.walking"..." type=io.containerd.differ.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810046364Z" level=info msg="loading plugin "io.containerd.gc.v1.scheduler"..." type=io.containerd.gc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810146257Z" level=info msg="loading plugin "io.containerd.service.v1.containers-service"..." type=io.containerd.service.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810194589Z" level=info msg="loading plugin "io.containerd.service.v1.content-service"..." type=io.containerd.service.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810233620Z" level=info msg="loading plugin "io.containerd.service.v1.diff-service"..." type=io.containerd.service.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810265645Z" level=info msg="loading plugin "io.containerd.service.v1.images-service"..." type=io.containerd.service.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810313526Z" level=info msg="loading plugin "io.containerd.service.v1.leases-service"..." type=io.containerd.service.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810345343Z" level=info msg="loading plugin "io.containerd.service.v1.namespaces-service"..." type=io.containerd.service.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810374493Z" level=info msg="loading plugin "io.containerd.service.v1.snapshots-service"..." type=io.containerd.service.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810405283Z" level=info msg="loading plugin "io.containerd.runtime.v1.linux"..." type=io.containerd.runtime.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810598859Z" level=info msg="loading plugin "io.containerd.runtime.v2.task"..." type=io.containerd.runtime.v2
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.810743549Z" level=info msg="loading plugin "io.containerd.monitor.v1.cgroups"..." type=io.containerd.monitor.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811431890Z" level=info msg="loading plugin "io.containerd.service.v1.tasks-service"..." type=io.containerd.service.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811478582Z" level=info msg="loading plugin "io.containerd.internal.v1.restart"..." type=io.containerd.internal.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811573944Z" level=info msg="loading plugin "io.containerd.grpc.v1.containers"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811608875Z" level=info msg="loading plugin "io.containerd.grpc.v1.content"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811638061Z" level=info msg="loading plugin "io.containerd.grpc.v1.diff"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811666848Z" level=info msg="loading plugin "io.containerd.grpc.v1.events"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811694655Z" level=info msg="loading plugin "io.containerd.grpc.v1.healthcheck"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811723269Z" level=info msg="loading plugin "io.containerd.grpc.v1.images"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811752069Z" level=info msg="loading plugin "io.containerd.grpc.v1.leases"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811780686Z" level=info msg="loading plugin "io.containerd.grpc.v1.namespaces"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.811809390Z" level=info msg="loading plugin "io.containerd.internal.v1.opt"..." type=io.containerd.internal.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.825527132Z" level=info msg="loading plugin "io.containerd.grpc.v1.snapshots"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.825594559Z" level=info msg="loading plugin "io.containerd.grpc.v1.tasks"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:41.825627273Z" level=info msg="loading plugin "io.containerd.grpc.v1.version"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:43.618229638Z" level=info msg="loading plugin "io.containerd.grpc.v1.cri"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:43.618524729Z" level=info msg="Start cri plugin with config {PluginConfig:{ContainerdConfig:{Snapshotter:overlayfs DefaultRuntime:{Type:io.containerd.runtime.v1.linux Engine: Root: Options:<nil>} UntrustedWorkloadRuntime:{Type: Engine: Root: Options:<nil>} Runtimes:map[] NoPivot:false} CniConfig:{NetworkPluginBinDir:/opt/cni/bin NetworkPluginConfDir:/etc/cni/net.d NetworkPluginConfTemplate:} Registry:{Mirrors:map[docker.io:{Endpoints:[https://registry-1.docker.io]}] Auths:map[]} StreamServerAddress:127.0.0.1 StreamServerPort:0 EnableSelinux:false SandboxImage:k8s.gcr.io/pause:3.1 StatsCollectPeriod:10 SystemdCgroup:false EnableTLSStreaming:false X509KeyPairStreaming:{TLSCertFile: TLSKeyFile:} MaxContainerLogLineSize:16384} ContainerdRootDir:/var/lib/containerd ContainerdEndpoint:/run/containerd/containerd.sock RootDir:/var/lib/containerd/io.containerd.grpc.v1.cri StateDir:/run/containerd/io.containerd.grpc.v1.cri}"
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:43.618663252Z" level=info msg="Connect containerd service"
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:43.618975085Z" level=info msg="Get image filesystem path "/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs""
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:44.568049691Z" level=info msg="loading plugin "io.containerd.grpc.v1.introspection"..." type=io.containerd.grpc.v1
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:44.568153282Z" level=info msg="Start subscribing containerd event"
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:44.568232670Z" level=info msg="Start recovering state"
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:44.575313027Z" level=info msg=serving... address="/run/containerd/containerd.sock"
Jan 02 16:12:45 node1 containerd[1119]: time="2019-01-02T16:12:44.575369347Z" level=info msg="containerd successfully booted in 4.100130s"
Jan 02 16:12:56 node1 containerd[1119]: time="2019-01-02T16:12:56.662098389Z" level=fatal msg="Failed to run CRI service" error="failed to recover state: failed to reserve sandbox name "kube-scheduler-node1_kube-system_705e7ce1217a37349a5567101e60165d_2": name "kube-scheduler-node1_kube-system_705e7ce1217a37349a5567101e60165d_2" is reserved for "139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2""
Jan 02 16:12:56 node1 systemd[1]: containerd.service: Main process exited, code=exited, status=1/FAILURE
Jan 02 16:12:56 node1 systemd[1]: containerd.service: Failed with result 'exit-code'. |
@steven-sheehy Can you search the kubelet log for kube-scheduler? Let's see whether we can find something useful, maybe not. |
Unfortunately, I don't see any mention of kube-scheduler in the kubelet logs. I do see that the system restarted on its own yesterday, but before that it hadn't rebooted since May. And I don't think there's any way I can reproduce this either.
$ sudo journalctl --list-boots
-3 325b4653bae24ff5a2271c54f8017ee2 Thu 2018-05-24 20:35:46 UTC—Thu 2018-05-24 20:42:30 UTC
-2 64c780b5ccfe4835a38c263471e4765a Thu 2018-05-24 20:43:12 UTC—Thu 2018-05-24 20:47:51 UTC
-1 fe79afade444401e8fcaca3a0efba8a3 Tue 2019-01-01 19:52:53 UTC—Wed 2019-01-02 15:42:13 UTC
0 48bf6b1af91f46a1aa89907ab7cd4517 Wed 2019-01-02 16:11:57 UTC—Wed 2019-01-02 19:38:42 UTC
$ sudo journalctl -u kubelet | grep -i "computePodActions got.*kube-scheduler"
$ sudo journalctl -u kubelet | grep -i kube-scheduler
$ |
If I can figure out how to start containerd, will there be any kubernetes audit events that are useful? |
If kubelet creates different kube-scheduler instances with the same attempt, the duplicate should be rejected by containerd. So this is more likely an issue on the containerd side: either metadata corruption, or some bad logic. Can you run ctr containers info against the kube-scheduler sandbox IDs and post the result here?
We can check the created and updated timestamps to see whether the metadata was touched after creation. And then let's see whether we can find something related in the containerd/kubelet log around that time. |
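As a rough sketch of that check (assuming jq is available and that ctr can reach containerd, which per the next comments first requires the CRI-disable workaround below), something like this prints the ID, CreatedAt and UpdatedAt of every kube-scheduler sandbox record:

# list every container record in the k8s.io namespace, keep only the sandbox records that
# belong to the kube-scheduler-node1 pod, and print their ID and timestamps
for id in $(sudo ctr -n=k8s.io containers ls -q); do
  sudo ctr -n=k8s.io containers info "$id" \
    | jq -r 'select(.Labels."io.kubernetes.pod.name" == "kube-scheduler-node1"
                    and .Labels."io.cri-containerd.kind" == "sandbox")
             | [.ID[0:13], .CreatedAt, .UpdatedAt] | @tsv'
done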
containerd is currently crashing with the aforementioned error, and it looks like ctr needs containerd to be running:
$ sudo ctr -n=k8s.io containers info 095fa5bfdd75e411290d11e90ec84bc2dec87b4e15d5c6215761b2518c5f8683
ctr: failed to dial "/run/containerd/containerd.sock": context deadline exceeded
$ ps -aef | grep containerd
user+ 158624 148032 0 21:33 pts/1 00:00:00 grep --color=auto containerd
$
I think I could probably get containerd running by deleting /var/lib/containerd and /run/containerd, but I was holding off in case you wanted me to debug anything further. |
You can start containerd with disabled_plugins = ["cri"] set in /etc/containerd/config.toml.
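A minimal sketch of that workaround (edit the config by hand; the key must be a top-level entry, and remember to revert it once the inspection is done):

# add this top-level key to /etc/containerd/config.toml, before any [section] headers,
# so the CRI plugin is skipped and containerd can start for inspection:
#
#   disabled_plugins = ["cri"]
#
sudo systemctl restart containerd
sudo ctr -n=k8s.io containers ls -q | head    # sanity check: ctr can reach the daemon again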
|
Collapsed output below:
ctr -n=k8s.io containers info 095fa5bfdd75e411290d11e90ec84bc2dec87b4e15d5c6215761b2518c5f8683
{
"ID": "095fa5bfdd75e411290d11e90ec84bc2dec87b4e15d5c6215761b2518c5f8683",
"Labels": {
"component": "kube-scheduler",
"io.cri-containerd.kind": "sandbox",
"io.kubernetes.pod.name": "kube-scheduler-node1",
"io.kubernetes.pod.namespace": "kube-system",
"io.kubernetes.pod.uid": "705e7ce1217a37349a5567101e60165d",
"tier": "control-plane"
},
"Image": "sha256:da86e6ba6ca197bf6bc5e9d900febd906b133eaa4750e6bed647b0fbe50ed43e",
"Runtime": {
"Name": "io.containerd.runtime.v1.linux",
"Options": {
"type_url": "containerd.linux.runc.RuncOptions"
}
},
"SnapshotKey": "095fa5bfdd75e411290d11e90ec84bc2dec87b4e15d5c6215761b2518c5f8683",
"Snapshotter": "overlayfs",
"CreatedAt": "2018-12-19T21:19:08.682812746Z",
"UpdatedAt": "2018-12-19T21:19:08.682812746Z",
"Extensions": {
"io.cri-containerd.sandbox.metadata": {
"type_url": "github.com/containerd/cri/pkg/store/sandbox/Metadata",
"value": "eyJWZXJzaW9uIjoidjEiLCJNZXRhZGF0YSI6eyJJRCI6IjA5NWZhNWJmZGQ3NWU0MTEyOTBkMTFlOTBlYzg0YmMyZGVjODdiNGUxNWQ1YzYyMTU3NjFiMjUxOGM1Zjg2ODMiLCJOYW1lIjoia3ViZS1zY2hlZHVsZXItZmlyZXNjb3BlMV9rdWJlLXN5c3RlbV83MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZF8wIiwiQ29uZmlnIjp7Im1ldGFkYXRhIjp7Im5hbWUiOiJrdWJlLXNjaGVkdWxlci1maXJlc2NvcGUxIiwidWlkIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSJ9LCJsb2dfZGlyZWN0b3J5IjoiL3Zhci9sb2cvcG9kcy83MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZCIsImRuc19jb25maWciOnsic2VydmVycyI6WyIxMC4wLjIyLjQ1IiwiMS4xLjEuMSJdLCJzZWFyY2hlcyI6WyJmaXJlc2NvcGUuaW50Il19LCJsYWJlbHMiOnsiY29tcG9uZW50Ijoia3ViZS1zY2hlZHVsZXIiLCJpby5rdWJlcm5ldGVzLnBvZC5uYW1lIjoia3ViZS1zY2hlZHVsZXItZmlyZXNjb3BlMSIsImlvLmt1YmVybmV0ZXMucG9kLm5hbWVzcGFjZSI6Imt1YmUtc3lzdGVtIiwiaW8ua3ViZXJuZXRlcy5wb2QudWlkIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJ0aWVyIjoiY29udHJvbC1wbGFuZSJ9LCJhbm5vdGF0aW9ucyI6eyJrdWJlcm5ldGVzLmlvL2NvbmZpZy5oYXNoIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJrdWJlcm5ldGVzLmlvL2NvbmZpZy5zZWVuIjoiMjAxOC0xMi0xOVQyMToxODo1NS41MjEwNjE5NzFaIiwia3ViZXJuZXRlcy5pby9jb25maWcuc291cmNlIjoiZmlsZSIsInNjaGVkdWxlci5hbHBoYS5rdWJlcm5ldGVzLmlvL2NyaXRpY2FsLXBvZCI6IiJ9LCJsaW51eCI6eyJjZ3JvdXBfcGFyZW50IjoiL2t1YmVwb2RzL2J1cnN0YWJsZS9wb2Q3MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZCIsInNlY3VyaXR5X2NvbnRleHQiOnsibmFtZXNwYWNlX29wdGlvbnMiOnsibmV0d29yayI6MiwicGlkIjoxfX19fSwiTmV0TlNQYXRoIjoiIiwiSVAiOiIiLCJSdW50aW1lSGFuZGxlciI6IiJ9fQ=="
}
},
"Spec": {
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/pause"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"cwd": "/",
"capabilities": {
"bounding": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"inheritable": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"noNewPrivileges": true,
"oomScoreAdj": -998
},
"root": {
"path": "rootfs",
"readonly": true
},
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=65536k"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/095fa5bfdd75e411290d11e90ec84bc2dec87b4e15d5c6215761b2518c5f8683/shm",
"options": [
"rbind",
"ro"
]
}
],
"annotations": {
"io.kubernetes.cri.container-type": "sandbox",
"io.kubernetes.cri.sandbox-id": "095fa5bfdd75e411290d11e90ec84bc2dec87b4e15d5c6215761b2518c5f8683"
},
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
}
],
"cpu": {
"shares": 2
}
},
"cgroupsPath": "/kubepods/burstable/pod705e7ce1217a37349a5567101e60165d/095fa5bfdd75e411290d11e90ec84bc2dec87b4e15d5c6215761b2518c5f8683",
"namespaces": [
{
"type": "pid"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/asound",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware",
"/proc/scsi"
],
"readonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
}
ctr -n=k8s.io containers info 139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2
{
"ID": "139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2",
"Labels": {
"component": "kube-scheduler",
"io.cri-containerd.kind": "sandbox",
"io.kubernetes.pod.name": "kube-scheduler-node1",
"io.kubernetes.pod.namespace": "kube-system",
"io.kubernetes.pod.uid": "705e7ce1217a37349a5567101e60165d",
"tier": "control-plane"
},
"Image": "k8s.gcr.io/pause:3.1",
"Runtime": {
"Name": "io.containerd.runtime.v1.linux",
"Options": {
"type_url": "containerd.linux.runc.RuncOptions"
}
},
"SnapshotKey": "139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2",
"Snapshotter": "overlayfs",
"CreatedAt": "2018-12-30T09:53:52.272719082Z",
"UpdatedAt": "2018-12-30T09:53:52.272719082Z",
"Extensions": {
"io.cri-containerd.sandbox.metadata": {
"type_url": "github.com/containerd/cri/pkg/store/sandbox/Metadata",
"value": "eyJWZXJzaW9uIjoidjEiLCJNZXRhZGF0YSI6eyJJRCI6IjEzOWJiMGFjN2UwNTBlOWUyOGI5OTRlNzhmNjUxYTg2MDlmNDI2ZjFiNWJiZmM4ODdhMGQ0YTMzNTBiNGVlZTIiLCJOYW1lIjoia3ViZS1zY2hlZHVsZXItZmlyZXNjb3BlMV9rdWJlLXN5c3RlbV83MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZF8yIiwiQ29uZmlnIjp7Im1ldGFkYXRhIjp7Im5hbWUiOiJrdWJlLXNjaGVkdWxlci1maXJlc2NvcGUxIiwidWlkIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImF0dGVtcHQiOjJ9LCJsb2dfZGlyZWN0b3J5IjoiL3Zhci9sb2cvcG9kcy83MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZCIsImRuc19jb25maWciOnsic2VydmVycyI6WyIxMC4wLjIyLjQ1IiwiMS4xLjEuMSJdLCJzZWFyY2hlcyI6WyJmaXJlc2NvcGUuaW50Il19LCJsYWJlbHMiOnsiY29tcG9uZW50Ijoia3ViZS1zY2hlZHVsZXIiLCJpby5rdWJlcm5ldGVzLnBvZC5uYW1lIjoia3ViZS1zY2hlZHVsZXItZmlyZXNjb3BlMSIsImlvLmt1YmVybmV0ZXMucG9kLm5hbWVzcGFjZSI6Imt1YmUtc3lzdGVtIiwiaW8ua3ViZXJuZXRlcy5wb2QudWlkIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJ0aWVyIjoiY29udHJvbC1wbGFuZSJ9LCJhbm5vdGF0aW9ucyI6eyJrdWJlcm5ldGVzLmlvL2NvbmZpZy5oYXNoIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJrdWJlcm5ldGVzLmlvL2NvbmZpZy5zZWVuIjoiMjAxOC0xMi0xOVQyMToxODo1NS41MjEwNjE5NzFaIiwia3ViZXJuZXRlcy5pby9jb25maWcuc291cmNlIjoiZmlsZSIsInNjaGVkdWxlci5hbHBoYS5rdWJlcm5ldGVzLmlvL2NyaXRpY2FsLXBvZCI6IiJ9LCJsaW51eCI6eyJjZ3JvdXBfcGFyZW50IjoiL2t1YmVwb2RzL2J1cnN0YWJsZS9wb2Q3MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZCIsInNlY3VyaXR5X2NvbnRleHQiOnsibmFtZXNwYWNlX29wdGlvbnMiOnsibmV0d29yayI6MiwicGlkIjoxfX19fSwiTmV0TlNQYXRoIjoiIiwiSVAiOiIiLCJSdW50aW1lSGFuZGxlciI6IiJ9fQ=="
}
},
"Spec": {
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/pause"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"cwd": "/",
"capabilities": {
"bounding": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"inheritable": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"noNewPrivileges": true,
"oomScoreAdj": -998
},
"root": {
"path": "rootfs",
"readonly": true
},
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=65536k"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2/shm",
"options": [
"rbind",
"ro"
]
}
],
"annotations": {
"io.kubernetes.cri.container-type": "sandbox",
"io.kubernetes.cri.sandbox-id": "139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2"
},
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
}
],
"cpu": {
"shares": 2
}
},
"cgroupsPath": "/kubepods/burstable/pod705e7ce1217a37349a5567101e60165d/139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2",
"namespaces": [
{
"type": "pid"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/asound",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware",
"/proc/scsi"
],
"readonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
}
ctr -n=k8s.io containers info 2428da7afb7fe092edb0a924c2a83b0aa1c37b71a0b572f47e064757e8f0e7c9
{
"ID": "2428da7afb7fe092edb0a924c2a83b0aa1c37b71a0b572f47e064757e8f0e7c9",
"Labels": {
"component": "kube-scheduler",
"io.cri-containerd.kind": "sandbox",
"io.kubernetes.pod.name": "kube-scheduler-node1",
"io.kubernetes.pod.namespace": "kube-system",
"io.kubernetes.pod.uid": "705e7ce1217a37349a5567101e60165d",
"tier": "control-plane"
},
"Image": "k8s.gcr.io/pause:3.1",
"Runtime": {
"Name": "io.containerd.runtime.v1.linux",
"Options": {
"type_url": "containerd.linux.runc.RuncOptions"
}
},
"SnapshotKey": "2428da7afb7fe092edb0a924c2a83b0aa1c37b71a0b572f47e064757e8f0e7c9",
"Snapshotter": "overlayfs",
"CreatedAt": "2018-12-30T10:23:41.224499667Z",
"UpdatedAt": "2018-12-30T10:23:41.224499667Z",
"Extensions": {
"io.cri-containerd.sandbox.metadata": {
"type_url": "github.com/containerd/cri/pkg/store/sandbox/Metadata",
"value": "eyJWZXJzaW9uIjoidjEiLCJNZXRhZGF0YSI6eyJJRCI6IjI0MjhkYTdhZmI3ZmUwOTJlZGIwYTkyNGMyYTgzYjBhYTFjMzdiNzFhMGI1NzJmNDdlMDY0NzU3ZThmMGU3YzkiLCJOYW1lIjoia3ViZS1zY2hlZHVsZXItZmlyZXNjb3BlMV9rdWJlLXN5c3RlbV83MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZF8yIiwiQ29uZmlnIjp7Im1ldGFkYXRhIjp7Im5hbWUiOiJrdWJlLXNjaGVkdWxlci1maXJlc2NvcGUxIiwidWlkIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJuYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImF0dGVtcHQiOjJ9LCJsb2dfZGlyZWN0b3J5IjoiL3Zhci9sb2cvcG9kcy83MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZCIsImRuc19jb25maWciOnsic2VydmVycyI6WyIxMC4wLjIyLjQ1IiwiMS4xLjEuMSJdLCJzZWFyY2hlcyI6WyJmaXJlc2NvcGUuaW50Il19LCJsYWJlbHMiOnsiY29tcG9uZW50Ijoia3ViZS1zY2hlZHVsZXIiLCJpby5rdWJlcm5ldGVzLnBvZC5uYW1lIjoia3ViZS1zY2hlZHVsZXItZmlyZXNjb3BlMSIsImlvLmt1YmVybmV0ZXMucG9kLm5hbWVzcGFjZSI6Imt1YmUtc3lzdGVtIiwiaW8ua3ViZXJuZXRlcy5wb2QudWlkIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJ0aWVyIjoiY29udHJvbC1wbGFuZSJ9LCJhbm5vdGF0aW9ucyI6eyJrdWJlcm5ldGVzLmlvL2NvbmZpZy5oYXNoIjoiNzA1ZTdjZTEyMTdhMzczNDlhNTU2NzEwMWU2MDE2NWQiLCJrdWJlcm5ldGVzLmlvL2NvbmZpZy5zZWVuIjoiMjAxOC0xMi0xOVQyMToxODo1NS41MjEwNjE5NzFaIiwia3ViZXJuZXRlcy5pby9jb25maWcuc291cmNlIjoiZmlsZSIsInNjaGVkdWxlci5hbHBoYS5rdWJlcm5ldGVzLmlvL2NyaXRpY2FsLXBvZCI6IiJ9LCJsaW51eCI6eyJjZ3JvdXBfcGFyZW50IjoiL2t1YmVwb2RzL2J1cnN0YWJsZS9wb2Q3MDVlN2NlMTIxN2EzNzM0OWE1NTY3MTAxZTYwMTY1ZCIsInNlY3VyaXR5X2NvbnRleHQiOnsibmFtZXNwYWNlX29wdGlvbnMiOnsibmV0d29yayI6MiwicGlkIjoxfX19fSwiTmV0TlNQYXRoIjoiIiwiSVAiOiIiLCJSdW50aW1lSGFuZGxlciI6IiJ9fQ=="
}
},
"Spec": {
"ociVersion": "1.0.1-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
},
"args": [
"/pause"
],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
],
"cwd": "/",
"capabilities": {
"bounding": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"effective": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"inheritable": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
],
"permitted": [
"CAP_CHOWN",
"CAP_DAC_OVERRIDE",
"CAP_FSETID",
"CAP_FOWNER",
"CAP_MKNOD",
"CAP_NET_RAW",
"CAP_SETGID",
"CAP_SETUID",
"CAP_SETFCAP",
"CAP_SETPCAP",
"CAP_NET_BIND_SERVICE",
"CAP_SYS_CHROOT",
"CAP_KILL",
"CAP_AUDIT_WRITE"
]
},
"noNewPrivileges": true,
"oomScoreAdj": -998
},
"root": {
"path": "rootfs",
"readonly": true
},
"mounts": [
{
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
{
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"nosuid",
"noexec",
"newinstance",
"ptmxmode=0666",
"mode=0620",
"gid=5"
]
},
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=65536k"
]
},
{
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"nosuid",
"noexec",
"nodev"
]
},
{
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"nosuid",
"noexec",
"nodev",
"ro"
]
},
{
"destination": "/dev/shm",
"type": "bind",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/2428da7afb7fe092edb0a924c2a83b0aa1c37b71a0b572f47e064757e8f0e7c9/shm",
"options": [
"rbind",
"ro"
]
}
],
"annotations": {
"io.kubernetes.cri.container-type": "sandbox",
"io.kubernetes.cri.sandbox-id": "2428da7afb7fe092edb0a924c2a83b0aa1c37b71a0b572f47e064757e8f0e7c9"
},
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
}
],
"cpu": {
"shares": 2
}
},
"cgroupsPath": "/kubepods/burstable/pod705e7ce1217a37349a5567101e60165d/2428da7afb7fe092edb0a924c2a83b0aa1c37b71a0b572f47e064757e8f0e7c9",
"namespaces": [
{
"type": "pid"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
}
],
"maskedPaths": [
"/proc/acpi",
"/proc/asound",
"/proc/kcore",
"/proc/keys",
"/proc/latency_stats",
"/proc/timer_list",
"/proc/timer_stats",
"/proc/sched_debug",
"/sys/firmware",
"/proc/scsi"
],
"readonlyPaths": [
"/proc/bus",
"/proc/fs",
"/proc/irq",
"/proc/sys",
"/proc/sysrq-trigger"
]
}
}
} |
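For what it's worth, the io.cri-containerd.sandbox.metadata extension in those dumps is base64-encoded JSON, so the reserved name and attempt number can be read straight out of it. A small sketch (assumes jq and GNU base64; the jq path follows the structure of the output above):

# decode the sandbox metadata of one of the conflicting records and show its reserved Name and attempt
sudo ctr -n=k8s.io containers info 139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2 \
  | jq -r '.Extensions."io.cri-containerd.sandbox.metadata".value' \
  | base64 -d \
  | jq '{Name: .Metadata.Name, Attempt: .Metadata.Config.metadata.attempt}'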
I don't understand why this could happen... The only possibility I can imagine is that the previous sandbox failed to be loaded for some reason, so another sandbox with the same attempt got created. Later the old sandbox could be loaded again, so we get the naming conflict. Can you search for "Failed to load sandbox" in the containerd log? |
The farthest back my kubelet and containerd logs go is Jan 02 03:22:26, and it looks like the containers above were created at 2018-12-30T10:23:41.224499667Z. journalctl is set to only allow 1 GB of logs, so that's why it doesn't go far back. I didn't notice the issue until I was back from holidays. At the time this occurred, the system was unusable and I couldn't even log into it. So it's possible that kubelet created a sandbox and then failed right after even loading/using that same sandbox. Searching for "Failed to load sandbox" doesn't return anything. Is there any other place such events are stored? It looks like Kubernetes events only have a 1-hour TTL. |
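As an aside, if journald retention is the limiting factor, a persistent journal with a larger cap keeps more history for the next incident. A sketch using standard journald drop-in configuration (the sizes are only examples):

# keep the journal on disk and raise its size cap so logs survive longer than 1 GB worth of churn
sudo mkdir -p /etc/systemd/journald.conf.d
sudo tee /etc/systemd/journald.conf.d/retention.conf >/dev/null <<'EOF'
[Journal]
Storage=persistent
SystemMaxUse=4G
EOF
sudo systemctl restart systemd-journald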
@steven-sheehy OK. Then just remove the bad container with ctr. Let's leave the issue open to see whether anyone else hits it. |
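A sketch of that cleanup, assuming containerd is running with the CRI plugin disabled as above and that the record to drop is the one named in the fatal error (which of the two attempt-2 duplicates to remove is a judgment call; this only deletes the metadata record, any leftover snapshot can be removed separately):

# delete the stale sandbox record that is holding the reserved name
sudo ctr -n=k8s.io containers delete 139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2
# then remove the disabled_plugins line from /etc/containerd/config.toml and restart
sudo systemctl restart containerd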
I highly suspect that the problem happened because of the scenario described above: a sandbox failed to be loaded, so a new one with the same attempt number was created, and the old one later loaded again.
The issue is that we don't currently have a good way to handle a sandbox/container which fails to be loaded. The same issue was discussed in #884 (comment) and kubernetes/kubernetes#69060. We should probably keep the fail-to-load sandbox/container during recovery and represent it in a special state, e.g. an unknown state. Kubelet should be aware of the unknown state and clean it up before starting new sandboxes/containers. |
Here are some lines from @steven-sheehy's containerd log:
This proves my theory in #1014 (comment). Apparently your node is in a bad state @steven-sheehy; fixing that bad condition should help you eliminate this issue. However, we should better handle the case where a sandbox/container fails to be loaded: we should still load it and keep it in an unknown state, and kubelet should try to stop the sandbox/container before creating and starting new ones. |
@steven-sheehy Sorry for the delay. It looks like your node ran out of PIDs, which is why you see the errors above. #1037 should fix the issue for you. And I'll fix the Kubernetes issue kubernetes/kubernetes#69060, which should completely fix the issue. |
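For anyone wanting to confirm PID/thread exhaustion on a node, a few quick checks (a sketch; the limits and cgroup paths vary by distro and kernel configuration):

sysctl kernel.pid_max kernel.threads-max                 # configured kernel limits
ps -eL | wc -l                                           # rough count of threads currently in use
cat /sys/fs/cgroup/pids/kubepods/pids.max 2>/dev/null    # per-cgroup pid limit, if the pids controller is mounted here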
Thanks @Random-Liu. Do you think containerd itself was causing the system to run out of pids or something else on the system? My system has open file descriptors set to 1048576. |
@steven-sheehy I don't think it is containerd itself. The issue is that containerd skips containers after load failure, which is the containerd problem. But pid exhaustion should be caused by something else on your node, e.g. your workload. |
PS: the config option for the workaround is |
In case anyone hits this issue in 2021 or beyond, it's worth noting that the plugin name is no longer "just" plain ole cri but the full io.containerd.grpc.v1.cri. Therefore, one needs to disable it by that full name in /etc/containerd/config.toml. I'm using a newer containerd release.
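A sketch of the same workaround on such a newer release, using the full plugin ID that also appears in the containerd log earlier in this thread:

# top-level key in /etc/containerd/config.toml (before any [section] headers):
#
#   disabled_plugins = ["io.containerd.grpc.v1.cri"]
#
sudo systemctl restart containerd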
After a VM running Kubernetes became completely unresponsive, I had to forcefully restart the VM. Upon reboot, containerd fails to start due to the below error:
failed to recover state: failed to reserve sandbox name "kube-scheduler-node1_kube-system_705e7ce1217a37349a5567101e60165d_2": name "kube-scheduler-node1_kube-system_705e7ce1217a37349a5567101e60165d_2" is reserved for "139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2"
Most likely the abrupt shutdown caused the containerd database and filesystem to become out of sync, but I would hope containerd could be more aggressive in recovering from such an error. The container is stateless and should just be forcibly removed and re-added.
At minimum, is there any workaround to recover from the above? I've already tried deleting /var/lib/containerd/io.containerd.runtime.v1.linux/k8s.io/139bb0ac7e050e9e28b994e78f651a8609f426f1b5bbfc887a0d4a3350b4eee2 and it didn't change anything.