kubernetes checkpoint-restore process failed #2366

Closed
tonyliu666 opened this issue Mar 15, 2024 · 41 comments

Comments

@tonyliu666

tonyliu666 commented Mar 15, 2024

Description

The checkpoint tar file is produced correctly, but restoring the pod on another node fails with the error `msg="criu failed: type NOTIFY errno 0"`. I followed the steps in https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/ and got stuck on the final step, deploying the checkpointed pod on the new node.

Describe the results you received:

  • kubelet Error: failed to restore container counters: container restore failed: time="2024-03-15T08:46:56Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: /run/containers/storage/overlay-containers/counters/userdata/restore.log"

Describe the results you expected:
Successfully running the new pod on another node

Additional information you deem important (e.g. issue happens only occasionally):
Running a k8s cluster on Vagrant VMs; CRI-O version 1.28.0 and CRIU version 3.19.

CRIU logs and information:

dump.log

By the way, I can't see /run/containers/storage/overlay-containers/counters/userdata/restore.log on the node running the checkpointed pod.
@adrianreber
Member

Please try with cgroupv1. Restoring containers with Kubernetes currently does not always work with cgroupv2. We are waiting for opencontainers/runc#3546 to be part of a runc release.

The missing restore.log is a known problem. If switching to cgroupv1 does not help, please share the CRIU log file.

You can do `echo "log-file /tmp/criu.log" >> /etc/criu/runc.conf`. Then you should be able to get a log file.
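
As a minimal sketch, the steps on the node where the restore fails would look roughly like this (assuming /etc/criu may need to be created first):

# create the drop-in config that CRIU reads when invoked by runc
sudo mkdir -p /etc/criu
echo "log-file /tmp/criu.log" | sudo tee -a /etc/criu/runc.conf
# retry the restore, then inspect the CRIU log
sudo cat /tmp/criu.log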

@tonyliu666
Author

tonyliu666 commented Mar 16, 2024

Thanks for your help, it works!

@HFourier

I have another question. Recently I tried to restore my checkpointed image; it successfully runs the new pod on the local node but fails on another node. I ran `kubectl describe pod <podname>`; the events are as follows:
Events:
  Type     Reason  Age               From     Message
  Normal   Pulled  6s (x3 over 18s)  kubelet  Container image "docker.io/wuch100519/migrations-24311a:dump" already present on machine
  Warning  Failed  6s (x3 over 18s)  kubelet  Error: image not known

@HFourier


I tried rebuilding a new image and using a local image instead of pulling from Docker Hub, but neither works.

@adrianreber
Member

What CRI-O currently does not do is pull missing images during restore. So you need to do a `crictl pull` of the image your checkpoint is based on. Looking at your checkpoint image, it should be something like:

crictl pull docker.io/wuch100519/new0306_migrations:2.0
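
On the destination node this would look roughly like the following; the grep is only there to verify the base image is now present, and the image reference should match whatever your checkpoint was taken from:

sudo crictl pull docker.io/wuch100519/new0306_migrations:2.0
sudo crictl images | grep new0306_migrations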

@HFourier


Thanks very, very much, it helped me greatly!

@HFourier

HFourier commented Apr 8, 2024

Please try with cgroupv1. Restoring containers with Kubernetes currently does not always work with cgroupv2. We are waiting for opencontainers/runc#3546 to be part of a runc release.

The missing restore.log is a known problem. If switching to cgroupv1 does not help, please share the CRIU log file.

You can do echo "log-file /tmp/criu.log" >> /etc/criu/runc.conf. Then you should be able to get a log file.

When I try again on another machine, I get the following log. I don't know what happened.

(00.000000) Parsing config file /etc/criu/default.conf
(00.000000) Unable to get $HOME directory, local configuration file will not be used.
(00.000000) Parsing config file /etc/criu/runc.conf
(00.000040) Version: 3.19 (gitid e025b7e)
(00.000045) Running on enb Linux 5.4.0-81-lowlatency #91-Ubuntu SMP PREEMPT Thu Jul 15 20:15:41 UTC 2021 x86_64
(00.000047) Would overwrite RPC settings with values from /etc/criu/runc.conf
(00.000053) Effective capability 40 missing
(00.000061) Loaded kdat cache from /run/criu.kdat
(00.000077) Hugetlb size 2 Mb is supported but cannot get dev's number
(00.000853) ptrace(PTRACE_GET_RSEQ_CONFIGURATION) is not supported
(00.001651) Added ipc:/var/run/ipcns/f944b8be-5304-42eb-8d18-9e268d7fd521 join namespace
(00.001667) Added uts:/var/run/utsns/f944b8be-5304-42eb-8d18-9e268d7fd521 join namespace
(00.001691) Parsing config file /etc/criu/runc.conf
(00.001725) mnt-v2: Mounts-v2 requires MOVE_MOUNT_SET_GROUP support
(00.001727) Mount engine fallback to --mntns-compat-mode mode
(00.001740) rlimit: RLIMIT_NOFILE unlimited for self
(00.001810) cpu: x86_family 6 x86_vendor_id GenuineIntel x86_model_id Intel(R) Xeon(R) D-1747NTE CPU @ 2.50GHz
(00.001814) cpu: fpu: xfeatures_mask 0x2e5 xsave_size 2696 xsave_size_max 2696 xsaves_size 2440
(00.001819) cpu: fpu: x87 floating point registers     xstate_offsets      0 / 0      xstate_sizes    160 / 160   
(00.001822) cpu: fpu: AVX registers                    xstate_offsets    576 / 576    xstate_sizes    256 / 256   
(00.001824) cpu: fpu: AVX-512 opmask                   xstate_offsets   1088 / 832    xstate_sizes     64 / 64    
(00.001826) cpu: fpu: AVX-512 Hi256                    xstate_offsets   1152 / 896    xstate_sizes    512 / 512   
(00.001829) cpu: fpu: AVX-512 ZMM_Hi256                xstate_offsets   1664 / 1408   xstate_sizes   1024 / 1024  
(00.001831) cpu: fpu: Protection Keys User registers   xstate_offsets   2688 / 2432   xstate_sizes      8 / 8     
(00.001833) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
(00.001865) kernel pid_max=4194304
(00.001867) Reading image tree
(00.001886) Add mnt ns 13 pid 1
(00.001889) Add net ns 10 pid 1
(00.001890) Add pid ns 9 pid 1
(00.001893) pstree pid_max=1
(00.001897) Will restore in 6c020000 namespaces
(00.001898) NS mask to use 6c020000
(00.001940) Collecting 51/56 (flags 3)
(00.001948) No memfd.img image
(00.001951)  `- ... done
(00.001954) Collecting 40/54 (flags 2)
(00.001965) Collected [usr/local/bin/python3.9] ID 0x1
(00.001969) Collected [lib/x86_64-linux-gnu/libnss_files-2.31.so] ID 0x2
(00.001972) Collected [usr/local/lib/python3.9/lib-dynload/unicodedata.cpython-39-x86_64-linux-gnu.so] ID 0x3
(00.001975) Collected [lib/x86_64-linux-gnu/liblzma.so.5.2.5] ID 0x4
(00.001977) Collected [usr/local/lib/python3.9/lib-dynload/_lzma.cpython-39-x86_64-linux-gnu.so] ID 0x5
(00.001980) Collected [lib/x86_64-linux-gnu/libbz2.so.1.0.4] ID 0x6
(00.001984) Collected [usr/local/lib/python3.9/lib-dynload/grp.cpython-39-x86_64-linux-gnu.so] ID 0x7
(00.001986) Collected [usr/local/lib/python3.9/lib-dynload/zlib.cpython-39-x86_64-linux-gnu.so] ID 0x8
(00.001990) Collected [usr/lib/x86_64-linux-gnu/libcrypto.so.1.1] ID 0x9
(00.001993) Collected [usr/lib/x86_64-linux-gnu/libssl.so.1.1] ID 0xa
(00.001998) Collected [usr/local/lib/python3.9/lib-dynload/_ssl.cpython-39-x86_64-linux-gnu.so] ID 0xb
(00.002001) Collected [lib/x86_64-linux-gnu/libz.so.1.2.11] ID 0xc
(00.002004) Collected [usr/local/lib/python3.9/lib-dynload/binascii.cpython-39-x86_64-linux-gnu.so] ID 0xd
(00.002007) Collected [usr/local/lib/python3.9/lib-dynload/_struct.cpython-39-x86_64-linux-gnu.so] ID 0xe
(00.002010) Collected [usr/local/lib/python3.9/lib-dynload/_socket.cpython-39-x86_64-linux-gnu.so] ID 0xf
(00.002013) Collected [usr/local/lib/python3.9/lib-dynload/_sha512.cpython-39-x86_64-linux-gnu.so] ID 0x10
(00.002016) Collected [usr/local/lib/python3.9/lib-dynload/_random.cpython-39-x86_64-linux-gnu.so] ID 0x11
(00.002019) Collected [usr/local/lib/python3.9/lib-dynload/_bisect.cpython-39-x86_64-linux-gnu.so] ID 0x12
(00.002022) Collected [usr/local/lib/python3.9/lib-dynload/_heapq.cpython-39-x86_64-linux-gnu.so] ID 0x13
(00.002037) Collected [usr/local/lib/python3.9/lib-dynload/_datetime.cpython-39-x86_64-linux-gnu.so] ID 0x14
(00.002041) Collected [usr/local/lib/python3.9/lib-dynload/math.cpython-39-x86_64-linux-gnu.so] ID 0x15
(00.002043) Collected [usr/local/lib/python3.9/lib-dynload/_bz2.cpython-39-x86_64-linux-gnu.so] ID 0x16
(00.002046) Collected [usr/local/lib/python3.9/lib-dynload/array.cpython-39-x86_64-linux-gnu.so] ID 0x17
(00.002053) Collected [usr/local/lib/python3.9/lib-dynload/select.cpython-39-x86_64-linux-gnu.so] ID 0x18
(00.002056) Collected [usr/lib/locale/C.UTF-8/LC_CTYPE] ID 0x19
(00.002059) Collected [lib/x86_64-linux-gnu/libm-2.31.so] ID 0x1a
(00.002062) Collected [lib/x86_64-linux-gnu/libutil-2.31.so] ID 0x1b
(00.002067) Collected [lib/x86_64-linux-gnu/libdl-2.31.so] ID 0x1c
(00.002070) Collected [lib/x86_64-linux-gnu/libpthread-2.31.so] ID 0x1d
(00.002073) Collected [lib/x86_64-linux-gnu/libc-2.31.so] ID 0x1e
(00.002075) Collected [usr/local/lib/libpython3.9.so.1.0] ID 0x1f
(00.002078) Collected [usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache] ID 0x20
(00.002083) Collected [lib/x86_64-linux-gnu/ld-2.31.so] ID 0x21
(00.002086) Collected [dev/null] ID 0x22
(00.002090) Collected pipe entry ID 0x23 PIPE ID 0x13baf6f5
(00.002097) Found id pipe:[331020021] (fd 2) in inherit fd list
(00.002101) Collected pipe entry ID 0x24 PIPE ID 0x13baf6f6
(00.002105) Found id pipe:[331020022] (fd 4) in inherit fd list
(00.002115) Collected [app] ID 0x26
(00.002118) Collected [.] ID 0x27
(00.002121)  `- ... done
(00.002123) Collecting 46/68 (flags 0)
(00.002128) No remap-fpath.img image
(00.002130)  `- ... done
(00.002135) No apparmor.img image
(00.002227) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002235) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002240) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002245) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002250) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002254) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002264) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002270) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002275) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002280) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002285) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002290) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002294) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002298) cg: Preparing cgroups yard (cgroups restore mode 0x4)
(00.002455) cg: Opening .criu.cgyard.05WiQs as cg yard
(00.002464) cg:         Making controller dir .criu.cgyard.05WiQs/freezer (freezer)
(00.002700) cg: Determined cgroup dir freezer//kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope already exist
(00.002709) cg: Skip restoring properties on cgroup dir freezer//kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podceffe4e8_81a7_40d0_8ad7_d4ca8309a410.slice/crio-468018e0242cde909ceea8aa86c18c185fee3c1845fec36df56124ae52e6011f.scope
(00.002725) cg:         Making controller dir .criu.cgyard.05WiQs/misc (misc)
(00.003123) Error (criu/cgroup.c:1931): cg:     Can't mount controller dir .criu.cgyard.05WiQs/misc: Invalid argument

@adrianreber
Member

adrianreber commented Apr 8, 2024

Is this on a cgroup v1 or v2 system?

@HFourier

HFourier commented Apr 8, 2024

Is this on a cgroup v1 or v2 system?

I am sure it is a v1 system; I ran `stat -fc %T /sys/fs/cgroup/` and it returned tmpfs.
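
For reference, that stat check is a common way to tell the two apart; the output strings are what the kernel reports, nothing specific to this issue:

stat -fc %T /sys/fs/cgroup/
# tmpfs     -> legacy cgroup v1 hierarchy mounted at /sys/fs/cgroup
# cgroup2fs -> unified cgroup v2 hierarchy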

@adrianreber
Member

Please attach the CRIU log during the checkpointing.

@HFourier

HFourier commented Apr 8, 2024

OK. I finished the checkpoint with `curl -X POST "https://166.111.xxx.190:10250/checkpoint/default/migrations-319a/migration-319a" --insecure --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt --key /etc/kubernetes/pki/apiserver-kubelet-client.key` and got the following log:

(00.000000) Parsing config file /etc/criu/default.conf
(00.000000) Unable to get $HOME directory, local configuration file will not be used.
(00.000000) Parsing config file /etc/criu/runc.conf
(00.000043) Version: 3.19 (gitid e025b7e)
(00.000048) Running on enb Linux 5.4.0-81-lowlatency #91-Ubuntu SMP PREEMPT Thu Jul 15 20:15:41 UTC 2021 x86_64
(00.000050) Would overwrite RPC settings with values from /etc/criu/runc.conf
(00.000057) Effective capability 40 missing
(00.000064) Loaded kdat cache from /run/criu.kdat
(00.000076) Hugetlb size 2 Mb is supported but cannot get dev's number
(00.000294) ptrace(PTRACE_GET_RSEQ_CONFIGURATION) is not supported
(00.000515) Added ipc:/var/run/ipcns/8fe1f1f4-7af4-46ee-9c67-ceb4b441d8c6 join namespace
(00.000523) Added uts:/var/run/utsns/8fe1f1f4-7af4-46ee-9c67-ceb4b441d8c6 join namespace
(00.000541) Parsing config file /etc/criu/runc.conf
(00.000571) mnt-v2: Mounts-v2 requires MOVE_MOUNT_SET_GROUP support
(00.000578) Mount engine fallback to --mntns-compat-mode mode
(00.000589) rlimit: RLIMIT_NOFILE unlimited for self
(00.000638) cpu: x86_family 6 x86_vendor_id GenuineIntel x86_model_id Intel(R) Xeon(R) D-1747NTE CPU @ 2.50GHz
(00.000642) cpu: fpu: xfeatures_mask 0x2e5 xsave_size 2696 xsave_size_max 2696 xsaves_size 2440
(00.000648) cpu: fpu: x87 floating point registers     xstate_offsets      0 / 0      xstate_sizes    160 / 160   
(00.000651) cpu: fpu: AVX registers                    xstate_offsets    576 / 576    xstate_sizes    256 / 256   
(00.000653) cpu: fpu: AVX-512 opmask                   xstate_offsets   1088 / 832    xstate_sizes     64 / 64    
(00.000655) cpu: fpu: AVX-512 Hi256                    xstate_offsets   1152 / 896    xstate_sizes    512 / 512   
(00.000657) cpu: fpu: AVX-512 ZMM_Hi256                xstate_offsets   1664 / 1408   xstate_sizes   1024 / 1024  
(00.000659) cpu: fpu: Protection Keys User registers   xstate_offsets   2688 / 2432   xstate_sizes      8 / 8     
(00.000661) cpu: fpu:1 fxsr:1 xsave:1 xsaveopt:1 xsavec:1 xgetbv1:1 xsaves:1
(00.000681) kernel pid_max=4194304
(00.000683) Reading image tree
(00.000704) Add mnt ns 13 pid 1
(00.000706) Add net ns 10 pid 1
(00.000709) Add pid ns 9 pid 1
(00.000712) pstree pid_max=1
(00.000716) Will restore in 6c020000 namespaces
(00.000718) NS mask to use 6c020000
(00.000756) Collecting 51/56 (flags 3)
(00.000765) No memfd.img image
(00.000768)  `- ... done
(00.000770) Collecting 40/54 (flags 2)
(00.000780) Collected [usr/local/bin/python3.9] ID 0x1
(00.000784) Collected [lib/x86_64-linux-gnu/libnss_files-2.31.so] ID 0x2
(00.000788) Collected [usr/local/lib/python3.9/lib-dynload/unicodedata.cpython-39-x86_64-linux-gnu.so] ID 0x3
(00.000791) Collected [lib/x86_64-linux-gnu/liblzma.so.5.2.5] ID 0x4
(00.000794) Collected [usr/local/lib/python3.9/lib-dynload/_lzma.cpython-39-x86_64-linux-gnu.so] ID 0x5
(00.000796) Collected [lib/x86_64-linux-gnu/libbz2.so.1.0.4] ID 0x6
(00.000801) Collected [usr/local/lib/python3.9/lib-dynload/grp.cpython-39-x86_64-linux-gnu.so] ID 0x7
(00.000804) Collected [usr/local/lib/python3.9/lib-dynload/zlib.cpython-39-x86_64-linux-gnu.so] ID 0x8
(00.000806) Collected [usr/lib/x86_64-linux-gnu/libcrypto.so.1.1] ID 0x9
(00.000810) Collected [usr/lib/x86_64-linux-gnu/libssl.so.1.1] ID 0xa
(00.000814) Collected [usr/local/lib/python3.9/lib-dynload/_ssl.cpython-39-x86_64-linux-gnu.so] ID 0xb
(00.000817) Collected [lib/x86_64-linux-gnu/libz.so.1.2.11] ID 0xc
(00.000819) Collected [usr/local/lib/python3.9/lib-dynload/binascii.cpython-39-x86_64-linux-gnu.so] ID 0xd
(00.000822) Collected [usr/local/lib/python3.9/lib-dynload/_struct.cpython-39-x86_64-linux-gnu.so] ID 0xe
(00.000825) Collected [usr/local/lib/python3.9/lib-dynload/_socket.cpython-39-x86_64-linux-gnu.so] ID 0xf
(00.000828) Collected [usr/local/lib/python3.9/lib-dynload/_sha512.cpython-39-x86_64-linux-gnu.so] ID 0x10
(00.000830) Collected [usr/local/lib/python3.9/lib-dynload/_random.cpython-39-x86_64-linux-gnu.so] ID 0x11
(00.000833) Collected [usr/local/lib/python3.9/lib-dynload/_bisect.cpython-39-x86_64-linux-gnu.so] ID 0x12
(00.000837) Collected [usr/local/lib/python3.9/lib-dynload/_heapq.cpython-39-x86_64-linux-gnu.so] ID 0x13
(00.000851) Collected [usr/local/lib/python3.9/lib-dynload/_datetime.cpython-39-x86_64-linux-gnu.so] ID 0x14
(00.000854) Collected [usr/local/lib/python3.9/lib-dynload/math.cpython-39-x86_64-linux-gnu.so] ID 0x15
(00.000856) Collected [usr/local/lib/python3.9/lib-dynload/_bz2.cpython-39-x86_64-linux-gnu.so] ID 0x16
(00.000860) Collected [usr/local/lib/python3.9/lib-dynload/array.cpython-39-x86_64-linux-gnu.so] ID 0x17
(00.000866) Collected [usr/local/lib/python3.9/lib-dynload/select.cpython-39-x86_64-linux-gnu.so] ID 0x18
(00.000868) Collected [usr/lib/locale/C.UTF-8/LC_CTYPE] ID 0x19
(00.000871) Collected [lib/x86_64-linux-gnu/libm-2.31.so] ID 0x1a
(00.000874) Collected [lib/x86_64-linux-gnu/libutil-2.31.so] ID 0x1b
(00.000878) Collected [lib/x86_64-linux-gnu/libdl-2.31.so] ID 0x1c
(00.000881) Collected [lib/x86_64-linux-gnu/libpthread-2.31.so] ID 0x1d
(00.000884) Collected [lib/x86_64-linux-gnu/libc-2.31.so] ID 0x1e
(00.000887) Collected [usr/local/lib/libpython3.9.so.1.0] ID 0x1f
(00.000890) Collected [usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache] ID 0x20
(00.000895) Collected [lib/x86_64-linux-gnu/ld-2.31.so] ID 0x21
(00.000898) Collected [dev/null] ID 0x22
(00.000901) Collected pipe entry ID 0x23 PIPE ID 0x13baf6f5
(00.000908) Found id pipe:[331020021] (fd 2) in inherit fd list
(00.000912) Collected pipe entry ID 0x24 PIPE ID 0x13baf6f6
(00.000915) Found id pipe:[331020022] (fd 4) in inherit fd list
(00.000924) Collected [app] ID 0x26
(00.000927) Collected [.] ID 0x27
(00.000929)  `- ... done
(00.000931) Collecting 46/68 (flags 0)
(00.000936) No remap-fpath.img image
(00.000938)  `- ... done
(00.000942) No apparmor.img image
(00.001014) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001021) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001025) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001029) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001033) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001038) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001047) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001051) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001056) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001061) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001065) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001069) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001073) cg: rewriting kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podba4188d6_9938_4bd4_ad08_add4f7ff08a8.slice/crio-b011183a6eacc15dbc967b90d67a5162abb9684d9def56abf453e1d8ad979e89.scope to /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001076) cg: Preparing cgroups yard (cgroups restore mode 0x4)
(00.001217) cg: Opening .criu.cgyard.PUkQtf as cg yard
(00.001229) cg:         Making controller dir .criu.cgyard.PUkQtf/freezer (freezer)
(00.001517) cg: Determined cgroup dir freezer//kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope already exist
(00.001527) cg: Skip restoring properties on cgroup dir freezer//kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod14358ee2_24b5_4aab_8041_1299825fa0fb.slice/crio-e05ab31bd1854bef4adfcaedd8713d584d8d9a7ea4751e026928b092258b359b.scope
(00.001563) cg:         Making controller dir .criu.cgyard.PUkQtf/misc (misc)
(00.001985) Error (criu/cgroup.c:1931): cg:     Can't mount controller dir .criu.cgyard.PUkQtf/misc: Invalid argument

@adrianreber
Member

That is not the log file during the checkpointing.

@HFourier

HFourier commented Apr 8, 2024

That is not the log file during the checkpointing.

How can I get it?

@tonyliu666
Author

Hello, @adrianreber. I have an issue here. Now I would like to checkpoint a postgresql pod as an experiment. Checkpointing the pod on a specific node works. But when I build an image from the tar file, push it to DockerHub, and then deploy a pod on a new node using that image, I get an error: Warning Failed 31m (x8 over 32m) kubelet Error: image not known. This is the event reported by `kubectl describe` for the checkpointed pod. Here is the log file:
dump.log
I followed all the checkpointing and restore steps in this link: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha. And this is the yaml file for the postgresql pod.
postgresql.txt
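
For reference, the image-build step described in that blog post looks roughly like this; the tar path and image names below are placeholders, not values taken from this issue:

newcontainer=$(buildah from scratch)
buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod>_<namespace>-<container>-<timestamp>.tar /
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container> $newcontainer
buildah commit $newcontainer checkpoint-image:latest
# push to a registry the destination node can reach, e.g.
buildah push localhost/checkpoint-image:latest docker.io/<user>/checkpoint-image:latest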

@tonyliu666 tonyliu666 reopened this May 26, 2024
@adrianreber
Member

That is a known regression in CRI-O. You need to manually pull the image the checkpoint is based on.

@tonyliu666
Author

What do you mean by manually pulling the image? I guess I should SSH into the worker node I want to migrate the pod to and use the `crictl pull` command to pull that image from DockerHub. But in my implementation I want the whole checkpoint and restore process to happen automatically (I wrote a checkpoint CRD), so I need `kubectl apply -f` to let CRI-O pull the image for me. Is there any alternative I can use to avoid this issue?

@adrianreber
Member

I am not aware of an alternative. As I said, that used to work but currently does not.

@tonyliu666
Author

Ok, got it. Thank you.

@tonyliu666
Author

Hello, @adrianreber. I have an issue here. Now I would like to checkpoint a postgresql pod as an experiment. Checkpointing the pod on a specific node works. But when I build an image from the tar file, push it to DockerHub, and then deploy a pod on a new node using that image, I get an error: Warning Failed 31m (x8 over 32m) kubelet Error: image not known. This is the event reported by `kubectl describe` for the checkpointed pod. Here is the log file: dump.log I followed all the checkpointing and restore steps in this link: https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha. And this is the yaml file for the postgresql pod. postgresql.txt

Hello @adrianreber, I would like to bring up the unresolved issue again. I've now found kubelet logs on the worker node related to this error. Here are the log snippets; checkpoint-new23:latest is the checkpointed image I manually pulled. The kubelet says that /var/lib/containers/storage/overlay/89655a67ca2d1e6cbd28a2cd61a8077cb443ab960d54209a00161fea9afa480d/merged/spec.dump doesn't exist. Could this be the reason for the CreateContainerError (image not found) rather than the known CRI-O regression? I also found that none of the folders under /var/lib/containers/storage/overlay contain a spec.dump file; is that normal?

Jul 07 09:15:38 workernode02 kubelet[941]: E0707 09:15:38.712076 941 kuberuntime_manager.go:1261] container &Container{Name:redis11,Image:localhost/checkpoint-new23:latest,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-q27l4,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:Always,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod red112_default(13d347c5-4ac8-41ee-878d-7d71cda142be): CreateContainerError: failed to read "spec.dump": open /var/lib/containers/storage/overlay/89655a67ca2d1e6cbd28a2cd61a8077cb443ab960d54209a00161fea9afa480d/merged/spec.dump: no such file or directory

@tonyliu666 tonyliu666 reopened this Jul 7, 2024
@adrianreber
Member

@tonyliu666 Can you push your checkpoint image to a registry? Then I can take a look.

@tonyliu666
Author

Yes, I just store this image in CRI-O's container storage. Here is the information.

root@workernode02:/home/vagrant# crictl images | grep localhost/checkpoint-new23
localhost/checkpoint-new23 latest baf197ce4885e 19.8MB
root@workernode02:/home/vagrant# buildah inspect localhost/checkpoint-new23
{
"Type": "buildah 0.0.1",
"FromImage": "localhost/checkpoint-new23:latest",
"FromImageID": "baf197ce4885e86ff544f19c859b5358cc2c2e6241c8616f90d7cc08268ca7c7",
"FromImageDigest": "sha256:d4a40e11e62fa95a9a0b4785ddf928b78789b6b3711de0b5b71e7e76c1ee40cd",
"Config": "{"created":"2024-07-07T15:38:22.281501707+08:00","architecture":"amd64","os":"linux","config":{},"rootfs":{"type":"layers","diff_ids":["sha256:89655a67ca2d1e6cbd28a2cd61a8077cb443ab960d54209a00161fea9afa480d"]},"history":[{"created":"2024-07-07T15:38:22.281501772+08:00","created_by":"cr-agent"}]}",
"Manifest": "{"schemaVersion":2,"mediaType":"applications/vnd.oci.image.manifest.v1+json","config":{"mediaType":"application/vnd.oci.image.config.v1+json","digest":"sha256:baf197ce4885e86ff544f19c859b5358cc2c2e6241c8616f90d7cc08268ca7c7","size":297},"layers":[{"mediaType":"application/vnd.oci.image.layer.v1.tar+gzip","digest":"sha256:ae0e5dd26e44a48a6eed51f70885151025501ea3b87049cda261aee14c50ae08","size":4595852}],"annotations":{"io.kubernetes.cri-o.annotations.checkpoint.name":"flask-server","org.opencontainers.image.base.digest":"","org.opencontainers.image.base.name":""}}",
"Container": "",
"ContainerID": "",
"MountPoint": "",
"ProcessLabel": "",
"MountLabel": "",
"ImageAnnotations": {
"io.kubernetes.cri-o.annotations.checkpoint.name": "flask-server",
"org.opencontainers.image.base.digest": "",
"org.opencontainers.image.base.name": ""
},
"ImageCreatedBy": "",
"OCIv1": {
"created": "2024-07-07T15:38:22.281501707+08:00",
"architecture": "amd64",
"os": "linux",
"config": {},
"rootfs": {
"type": "layers",
"diff_ids": [
"sha256:89655a67ca2d1e6cbd28a2cd61a8077cb443ab960d54209a00161fea9afa480d"
]
},
"history": [
{
"created": "2024-07-07T15:38:22.281501772+08:00",
"created_by": "cr-agent"
}
]
},
"Docker": {
"created": "2024-07-07T15:38:22.281501707+08:00",
"container_config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": null,
"Cmd": null,
"Image": "",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": null,
"OnBuild": null,
"Labels": null
},
"config": {
"Hostname": "",
"Domainname": "",
"User": "",
"AttachStdin": false,
"AttachStdout": false,
"AttachStderr": false,
"Tty": false,
"OpenStdin": false,
"StdinOnce": false,
"Env": null,
"Cmd": null,
"Image": "",
"Volumes": null,
"WorkingDir": "",
"Entrypoint": null,
"OnBuild": null,
"Labels": null
},
"architecture": "amd64",
"os": "linux",
"rootfs": {
"type": "layers",
"diff_ids": [
"sha256:89655a67ca2d1e6cbd28a2cd61a8077cb443ab960d54209a00161fea9afa480d"
]
},
"history": [
{
"created": "2024-07-07T15:38:22.281501772+08:00",
"created_by": "cr-agent"
}
]
},
"DefaultMountsFilePath": "",
"Isolation": "oci",
"NamespaceOptions": [
{
"Name": "cgroup",
"Host": true,
"Path": ""
},
{
"Name": "ipc",
"Host": false,
"Path": ""
},
{
"Name": "mount",
"Host": false,
"Path": ""
},
{
"Name": "network",
"Host": false,
"Path": ""
},
{
"Name": "pid",
"Host": false,
"Path": ""
},
{
"Name": "user",
"Host": true,
"Path": ""
},
{
"Name": "uts",
"Host": false,
"Path": ""
}
],
"Capabilities": null,
"ConfigureNetwork": "NetworkDefault",
"CNIPluginPath": "",
"CNIConfigDir": "",
"IDMappingOptions": {
"HostUIDMapping": true,
"HostGIDMapping": true,
"UIDMap": [],
"GIDMap": []
},
"History": [
{
"created": "2024-07-07T15:38:22.281501772+08:00",
"created_by": "cr-agent"
}
],
"Devices": null
}

@adrianreber
Member

Can you push the image to a public registry? I would like to download it and inspect it locally.

@tonyliu666
Author

Sure, here is the link for the image, https://hub.docker.com/r/tonyliu666/checkpoint-new23.

@rst0git
Member

rst0git commented Jul 8, 2024

@tonyliu666 The checkpoint seems to be created correctly and it does contain spec.dump. Would it be possible to provide more information about the error message from CRI-O?

$ sudo tar tvf cr.tar 
-rw------- 0/0            1120 2024-07-07 08:03 bind.mounts
drw------- 0/0               0 2024-07-07 08:03 checkpoint/
-rw-r--r-- 0/0            8804 2024-07-07 08:03 checkpoint/cgroup.img
-rw-r--r-- 0/0            1986 2024-07-07 08:03 checkpoint/core-1.img
-rw-r--r-- 0/0             718 2024-07-07 08:03 checkpoint/core-17.img
-rw-r--r-- 0/0             712 2024-07-07 08:03 checkpoint/core-18.img
-rw-r--r-- 0/0             717 2024-07-07 08:03 checkpoint/core-19.img
-rw-r--r-- 0/0             712 2024-07-07 08:03 checkpoint/core-20.img
-rw------- 0/0              47 2024-07-07 08:03 checkpoint/descriptors.json
-rw-r--r-- 0/0             104 2024-07-07 08:03 checkpoint/fdinfo-2.img
-rw-r--r-- 0/0            1228 2024-07-07 08:03 checkpoint/files.img
-rw-r--r-- 0/0              18 2024-07-07 08:03 checkpoint/fs-1.img
-rw-r--r-- 0/0              36 2024-07-07 08:03 checkpoint/ids-1.img
-rw-r--r-- 0/0              46 2024-07-07 08:03 checkpoint/inventory.img
-rw-r--r-- 0/0              82 2024-07-07 08:03 checkpoint/ipcns-var-11.img
-rw-r--r-- 0/0            1858 2024-07-07 08:03 checkpoint/mm-1.img
-rw-r--r-- 0/0            6847 2024-07-07 08:03 checkpoint/mountpoints-13.img
-rw-r--r-- 0/0              26 2024-07-07 08:03 checkpoint/netns-10.img
-rw-r--r-- 0/0            1198 2024-07-07 08:03 checkpoint/pagemap-1.img
-rw-r--r-- 0/0         2695168 2024-07-07 08:03 checkpoint/pages-1.img
-rw-r--r-- 0/0              22 2024-07-07 08:03 checkpoint/pipes-data.img
-rw-r--r-- 0/0              30 2024-07-07 08:03 checkpoint/pstree.img
-rw-r--r-- 0/0              12 2024-07-07 08:03 checkpoint/seccomp.img
-rw-r--r-- 0/0              36 2024-07-07 08:03 checkpoint/timens-0.img
-rw-r--r-- 0/0             409 2024-07-07 08:03 checkpoint/tmpfs-dev-124.tar.gz.img
-rw-r--r-- 0/0             367 2024-07-07 08:03 checkpoint/tmpfs-dev-126.tar.gz.img
-rw-r--r-- 0/0              98 2024-07-07 08:03 checkpoint/tmpfs-dev-127.tar.gz.img
-rw-r--r-- 0/0              98 2024-07-07 08:03 checkpoint/tmpfs-dev-128.tar.gz.img
-rw-r--r-- 0/0              98 2024-07-07 08:03 checkpoint/tmpfs-dev-129.tar.gz.img
-rw-r--r-- 0/0              29 2024-07-07 08:03 checkpoint/utsns-12.img
-rw------- 0/0             608 2024-07-07 08:03 config.dump
-rw------- 0/0          126603 2024-07-07 08:03 dump.log
-rw-r--r-- 0/0             913 2024-07-07 08:03 io.kubernetes.cri-o.LogPath
-rw-r--r-- 0/0            3584 2024-07-07 08:03 rootfs-diff.tar
-rw------- 0/0           11385 2024-07-07 08:03 spec.dump
-rw-r--r-- 0/0              49 2024-07-07 08:03 stats-dump
$ sudo checkpointctl show cr.tar 

Displaying container checkpoint data from cr.tar

+-----------+--------------------------------+--------------+---------+--------------------------------+--------+-------------+------------+-------------------+
| CONTAINER |             IMAGE              |      ID      | RUNTIME |            CREATED             | ENGINE |     IP      | CHKPT SIZE | ROOT FS DIFF SIZE |
+-----------+--------------------------------+--------------+---------+--------------------------------+--------+-------------+------------+-------------------+
| redis11   | docker.io/library/redis:latest | 5f409fab798b | runc    | 2024-07-07T07:01:02.580221697Z | CRI-O  | 10.85.0.156 | 2.6 MiB    | 3.5 KiB           |
+-----------+--------------------------------+--------------+---------+--------------------------------+--------+-------------+------------+-------------------+

$ sudo checkpointctl inspect --ps-tree-cmd cr.tar 

Displaying container checkpoint tree view from cr.tar

redis11
├── Image: docker.io/library/redis:latest
├── ID: 5f409fab798bd0861238ef45b95c789af6068940a0ee2072bada50477aca7793
├── Runtime: runc
├── Created: 2024-07-07T07:01:02.580221697Z
├── Checkpointed: 2024-07-07T07:03:49Z
├── Engine: CRI-O
├── IP: 10.85.0.156
├── Checkpoint size: 2.6 MiB
│   └── Memory pages size: 2.6 MiB
├── Root FS diff size: 3.5 KiB
└── Process tree
    └── [1]  redis-server

@tonyliu666
Author

@rst0git, I dumped all the CRI-O logs on the node the pod is deployed on. Because GitHub only supports 25 MB file uploads, I can't attach the full log file here. I fetched the snippets of the logs related to this image.

Jul 08 15:06:08 workernode02 crio[319557]: time="2024-07-08 15:06:08.052512435Z" level=info msg="Resolving "tonyliu666/checkpoint-new23" using unqualified-search registries (/etc/containers/registries.conf)"
Jul 08 15:06:08 workernode02 crio[319557]: time="2024-07-08 15:06:08.053039871Z" level=info msg="Image tonyliu666/checkpoint-new23:latest not found" id=9cc053ad-615f-4b1f-a129-edf38d2c87b0 name=/runtime.v1.ImageService/ImageStatus
Jul 08 15:06:08 workernode02 crio[319557]: time="2024-07-08 15:06:08.054273102Z" level=info msg="Pulling image: tonyliu666/checkpoint-new23:latest" id=32bae553-a22a-461b-bdd0-747fc08152c5 name=/runtime.v1.ImageService/PullImage
Jul 08 15:06:08 workernode02 crio[319557]: time="2024-07-08 15:06:08.054491262Z" level=info msg="Resolving "tonyliu666/checkpoint-new23" using unqualified-search registries (/etc/containers/registries.conf)"
Jul 08 15:06:08 workernode02 crio[319557]: time="2024-07-08 15:06:08.061486071Z" level=info msg="Trying to access "docker.io/tonyliu666/checkpoint-new23:latest""
Jul 08 15:06:10 workernode02 crio[319557]: time="2024-07-08 15:06:10.177680386Z" level=info msg="Trying to access "docker.io/tonyliu666/checkpoint-new23:latest""
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.914871605Z" level=info msg="Pulled image: docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a" id=32bae553-a22a-461b-bdd0-747fc08152c5 name=/runtime.v1.ImageService/PullImage
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.917633818Z" level=info msg="Checking image status: tonyliu666/checkpoint-new23:latest" id=f124ad5a-07b4-40e8-a4ee-d71bdf29068f name=/runtime.v1.ImageService/ImageStatus
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.918103999Z" level=info msg="Resolving "tonyliu666/checkpoint-new23" using unqualified-search registries (/etc/containers/registries.conf)"
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.919255164Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:0b02ae272006543829f285a17ee2ed22b39d9c0eb4658405d4c31ab5390e6d2a,RepoTags:[docker.io/tonyliu666/checkpoint-new23:latest],RepoDigests:[docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a],Size_:2898715,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: redis11,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=f124ad5a-07b4-40e8-a4ee-d71bdf29068f name=/runtime.v1.ImageService/ImageStatus
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.920727567Z" level=info msg="Creating container: default/redis111/checkpoint" id=604184e2-4269-4d7f-a400-dd9fcc6f9881 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.920825241Z" level=info msg="Checking image status: docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a" id=604184e2-4269-4d7f-a400-dd9fcc6f9881 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.922669584Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:0b02ae272006543829f285a17ee2ed22b39d9c0eb4658405d4c31ab5390e6d2a,RepoTags:[docker.io/tonyliu666/checkpoint-new23:latest],RepoDigests:[docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a],Size_:2898715,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: redis11,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=604184e2-4269-4d7f-a400-dd9fcc6f9881 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.922779186Z" level=info msg="Checking image status: docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a" id=604184e2-4269-4d7f-a400-dd9fcc6f9881 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.924231115Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:0b02ae272006543829f285a17ee2ed22b39d9c0eb4658405d4c31ab5390e6d2a,RepoTags:[docker.io/tonyliu666/checkpoint-new23:latest],RepoDigests:[docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a],Size_:2898715,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: redis11,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=604184e2-4269-4d7f-a400-dd9fcc6f9881 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.926314090Z" level=warning msg="Allowed annotations are specified for workload []"
Jul 08 15:06:13 workernode02 crio[319557]: time="2024-07-08 15:06:13.958079748Z" level=info msg="Starting container: 86f4a4965add756398825ed203a34fad8917b52ca4bc8e441ca8d2c7edb44102" id=055532ff-3701-4984-9569-ba951d3c3d7f name=/runtime.v1.RuntimeService/StartContainer
Jul 08 15:06:14 workernode02 crio[319557]: time="2024-07-08 15:06:14.524586307Z" level=error msg="Container restore error: time="2024-07-08T15:06:14Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: /run/containers/storage/overlay-containers/86f4a4965add756398825ed203a34fad8917b52ca4bc8e441ca8d2c7edb44102/userdata/restore.log"\n" id=055532ff-3701-4984-9569-ba951d3c3d7f name=/runtime.v1.RuntimeService/StartContainer
Jul 08 15:06:15 workernode02 crio[319557]: time="2024-07-08 15:06:15.568311496Z" level=info msg="Checking image status: tonyliu666/checkpoint-new23:latest" id=d23e05e9-efcf-44c8-a966-ea06d400ec17 name=/runtime.v1.ImageService/ImageStatus
Jul 08 15:06:15 workernode02 crio[319557]: time="2024-07-08 15:06:15.568722418Z" level=info msg="Resolving "tonyliu666/checkpoint-new23" using unqualified-search registries (/etc/containers/registries.conf)"
Jul 08 15:06:15 workernode02 crio[319557]: time="2024-07-08 15:06:15.568940280Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:0b02ae272006543829f285a17ee2ed22b39d9c0eb4658405d4c31ab5390e6d2a,RepoTags:[docker.io/tonyliu666/checkpoint-new23:latest],RepoDigests:[docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a],Size_:2898715,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: redis11,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=d23e05e9-efcf-44c8-a966-ea06d400ec17 name=/runtime.v1.ImageService/ImageStatus
Jul 08 15:06:15 workernode02 crio[319557]: time="2024-07-08 15:06:15.570135396Z" level=info msg="Pulling image: tonyliu666/checkpoint-new23:latest" id=54526984-189a-4c6f-abd4-fcea13bbc485 name=/runtime.v1.ImageService/PullImage
Jul 08 15:06:15 workernode02 crio[319557]: time="2024-07-08 15:06:15.570408187Z" level=info msg="Resolving "tonyliu666/checkpoint-new23" using unqualified-search registries (/etc/containers/registries.conf)"
Jul 08 15:06:15 workernode02 crio[319557]: time="2024-07-08 15:06:15.573516550Z" level=info msg="Trying to access "docker.io/tonyliu666/checkpoint-new23:latest""
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.449083344Z" level=info msg="Pulled image: docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a" id=54526984-189a-4c6f-abd4-fcea13bbc485 name=/runtime.v1.ImageService/PullImage
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.450500766Z" level=info msg="Checking image status: tonyliu666/checkpoint-new23:latest" id=5085aa36-f8d3-438d-9c31-3288345ce1ef name=/runtime.v1.ImageService/ImageStatus
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.450885422Z" level=info msg="Resolving "tonyliu666/checkpoint-new23" using unqualified-search registries (/etc/containers/registries.conf)"
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.451079794Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:0b02ae272006543829f285a17ee2ed22b39d9c0eb4658405d4c31ab5390e6d2a,RepoTags:[docker.io/tonyliu666/checkpoint-new23:latest],RepoDigests:[docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a],Size_:2898715,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: redis11,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=5085aa36-f8d3-438d-9c31-3288345ce1ef name=/runtime.v1.ImageService/ImageStatus
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.452736392Z" level=info msg="Creating container: default/redis111/checkpoint" id=23b45300-0eb7-423c-af4f-b0800108b267 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.452829396Z" level=info msg="Checking image status: docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a" id=23b45300-0eb7-423c-af4f-b0800108b267 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.453305517Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:0b02ae272006543829f285a17ee2ed22b39d9c0eb4658405d4c31ab5390e6d2a,RepoTags:[docker.io/tonyliu666/checkpoint-new23:latest],RepoDigests:[docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a],Size_:2898715,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: redis11,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=23b45300-0eb7-423c-af4f-b0800108b267 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.453438083Z" level=info msg="Checking image status: docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a" id=23b45300-0eb7-423c-af4f-b0800108b267 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.454044281Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:0b02ae272006543829f285a17ee2ed22b39d9c0eb4658405d4c31ab5390e6d2a,RepoTags:[docker.io/tonyliu666/checkpoint-new23:latest],RepoDigests:[docker.io/tonyliu666/checkpoint-new23@sha256:67f71fbbcceb6915897d8cd8101020f3dbb9bb69261902e8b3ee6753832c005a],Size_:2898715,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: redis11,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=23b45300-0eb7-423c-af4f-b0800108b267 name=/runtime.v1.RuntimeService/CreateContainer
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.456468335Z" level=warning msg="Allowed annotations are specified for workload []"
Jul 08 15:06:17 workernode02 crio[319557]: time="2024-07-08 15:06:17.491280765Z" level=info msg="Starting container: 7bfbf302257deafbf57b1f91f4e00812df993bd858bea8b727ab0b50e340873e" id=75cd83fe-e780-41ac-8dc3-7cc852ec4ebe name=/runtime.v1.RuntimeService/StartContainer
Jul 08 15:06:18 workernode02 crio[319557]: time="2024-07-08 15:06:18.038462965Z" level=error msg="Container restore error: time="2024-07-08T15:06:18Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: /run/containers/storage/overlay-containers/7bfbf302257deafbf57b1f91f4e00812df993bd858bea8b727ab0b50e340873e/userdata/restore.log"\n" id=75cd83fe-e780-41ac-8dc3-7cc852ec4ebe name=/runtime.v1.RuntimeService/StartContainer
Jul 08 15:06:18 workernode02 crio[319557]: time="2024-07-08 15:06:18.197397538Z" level=info msg="Stopping pod sandbox: 5a446d8a454e9dd8d6648cde921ff9fc4663763ccc60853cda5eef32b8b26bf5" id=87499851-b864-49ac-888d-612587c7c1f9 name=/runtime.v1.RuntimeService/StopPodSandbox
Jul 08 15:06:18 workernode02 crio[319557]: time="2024-07-08 15:06:18.197587517Z" level=info msg="Stopped pod sandbox (already stopped): 5a446d8a454e9dd8d6648cde921ff9fc4663763ccc60853cda5eef32b8b26bf5" id=87499851-b864-49ac-888d-612587c7c1f9 name=/runtime.v1.RuntimeService/StopPodSandbox
Jul 08 15:06:18 workernode02 crio[319557]: time="2024-07-08 15:06:18.199890105Z" level=info msg="Removing pod sandbox: 5a446d8a454e9dd8d6648cde921ff9fc4663763ccc60853cda5eef32b8b26bf5" id=b3ba5444-2e79-415d-b0f8-4ca4080ae02b name=/runtime.v1.RuntimeService/RemovePodSandbox
Jul 08 15:06:18 workernode02 crio[319557]: time="2024-07-08 15:06:18.218614374Z" level=info msg="Removed pod sandbox: 5a446d8a454e9dd8d6648cde921ff9fc4663763ccc60853cda5eef32b8b26bf5" id=b3ba5444-2e79-415d-b0f8-4ca4080ae02b name=/runtime.v1.RuntimeService/RemovePodSandbox
Jul 08 15:06:18 workernode02 crio[319557]: time="2024-07-08 15:06:18.593979430Z" level=info msg="Checking image status: tonyliu666/checkpoint-new23:latest" id=3e5ffcf7-40c5-4792-85b6-95760d6f2dde name=/runtime.v1.ImageService/ImageStatus
Jul 08 15:06:18 workernode02 crio[319557]: time="2024-07-08 15:06:18.594340089Z" level=info msg="Resolving "tonyliu666/checkpoint-new23" using unqualified-search registries (/etc/containers/registries.conf)"

@adrianreber
Member

This is something different, I think. Now it says that CRIU failed and that more information can be found in
/run/containers/storage/overlay-containers/86f4a4965add756398825ed203a34fad8917b52ca4bc8e441ca8d2c7edb44102/userdata/restore.log
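
If that file exists on the node, something along these lines should show the failing CRIU step (the container ID here is copied from the error message above):

sudo cat /run/containers/storage/overlay-containers/86f4a4965add756398825ed203a34fad8917b52ca4bc8e441ca8d2c7edb44102/userdata/restore.log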

@tonyliu666
Author

tonyliu666 commented Jul 9, 2024

That tonyliu666/checkpoint-new23 image causes the runtime error in CRI-O. I also tested another pod running a Python Flask app, and found it in the CreateContainerError state, reporting an error like:

Warning Failed 20s (x8 over 4m53s) kubelet Error: failed to read "spec.dump": open /var/lib/containers/storage/overlay/89655a67ca2d1e6cbd28a2cd61a8077cb443ab960d54209a00161fea9afa480d/merged/spec.dump: no such file or directory

And the CRI-O logs look like this:

Jul 09 05:02:37 workernode02 crio[319557]: time="2024-07-09 05:02:37.706799140Z" level=info msg="Trying to access "docker.io/tonyliu666/flask:latest""
Jul 09 05:03:16 workernode02 crio[319557]: time="2024-07-09 05:03:16.924001963Z" level=info msg="Pulled image: docker.io/tonyliu666/flask@sha256:bc4855b7f194a6955dc8aac1d9fa6a154480482636d495c325ff9aff097afe84" id=f8684d52-fd4e-41b3-b884-dc08bb1da8b3 name=/runtime.v1.ImageService/PullImage
Jul 09 05:03:16 workernode02 crio[319557]: time="2024-07-09 05:03:16.927264788Z" level=info msg="Checking image status: docker.io/tonyliu666/flask:latest" id=9c7f2c9b-04c6-45d8-883b-b338e5177b5e name=/runtime.v1.ImageService/ImageStatus
Jul 09 05:03:16 workernode02 crio[319557]: time="2024-07-09 05:03:16.927675657Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:8b7df904db5cdefac7a994fb394e580a614cc5f5f9c3a2b4cb450c61aa31dc26,RepoTags:[docker.io/tonyliu666/flask:latest],RepoDigests:[docker.io/tonyliu666/flask@sha256:bc4855b7f194a6955dc8aac1d9fa6a154480482636d495c325ff9aff097afe84],Size_:19805842,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: flask,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=9c7f2c9b-04c6-45d8-883b-b338e5177b5e name=/runtime.v1.ImageService/ImageStatus
Jul 09 05:03:16 workernode02 crio[319557]: time="2024-07-09 05:03:16.929996093Z" level=info msg="Creating container: default/flask-new/flask" id=34897eea-8972-4df0-9421-07ca659f3222 name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 05:03:16 workernode02 crio[319557]: time="2024-07-09 05:03:16.930132623Z" level=info msg="Checking image status: docker.io/tonyliu666/flask@sha256:bc4855b7f194a6955dc8aac1d9fa6a154480482636d495c325ff9aff097afe84" id=34897eea-8972-4df0-9421-07ca659f3222 name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 05:03:16 workernode02 crio[319557]: time="2024-07-09 05:03:16.930658559Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:8b7df904db5cdefac7a994fb394e580a614cc5f5f9c3a2b4cb450c61aa31dc26,RepoTags:[docker.io/tonyliu666/flask:latest],RepoDigests:[docker.io/tonyliu666/flask@sha256:bc4855b7f194a6955dc8aac1d9fa6a154480482636d495c325ff9aff097afe84],Size_:19805842,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: flask,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=34897eea-8972-4df0-9421-07ca659f3222 name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 05:03:16 workernode02 crio[319557]: time="2024-07-09 05:03:16.930798765Z" level=info msg="Checking image status: docker.io/tonyliu666/flask@sha256:bc4855b7f194a6955dc8aac1d9fa6a154480482636d495c325ff9aff097afe84" id=34897eea-8972-4df0-9421-07ca659f3222 name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 05:03:16 workernode02 crio[319557]: time="2024-07-09 05:03:16.931156192Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:8b7df904db5cdefac7a994fb394e580a614cc5f5f9c3a2b4cb450c61aa31dc26,RepoTags:[docker.io/tonyliu666/flask:latest],RepoDigests:[docker.io/tonyliu666/flask@sha256:bc4855b7f194a6955dc8aac1d9fa6a154480482636d495c325ff9aff097afe84],Size_:19805842,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: flask,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=34897eea-8972-4df0-9421-07ca659f3222 name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 05:03:30 workernode02 crio[319557]: time="2024-07-09 05:03:30.695058003Z" level=info msg="Checking image status: docker.io/tonyliu666/flask:latest" id=c26bd5f9-6ff2-4215-8544-421c404ffa87 name=/runtime.v1.ImageService/ImageStatus

I've also pushed the Python Flask image to Docker Hub: https://hub.docker.com/repository/docker/tonyliu666/flask.

@adrianreber
Member

The image docker.io/tonyliu666/flask:latest is in the wrong format. That is why it does not work. It does not contain any of the expected files but it has the checkpoint annotation.
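One way to double-check the layout of a checkpoint image (a sketch; the layer digest below is a placeholder) is to copy it into a local OCI directory and look inside its layer. A correctly built checkpoint image should contain spec.dump, config.dump, bind.mounts, rootfs-diff.tar and a checkpoint/ directory in its single layer:

skopeo copy docker://docker.io/tonyliu666/flask:latest oci:flask-ckpt:latest
# the layer digest is referenced from flask-ckpt/index.json and the image manifest
tar tf flask-ckpt/blobs/sha256/<layer-digest> | head
checkpointctl show flask-ckpt/blobs/sha256/<layer-digest>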

@tonyliu666
Author

tonyliu666 commented Jul 9, 2024

How about this one, https://hub.docker.com/repository/docker/tonyliu666/httpd/general ? I checked it with the checkpointctl command and I think it contains the files we want, but I still got the CreateContainerError. The output of the checkpointctl command:

../checkpointctl show blobs/sha256/ac85816af204246b50a457ea5f7787ee86163205f70304493e234b3c6ad106bc

Displaying container checkpoint data from blobs/sha256/ac85816af204246b50a457ea5f7787ee86163205f70304493e234b3c6ad106bc

+-----------+--------------------------------+--------------+---------+--------------------------------+--------+-------------+------------+-------------------+
| CONTAINER | IMAGE | ID | RUNTIME | CREATED | ENGINE | IP | CHKPT SIZE | ROOT FS DIFF SIZE |
+-----------+--------------------------------+--------------+---------+--------------------------------+--------+-------------+------------+-------------------+
| httpd | docker.io/library/httpd:latest | 15ec9f8e3e1c | runc | 2024-07-09T08:00:04.884839469Z | CRI-O | 10.85.0.172 | 5.2 MiB | 3.5 KiB |
+-----------+--------------------------------+--------------+---------+--------------------------------+--------+-------------+------------+-------------------+

and I extracted the files from the tar file:

tar xvf blobs/sha256/ac85816af204246b50a457ea5f7787ee86163205f70304493e234b3c6ad106bc -C inspect
stats-dump
dump.log
checkpoint/
checkpoint/cgroup.img
checkpoint/core-1.img
checkpoint/core-10.img
checkpoint/core-13.img
checkpoint/core-14.img
checkpoint/core-16.img
checkpoint/core-17.img
checkpoint/core-18.img
checkpoint/core-19.img
checkpoint/core-20.img
checkpoint/core-21.img
checkpoint/core-22.img
checkpoint/core-23.img
checkpoint/core-24.img
checkpoint/core-25.img
checkpoint/core-26.img
checkpoint/core-27.img
checkpoint/core-28.img
checkpoint/core-29.img
checkpoint/core-30.img
checkpoint/core-31.img
checkpoint/core-32.img
checkpoint/core-33.img
checkpoint/core-34.img
checkpoint/core-35.img
checkpoint/core-36.img
checkpoint/core-37.img
checkpoint/core-38.img
checkpoint/core-39.img
checkpoint/core-40.img
checkpoint/core-41.img
checkpoint/core-42.img
checkpoint/core-43.img
checkpoint/core-44.img
checkpoint/core-45.img
checkpoint/core-46.img
checkpoint/core-47.img
checkpoint/core-48.img
checkpoint/core-49.img
checkpoint/core-50.img
checkpoint/core-51.img
checkpoint/core-52.img
checkpoint/core-53.img
checkpoint/core-54.img
checkpoint/core-55.img
checkpoint/core-56.img
checkpoint/core-57.img
checkpoint/core-58.img
checkpoint/core-59.img
checkpoint/core-60.img
checkpoint/core-61.img
checkpoint/core-62.img
checkpoint/core-63.img
checkpoint/core-64.img
checkpoint/core-65.img
checkpoint/core-66.img
checkpoint/core-67.img
checkpoint/core-68.img
checkpoint/core-69.img
checkpoint/core-70.img
checkpoint/core-71.img
checkpoint/core-72.img
checkpoint/core-73.img
checkpoint/core-74.img
checkpoint/core-75.img
checkpoint/core-76.img
checkpoint/core-77.img
checkpoint/core-78.img
checkpoint/core-79.img
checkpoint/core-8.img
checkpoint/core-80.img
checkpoint/core-81.img
checkpoint/core-82.img
checkpoint/core-83.img
checkpoint/core-84.img
checkpoint/core-85.img
checkpoint/core-86.img
checkpoint/core-87.img
checkpoint/core-88.img
checkpoint/core-89.img
checkpoint/core-9.img
checkpoint/core-90.img
checkpoint/core-91.img
checkpoint/descriptors.json
checkpoint/fdinfo-2.img
checkpoint/fdinfo-3.img
checkpoint/fdinfo-4.img
checkpoint/fdinfo-5.img
checkpoint/files.img
checkpoint/fs-1.img
checkpoint/fs-10.img
checkpoint/fs-8.img
checkpoint/fs-9.img
checkpoint/ids-1.img
checkpoint/ids-10.img
checkpoint/ids-8.img
checkpoint/ids-9.img
checkpoint/inventory.img
checkpoint/ipcns-var-11.img
checkpoint/memfd.img
checkpoint/mm-1.img
checkpoint/mm-10.img
checkpoint/mm-8.img
checkpoint/mm-9.img
checkpoint/mountpoints-13.img
checkpoint/netns-10.img
checkpoint/pagemap-1.img
checkpoint/pagemap-10.img
checkpoint/pagemap-8.img
checkpoint/pagemap-9.img
checkpoint/pagemap-shmem-1391.img
checkpoint/pages-1.img
checkpoint/pages-2.img
checkpoint/pages-3.img
checkpoint/pages-4.img
checkpoint/pages-5.img
checkpoint/pipes-data.img
checkpoint/pstree.img
checkpoint/seccomp.img
checkpoint/timens-0.img
checkpoint/tmpfs-dev-130.tar.gz.img
checkpoint/tmpfs-dev-132.tar.gz.img
checkpoint/tmpfs-dev-133.tar.gz.img
checkpoint/tmpfs-dev-134.tar.gz.img
checkpoint/tmpfs-dev-135.tar.gz.img
checkpoint/utsns-12.img
config.dump
spec.dump
bind.mounts
rootfs-diff.tar
io.kubernetes.cri-o.LogPath

The crio logs look like this:

Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.697917696Z" level=info msg="Pulling image: localhost/httpd:latest" id=a8ffea68-562c-403f-bd5d-0bb69199dfe9 name=/runtime.v1.ImageService/PullImage
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.699088948Z" level=info msg="Trying to access "localhost/httpd:latest""
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.701814805Z" level=info msg="Pulled image: localhost/httpd@sha256:656c3e2f0fe7e98fb38cbec52874608a81ac26c03fd6335f4f8456ca103c8a29" id=a8ffea68-562c-403f-bd5d-0bb69199dfe9 name=/runtime.v1.ImageService/PullImage
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.703594889Z" level=info msg="Checking image status: localhost/httpd:latest" id=c3cdcab6-0935-492d-9c0b-01a2aecfed4d name=/runtime.v1.ImageService/ImageStatus
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.703975888Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:fc0de49c0a8635ef4e29e07c383a0b59e86fa20ad56153ee01c02ccea3dec12b,RepoTags:[localhost/httpd:latest],RepoDigests:[localhost/httpd@sha256:656c3e2f0fe7e98fb38cbec52874608a81ac26c03fd6335f4f8456ca103c8a29],Size_:6475661,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: httpd,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=c3cdcab6-0935-492d-9c0b-01a2aecfed4d name=/runtime.v1.ImageService/ImageStatus
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.705183405Z" level=info msg="Creating container: default/httpd2/httpd2" id=5f94a6da-922f-4257-a606-cc325a27c3eb name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.705254212Z" level=info msg="Checking image status: localhost/httpd@sha256:656c3e2f0fe7e98fb38cbec52874608a81ac26c03fd6335f4f8456ca103c8a29" id=5f94a6da-922f-4257-a606-cc325a27c3eb name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.705681681Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:fc0de49c0a8635ef4e29e07c383a0b59e86fa20ad56153ee01c02ccea3dec12b,RepoTags:[localhost/httpd:latest],RepoDigests:[localhost/httpd@sha256:656c3e2f0fe7e98fb38cbec52874608a81ac26c03fd6335f4f8456ca103c8a29],Size_:6475661,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: httpd,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=5f94a6da-922f-4257-a606-cc325a27c3eb name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.705751442Z" level=info msg="Checking image status: localhost/httpd@sha256:656c3e2f0fe7e98fb38cbec52874608a81ac26c03fd6335f4f8456ca103c8a29" id=5f94a6da-922f-4257-a606-cc325a27c3eb name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.705964041Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:fc0de49c0a8635ef4e29e07c383a0b59e86fa20ad56153ee01c02ccea3dec12b,RepoTags:[localhost/httpd:latest],RepoDigests:[localhost/httpd@sha256:656c3e2f0fe7e98fb38cbec52874608a81ac26c03fd6335f4f8456ca103c8a29],Size_:6475661,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: httpd,org.opencontainers.image.base.digest: ,org.opencontainers.image.base.name: ,},UserSpecifiedImage:,},Pinned:false,},Info:map[string]string{},}" id=5f94a6da-922f-4257-a606-cc325a27c3eb name=/runtime.v1.RuntimeService/CreateContainer
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.710706667Z" level=warning msg="Allowed annotations are specified for workload []"
Jul 09 08:27:24 workernode02 crio[319557]: time="2024-07-09 08:27:24.711164367Z" level=info msg="RestoreCtr: releasing container name k8s_httpd2_httpd2_default_22141850-3c5f-4ce6-a54c-44e87de7ca01_0" id=5f94a6da-922f-4257-a606-cc325a27c3eb name=/runtime.v1.RuntimeService/CreateContainer

and the logs from the kubelet:

Jul 09 08:39:20 workernode02 kubelet[941]: E0709 08:39:20.342974 941 kuberuntime_manager.go:1261] container &Container{Name:httpd3,Image:tonyliu666/httpd:latest,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-wm47m,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:Always,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod httpd3_default(e88e54c1-7621-4d0d-86b9-6f058a5514e3): CreateContainerError: image not known
Jul 09 08:39:20 workernode02 kubelet[941]: E0709 08:39:20.343028 941 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to "StartContainer" for "httpd3" with CreateContainerError: "image not known"" pod="default/httpd3" podUID="e88e54c1-7621-4d0d-86b9-6f058a5514e3"
Jul 09 08:39:25 workernode02 kubelet[941]: E0709 08:39:25.721851 941 remote_runtime.go:319] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = image not known" podSandboxID="6211ec5e9e4af4825250a8c1d6d57aef20ecb7ed64b95beee794316e5c7cf1c1"

@adrianreber
Member

I don't know how you created those images, but if the content is in the blobs/ directory the image will not work.

@tonyliu666
Author

tonyliu666 commented Jul 9, 2024

Originally, I used Buildah to build the images, but some pods (like postgresql) failed to run from the checkpointed image with CreateContainerError. So I built the image programmatically (following your other post, opencontainers/image-spec#962), but got the same result. Now I have built the image with Buildah again and successfully run the new pod on the other node. Here is the link for the image: https://hub.docker.com/repository/docker/tonyliu666/nginx/general. Then I extracted the image into a folder; the folder structure looks like:

|-- blobs
|   `-- sha256
|       |-- 749f8082a65fe5b65a710ae82fa5637ce36abd57a4ab61efdcd8243a050b3436
|       |-- 9101d846639cb86d15990964689066c8ff8eba3698f030c48cb4b85d17d7a109
|       `-- c5ad2331ad8042971d519e1e960df843c91ef522d8a0cf7d08670f63e87494c8
|-- index.json
`-- oci-layout

The checkpointed content is in one of the subfolders of the blobs/ directory. I would like to confirm where the checkpointed contents should be located, because my program produces the same structure.
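For reference, a minimal sketch of the buildah sequence that produces a working layout (paths and names below are placeholders): the checkpoint tar is added to the root of a scratch image, so its files end up inside the image's single layer blob rather than as separate entries under blobs/:

newcontainer=$(buildah from scratch)
buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-<pod>_<namespace>-<container>-<timestamp>.tar /
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=<container> $newcontainer
buildah commit $newcontainer checkpoint-image:latest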

@tonyliu666
Author

@adrianreber I reinstalled the k8s cluster with version v1.30.0, crio version 1.30.3, criu version 3.16.1. Then I tried to do the same thing (using the buildah command to build the image) but got a RunContainerError when deploying the new pod on the same node as the original pod. The errors are:

Warning Failed 37m kubelet Error: failed to restore container httpd: container restore failed: time="2024-07-12T07:05:08Z" level=error msg="criu failed: type NOTIFY errno 0\nlog file: /run/containers/storage/overlay-containers/httpd/userdata/restore.log"

The image link is here: https://hub.docker.com/r/tonyliu666/checkpoint-image/tags

@adrianreber
Member

We need to see the content of /run/containers/storage/overlay-containers/httpd/userdata/restore.log.

@tonyliu666
Author

I cd'd into the /run/containers/storage/overlay-containers/httpd/userdata folder, but I can't find the restore.log:

ls d58e7ec2ed4fd44fb1c3a621bf702e8f85ab56145de039b15e7b9f60ca918bfd/userdata
attach config.json conmon-pidfile ctl pidfile winsz

@rst0git
Member

rst0git commented Jul 12, 2024

I can't find the restore.log

As Adrian mentioned above, the missing restore.log is a known problem (cri-o/cri-o#7904).

You can use echo "log-file /tmp/criu.log" >> /etc/criu/runc.conf as a workaround for this problem to overwrite the path of the log file to /tmp/criu.log.

@tonyliu666
Author

@rst0git Thanks for your reminder. Here is the log file:
criu.log

@adrianreber
Member

Looks like you are using Ubuntu's broken 3.16.1 CRIU. Please use a newer version of CRIU.
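For example, to check which CRIU binary is being picked up and to build a newer release from source (a sketch; it assumes the usual build dependencies such as the protobuf-c, libnl-3 and libcap development packages are installed):

criu --version
which criu
git clone https://github.com/checkpoint-restore/criu.git
cd criu && git checkout v3.19
make
sudo make install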

@tonyliu666
Author

tonyliu666 commented Jul 12, 2024

I changed to a newer version of criu, and deploying a new checkpointed pod on the original node now succeeds. But deploying the new pod to the other node still fails; the error is still CreateContainerError with an "image not known" log.
The crio logs:

Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.157110310Z" level=info msg="Checking image status: tonyliu666/httpd:latest" id=04635601-626c-445d-b5de-f99cf3915899 name=/runtime.v1.ImageService/ImageStatus
Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.157355097Z" level=info msg="Resolving "tonyliu666/httpd" using unqualified-search registries (/etc/containers/registries.conf.d/crio.conf)"
Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.157507246Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:b5cba08c343fb5af20033f082d4f59ca76254aa65f95ccaa131d6b0c8c8ee75b,RepoTags:[docker.io/tonyliu666/httpd:latest],RepoDigests:[docker.io/tonyliu666/httpd@sha256:2d15dd15dd28d8a02162c83bd5817f6093484bc0ec86e65bd6b21c7348569b76],Size_:5611154,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: httpd,},UserSpecifiedImage:,RuntimeHandler:,},Pinned:false,},Info:map[string]string{},}" id=04635601-626c-445d-b5de-f99cf3915899 name=/runtime.v1.ImageService/ImageStatus
Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.159580033Z" level=info msg="Checking image status: tonyliu666/httpd:latest" id=83ec5ee8-517a-41fa-8582-7454492e0bae name=/runtime.v1.ImageService/ImageStatus
Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.160366661Z" level=info msg="Resolving "tonyliu666/httpd" using unqualified-search registries (/etc/containers/registries.conf.d/crio.conf)"
Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.160892752Z" level=info msg="Image status: &ImageStatusResponse{Image:&Image{Id:b5cba08c343fb5af20033f082d4f59ca76254aa65f95ccaa131d6b0c8c8ee75b,RepoTags:[docker.io/tonyliu666/httpd:latest],RepoDigests:[docker.io/tonyliu666/httpd@sha256:2d15dd15dd28d8a02162c83bd5817f6093484bc0ec86e65bd6b21c7348569b76],Size_:5611154,Uid:nil,Username:,Spec:&ImageSpec{Image:,Annotations:map[string]string{io.kubernetes.cri-o.annotations.checkpoint.name: httpd,},UserSpecifiedImage:,RuntimeHandler:,},Pinned:false,},Info:map[string]string{},}" id=83ec5ee8-517a-41fa-8582-7454492e0bae name=/runtime.v1.ImageService/ImageStatus
Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.162814601Z" level=info msg="Creating container: default/httpd3/httpd" id=976c0287-7907-499b-a1e7-21c0df795e67 name=/runtime.v1.RuntimeService/CreateContainer
Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.168168005Z" level=info msg="Allowed annotations are specified for workload []"
Jul 12 13:11:25 workernode01 crio[131529]: time="2024-07-12 13:11:25.168545521Z" level=info msg="RestoreCtr: releasing container name k8s_httpd_httpd3_default_15b8d94a-f2f9-4a31-b097-0ac609e991fd_0" id=976c0287-7907-499b-a1e7-21c0df795e67 name=/runtime.v1.RuntimeService/CreateContainer

kubelet logs:

878 kuberuntime_manager.go:1256] container &Container{Name:httpd,Image:tonyliu666/httpd:latest,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:kube-api-access-sk4bw,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:Always,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod httpd3_default(15b8d94a-f2f9-4a31-b097-0ac609e991fd): CreateContainerError: reading image "c0c20df5e7be79dc8de00a6575529252b684a79b849e8e090517ef451e036966": locating image with ID "c0c20df5e7be79dc8de00a6575529252b684a79b849e8e090517ef451e036966": image not known
Jul 12 13:12:15 workernode01 kubelet[878]: E0712 13:12:15.208106 878 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to "StartContainer" for "httpd" with CreateContainerError: "reading image \"c0c20df5e7be79dc8de00a6575529252b684a79b849e8e090517ef451e036966\": locating image with ID \"c0c20df5e7be79dc8de00a6575529252b684a79b849e8e090517ef451e036966\": image not known"" pod="default/httpd3" podUID="15b8d94a-f2f9-4a31-b097-0ac609e991fd"

The image link: https://hub.docker.com/repository/docker/tonyliu666/httpd/general

@tonyliu666
Author

tonyliu666 commented Jul 13, 2024

@adrianreber, I found that the node which restores the pod fetches the original image (e.g. docker.io/library/nginx) instead of the checkpointed image.

The logs on the node which restores the pod:

Jul 13 10:36:51 workernode02 crio[132631]: time="2024-07-13 10:36:51.289182393Z" level=error msg="RestoreCtr: error encountered: reading image "9c893be668ac8f0304740d6373e6939032016ad390e62f87ed2a9cabc4da13ef": locating image with ID "9c893be668ac8f0304740d6373e6939032016ad390e62f87ed2a9cabc4da13ef": image not known" id=be46f08d-301d-4838-8c8c-73ff91a3d504 name=/runtime.v1.RuntimeService/CreateContainer.

And the image listed on the original node:

crictl images | grep redis
docker.io/library/redis latest 9c893be668ac8 120MB

However, the checkpointed image has an image ID like:

crictl images | grep redis
docker.io/tonyliu666/redis latest 65af94592b670

I've made sure the destination node can successfully pull the checkpointed image from Docker Hub. But when running the pod, it creates the pod using the original image.

@tonyliu666
Author

After checking why this CreateContainerError happens, I found that the reason is: if the destination node doesn't have the original image (e.g. docker.io/library/redis), creating the pod fails because the rootFSImageRef is missing. After pulling the original image, everything works.
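In other words, pre-pulling both images on the destination node before applying the restore pod avoids the error; roughly (the manifest file name is just an example):

crictl pull docker.io/library/redis:latest
crictl pull docker.io/tonyliu666/redis:latest
kubectl apply -f restore-pod.yaml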

@tonyliu666
Author

@adrianreber one more problem is that the postgres pod is still facing the RunContainerError. I have attached the criu.log here.
criu.log
