tests/int/checkpoint: fix lazy migration flakiness
When doing a lazy checkpoint/restore, we should not restore into the
same cgroup; otherwise there is a race that occasionally results in
the restored container being killed (GH #2760, #2924).

The fix is to use --manage-cgroups-mode=ignore, which allows restoring
into a different cgroup.

Note that since cgroupsPath is not set in config.json, the cgroup is
derived from the container name, so calling set_cgroups_path is not
needed.

For the previous (unsuccessful) attempt to fix this, as well as detailed
(and apparently correct) analysis, see commit 36fe3cc.

Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin committed Dec 15, 2022
1 parent 6835287 commit c4aa452
Showing 1 changed file with 18 additions and 7 deletions.
--- a/tests/integration/checkpoint.bats
+++ b/tests/integration/checkpoint.bats
@@ -224,7 +224,14 @@ function simple_cr() {
 	# TCP port for lazy migration
 	port=27277
 
-	__runc checkpoint --lazy-pages --page-server 0.0.0.0:${port} --status-fd ${lazy_w} --work-path ./work-dir --image-path ./image-dir test_busybox &
+	__runc checkpoint \
+		--lazy-pages \
+		--page-server 0.0.0.0:${port} \
+		--status-fd ${lazy_w} \
+		--manage-cgroups-mode=ignore \
+		--work-path ./work-dir \
+		--image-path ./image-dir \
+		test_busybox &
 	cpt_pid=$!
 
 	# wait for lazy page server to be ready
@@ -246,14 +253,18 @@ function simple_cr() {
 	lp_pid=$!
 
 	# Restore lazily from checkpoint.
-	# The restored container needs a different name (as well as systemd
-	# unit name, in case systemd cgroup driver is used) as the checkpointed
-	# container is not yet destroyed. It is only destroyed at that point
-	# in time when the last page is lazily transferred to the destination.
+	#
+	# The restored container needs a different name and a different cgroup
+	# (and a different systemd unit name, in case systemd cgroup driver is
+	# used) as the checkpointed container is not yet destroyed. It is only
+	# destroyed at that point in time when the last page is lazily
+	# transferred to the destination.
+	#
 	# Killing the CRIU on the checkpoint side will let the container
 	# continue to run if the migration failed at some point.
-	[ -v RUNC_USE_SYSTEMD ] && set_cgroups_path
-	runc_restore_with_pipes ./image-dir test_busybox_restore --lazy-pages
+	runc_restore_with_pipes ./image-dir test_busybox_restore \
+		--lazy-pages \
+		--manage-cgroups-mode=ignore
 
 	wait $cpt_pid
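
For readers who want to see the two sides of the change outside the Bats harness, the following is a minimal dry-run sketch, not the test's actual helpers. It assumes plain `runc checkpoint` and `runc restore` invocations; the function names `lazy_checkpoint` and `lazy_restore` and the `RUNC` and `PORT` variables are hypothetical, introduced only for illustration. It also omits the `criu lazy-pages` daemon and the status-fd wait that the real test performs between the two steps. By default it only prints the commands.

```shell
# Dry-run sketch: prints the runc command lines instead of executing them.
# RUNC and PORT are illustrative variables, not part of the runc test suite.
set -eu

RUNC=${RUNC:-echo runc}   # set RUNC=runc to actually run the commands
PORT=${PORT:-27277}       # same TCP port as used in the test

# Checkpoint side: start a lazy page server and tell CRIU to ignore cgroups.
lazy_checkpoint() {
	name=$1
	status_fd=$2
	$RUNC checkpoint \
		--lazy-pages \
		--page-server "0.0.0.0:${PORT}" \
		--status-fd "$status_fd" \
		--manage-cgroups-mode=ignore \
		--work-path ./work-dir \
		--image-path ./image-dir \
		"$name"
}

# Restore side: a different container name, hence a different cgroup when
# cgroupsPath is not set in config.json (as the commit message notes).
lazy_restore() {
	name=$1
	$RUNC restore \
		--lazy-pages \
		--manage-cgroups-mode=ignore \
		--image-path ./image-dir \
		"$name"
}
```

Calling `lazy_checkpoint test_busybox 3` followed by `lazy_restore test_busybox_restore` prints both command lines, which makes it easy to check that the cgroup mode is the same on the checkpoint and restore sides.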
