tests/int: some refactoring, fix a flake #2881

kolyshkin · 2021-04-01T01:30:38Z

See all commits for descriptions. Most important ones are:

tests/int/cpt: fix lazy-pages flakiness

"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:

> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.

I think what happens is:

1. The test runs runc checkpoint in lazy-pages mode in background.
2. The test runs criu lazy-pages in background.
3. The test runs runc restore.

Now, all three are working together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetches missing pages
from runc checkpoint, which serves those pages.

At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.

At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go). This involves a call to
cgroupManager.Destroy, which, if the systemd manager is used,
calls stopUnit, which makes systemd not just remove the unit
but also send SIGTERM to its processes, if there are any.

As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which
criu restores, and thus restoring fails.

(🗒️ For a slightly more detailed description of the above, see #2805 (comment).)

The remedy here is to change the name of systemd unit to which the
container is restored.
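
The remedy can be illustrated with a minimal shell sketch. It assumes the "slice:prefix:name" form of cgroupsPath used with the systemd cgroup driver; the helper names `rand_suffix` and `systemd_cgroups_path`, as well as the slice and prefix values, are hypothetical and are not taken from the test suite.

```shell
# Hypothetical helpers, for illustration only.

# Print 8 random hex characters read from /dev/urandom.
rand_suffix() {
	od -An -N4 -tx1 /dev/urandom | tr -d ' \n'
}

# Build a cgroupsPath in the "slice:prefix:name" form.
# Arguments: slice, prefix, name.
systemd_cgroups_path() {
	echo "$1:$2:$3"
}

# Checkpoint and restore get different unit names, so stopping the unit
# of the checkpointed container cannot signal the restored processes.
ckpt_path=$(systemd_cgroups_path machine.slice runc-cgroups "test-$(rand_suffix)")
rest_path=$(systemd_cgroups_path machine.slice runc-cgroups "test-$(rand_suffix)-restored")

echo "checkpointed container: $ckpt_path"
echo "restored container:     $rest_path"
```

The fixed "-restored" suffix keeps the two names distinct even if both random suffixes happen to collide.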

tests/int: really randomize cgroup/unit names

Commit 41670e2 added some randomization to cgroup paths
and (if systemd cgroup driver is used) systemd unit names,
but

- the randomization was done only if set_cgroups_path is called,
  which is not done by every test;
- the randomization was per bats instance, not per test.

Fix both issues by refactoring init_cgroups_path/set_cgroups_path
(moving variable part to set_cgroups_path), and calling the latter
from runc_spec, so it is now applicable to every container.
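
The difference between per-instance and per-test randomization can be sketched as follows. This is not the actual helper code: `rand_suffix`, `set_cgroups_path_sketch`, and `runc_spec_sketch` are hypothetical stand-ins that only mimic the described structure.

```shell
# Hypothetical sketch; the real helpers differ.

rand_suffix() {
	od -An -N4 -tx1 /dev/urandom | tr -d ' \n'
}

# Per-instance: evaluated once, when the helper file is sourced, so every
# test run by this process sees the same name.
PER_INSTANCE_NAME="test-cgroup-$(rand_suffix)"

# Per-test: evaluated on every call, so each container gets a fresh name.
set_cgroups_path_sketch() {
	CGROUP_NAME="test-cgroup-$(rand_suffix)"
}

# Stand-in for a spec-generating helper: it always calls the setter, so
# tests get randomized names without asking for them explicitly.
runc_spec_sketch() {
	set_cgroups_path_sketch
	echo "$CGROUP_NAME"
}

first=$(runc_spec_sketch)
second=$(runc_spec_sketch)
echo "per-instance: $PER_INSTANCE_NAME (same for all tests)"
echo "per-test:     $first, $second"
```

Calling the setter from the spec helper is what makes the randomization apply to every container rather than only to tests that opt in.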

kolyshkin · 2021-04-01T01:31:02Z

@adrianreber PTAL (last commit only)

@test

Helper function init_cgroup_paths sets two sets of cgroup path variables for the cgroup v1 case (below, XXX is the cgroup controller name, e.g. MEMORY):

1. CGROUP_XXX_BASE_PATH -- path to the XXX controller mount point (e.g. CGROUP_MEMORY_BASE_PATH=/sys/fs/cgroup/memory);
2. CGROUP_XXX -- path to the particular container's XXX controller cgroup (e.g. CGROUP_MEMORY=/sys/fs/cgroup/memory/runc-cgroups-integration-test/test-cgroup).

The second set of variables is mostly used by check_cgroup_value(), with only two exceptions:

- CGROUP_CPU in @test "update rt period and runtime";
- a few CGROUP_XXX in @test "runc delete --force in cgroupv1 with subcgroups".

Remove these variables, as their values are not used much and are easy to get (as can be seen in modified test cases).

While at it, mark some variables as local.

Signed-off-by: Kir Kolyshkin <[email protected]>
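
The modified test cases are not shown here, so the following is only a sketch of how a per-container path might be derived once the CGROUP_XXX variables are gone. It assumes the container's cgroup directory is the controller mount point followed by a relative path; `cgroup_dir` and `REL_PATH` are hypothetical names.

```shell
# Hypothetical helper: join a controller mount point and a relative
# cgroup path into the container's cgroup directory.
# Arguments: base path (e.g. /sys/fs/cgroup/memory), relative path.
cgroup_dir() {
	echo "$1$2"
}

CGROUP_MEMORY_BASE_PATH=/sys/fs/cgroup/memory
REL_PATH=/runc-cgroups-integration-test/test-cgroup

cgroup_dir "$CGROUP_MEMORY_BASE_PATH" "$REL_PATH"
```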

kolyshkin · 2021-04-01T04:04:47Z

Need to figure out why rootless fails after the second commit (force randomized cgroup paths). For tomorrow

adrianreber · 2021-04-01T12:33:51Z

> @adrianreber PTAL (last commit only)

Uh... I like your analysis and the description of the solution. Sounds plausible to me. Not sure I can say anything about the actual code changes. You remove more lines than you add. That is usually a good sign 😄

Commit 41670e2 removed BUSYBOX_BUNDLE env var, but c3ffd2e was developed before 41670e2 was merged. Everything still works because now BUSYBOX_BUNDLE has no value. Nevertheless, let's remove it to avoid confusion.

Signed-off-by: Kir Kolyshkin <[email protected]>

Commit 41670e2 added some randomization to cgroup paths and (if systemd cgroup driver is used) systemd unit names, but the randomization was per bats instance, not per test.

Fix this by refactoring init_cgroups_path/set_cgroups_path (moving variable/random part to set_cgroups_path).

NOTE though that the randomization is only performed for those tests that explicitly call set_cgroups_path.

Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin · 2021-04-01T19:44:27Z

@adrianreber sorry, this is not ready yet; I'm still working on it, and the patch I wanted you to look at has been removed. Basically, I just change the cgroupsPath so the container is restored into a different cgroup. I'll let you know once it's ready.


1. Remove printing criu args as now they are always swrk 3.
2. Remove duplicated "feature check says" debug.

Before:

> DEBU[0000] Using CRIU with following args: [swrk 3]
> DEBU[0000] Using CRIU in FEATURE_CHECK mode
> DEBU[0000] Feature check says: type:FEATURE_CHECK success:true features:<mem_track:false lazy_pages:true >
> DEBU[0000] Feature check says: mem_track:false lazy_pages:true

After:

> DEBU[0000] Using CRIU in FEATURE_CHECK mode
> DEBU[0000] Feature check says: mem_track:false lazy_pages:true

Signed-off-by: Kir Kolyshkin <[email protected]>


kolyshkin · 2021-04-01T21:47:04Z

This presumably fixes a flake, and all the changes (except a couple of removed debug prints in the checkpoint code) are in tests/integration, so adding the rc94 milestone.

kolyshkin · 2021-04-05T23:29:18Z

@cyphar @AkihiroSuda PTAL

kolyshkin mentioned this pull request Apr 1, 2021

not ok 15 checkpoint --lazy-pages and restore (Fedora 33, runc restore fails) #2805

Closed

kolyshkin marked this pull request as draft April 1, 2021 02:37

kolyshkin force-pushed the test-rand-cg branch 2 times, most recently from c34f9f3 to 1265c20 Compare April 1, 2021 02:50

kolyshkin force-pushed the test-rand-cg branch from 1265c20 to f09a3e1 Compare April 1, 2021 03:26

kolyshkin added 2 commits April 1, 2021 11:44

kolyshkin force-pushed the test-rand-cg branch from 0d60c46 to e63df1e Compare April 1, 2021 19:39

kolyshkin added 4 commits April 1, 2021 12:57

tests/int/checkpoint: close fds in check_pipes

b09030a

In check_pipes, make sure we

- close all fds we opened in setup_pipes;
- check that runc stderr is empty (and fail if it's not).

Signed-off-by: Kir Kolyshkin <[email protected]>
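
A simplified sketch of the same idea follows. It is not the code from the test suite: the function names are hypothetical, a regular file stands in for the pipes, and fd 7 is an arbitrary choice.

```shell
# Hypothetical stand-ins for the setup_pipes/check_pipes pair.
workdir=$(mktemp -d)

setup_pipes_sketch() {
	# Open fd 7 to capture the stderr of the command under test.
	exec 7>"$workdir/stderr"
}

check_pipes_sketch() {
	# Close every fd opened in setup, so nothing leaks into later tests.
	exec 7>&-
	# Fail if the command wrote anything to stderr.
	if [ -s "$workdir/stderr" ]; then
		echo "unexpected stderr:" >&2
		cat "$workdir/stderr" >&2
		return 1
	fi
}

setup_pipes_sketch
true 2>&7 # a quiet command: writes nothing to stderr
check_pipes_sketch && echo "stderr is empty"
rm -rf "$workdir"
```

Closing the descriptor before inspecting the captured output keeps the check from leaving the fd open on either the success or the failure path.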

tests/int/checkpoint: close lazy_r fd

0e08900

Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin marked this pull request as ready for review April 1, 2021 20:56

kolyshkin requested a review from AkihiroSuda April 1, 2021 21:24

kolyshkin added the area/ci label Apr 1, 2021

kolyshkin added this to the 1.0.0-rc94 milestone Apr 1, 2021

mrunalp approved these changes Apr 2, 2021


kolyshkin requested a review from cyphar April 5, 2021 23:30

AkihiroSuda approved these changes Apr 6, 2021


AkihiroSuda merged commit c453f1a into opencontainers:master Apr 6, 2021
