-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tests/int: some refactoring, fix a flake #2881
Conversation
@adrianreber PTAL (last commit only) |
c34f9f3
to
1265c20
Compare
Helper function init_cgroup_paths sets two sets of cgroup path variables for cgroup v1 case (below XXX is cgroup controller name, e.g. MEMORY): 1. CGROUP_XXX_BASE_PATH -- path to XXX controller mount point (e.g. CGROUP_MEMORY_BASE_PATH=/sys/fs/cgroup/memory); 2. CGROUP_XXX -- path to the particular container XXX controller cgroup (e.g. CGROUP_MEMORY=/sys/fs/cgroup/memory/runc-cgroups-integration-test/test-cgroup). The second set of variables is mostly used by check_cgroup_value(), with only two exceptions: - CGROUP_CPU in @test "update rt period and runtime"; - few CGROUP_XXX in @test "runc delete --force in cgroupv1 with subcgroups". Remove these variables, as their values are not used much and are easy to get (as can be seen in modified test cases). While at it, mark some variables as local. Signed-off-by: Kir Kolyshkin <[email protected]>
Need to figure out why rootless fails after the second commit (force randomized cgroup paths). For tomorrow |
Uh... I like your analysis and the description of the solution. Sounds plausible to me. Not sure I can say anything about the actual code changes. You remove more lines than you add. That is usually a good sign 😄 |
Commit 41670e2 removed BUSYBOX_BUNDLE env var, but c3ffd2e was developed before 41670e2 was merged. Everything still works because now BUSYBOX_BUNDLE has no value. Nevertheless, let's remove it to avoid confusion. Signed-off-by: Kir Kolyshkin <[email protected]>
Commit 41670e2 added some randomization to cgroup paths and (if systemd cgroup driver is used) systemd unit names, but the randomization was per bats instance, not per test. Fix this by refactoring init_cgroups_path/set_cgroups_path (moving variable/random part to set_cgroups_path). NOTE though that the randomization is only performed for those tests that explicitly call set_cgroups_path. Signed-off-by: Kir Kolyshkin <[email protected]>
@adrianreber sorry, this is not yet ready, still working on it and I got the patch I wanted you to look at removed. Basically I just change the cgroupsPath, so the container is restored into a different cgroup. I'll let you know once it's ready. |
In check_pipes, make sure we - close all fds we opened in setup_pipes; - check that runc stderr is empty (and fail if it's not). Signed-off-by: Kir Kolyshkin <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
1. Remove printing criu args as now they are *always swrk 3. 2. Remove duplicated "feature check says" debug. Before: > DEBU[0000] Using CRIU with following args: [swrk 3] > DEBU[0000] Using CRIU in FEATURE_CHECK mode > DEBU[0000] Feature check says: type:FEATURE_CHECK success:true features:<mem_track:false lazy_pages:true > > DEBU[0000] Feature check says: mem_track:false lazy_pages:true After: > DEBU[0000] Using CRIU in FEATURE_CHECK mode > DEBU[0000] Feature check says: mem_track:false lazy_pages:true Signed-off-by: Kir Kolyshkin <[email protected]>
"checkpoint --lazy-pages and restore" test sometimes fails on restore in our CI on Fedora 33 when systemd cgroup driver is used: > (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f > (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated > ... > (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED. I think what happens is 1. The test runs runc checkpoint in lazy-pages mode in background. 2. The test runs criu lazy-pages in background. 3. The test runs runc restore. Now, all three are working in together: criu restore restores, criu lazy-pages listens for page faults on a uffd and fetch missing pages from runc checkpoint, who serves those pages. At some point criu lazy-pages decides to fetch the rest of the pages, and once it's done it exits, and runc checkpoint, as there are no more pages to serve, exits too. At the end of runc checkpoint the container is removed (see "defer destroy(container)" in checkpoint.go. This involves a call to cgroupManager.Destroy, which, in case systemd manager is used, calls stopUnit, which makes systemd to not just remove the unit, but also send SIGTERM to its processes, if there are any. As the container is being restored into the same systemd unit, sometimes this results in sending SIGTERM to a process which criu restores, and thus restoring fails. The remedy here is to change the name of systemd unit to which the container is restored. Signed-off-by: Kir Kolyshkin <[email protected]>
This presumably fixes a flake, and all the changes (except couple of removed debug prints in the checkpoint code) are in tests/integration, so adding rc94 milestone. |
@cyphar @AkihiroSuda PTAL |
Fixes #2805
See all commits for descriptions. Most important ones are:
tests/int/cpt: fix lazy-pages flakiness
"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:
I think what happens is
Now, all three are working in together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetch missing pages
from runc checkpoint, who serves those pages.
At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.
At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go. This involves a call to
cgroupManager.Destroy, which, in case systemd manager is used,
calls stopUnit, which makes systemd to not just remove the unit,
but also send SIGTERM to its processes, if there are any.
As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which
criu restores, and thus restoring fails.
🗒️ for slightly more detailed description of the above, see #2805 (comment))
The remedy here is to change the name of systemd unit to which the
container is restored.
tests/int: really randomize cgroup/unit names
Commit 41670e2 added some randomization to cgroup paths
and (if systemd cgroup driver is used) systemd unit names,
but
the randomization was done only if set_cgroups_path is called,
which is not done by every test;
the randomization was per bats instance, not per test.
Fix both issues by refactoring init_cgroups_path/set_cgroups_path
(moving variable part to set_cgroups_path), and calling the latter
from runc_spec, so it is now applicable to every container.