checkpoint/restore: implement --manage-cgroups-mode ignore #3546
@opencontainers/runc-maintainers PTAL (this, among other things, fixes a few flaky tests we have had for a few years)
@opencontainers/runc-maintainers PTAL
@adrianreber PTAL
Looks good. Thanks!
@opencontainers/runc-maintainers PTAL 🙏🏻
test -d "$orig_path"

runc checkpoint --work-path ./work-dir --manage-cgroups-mode ignore test_busybox
grep -B 5 Error ./work-dir/dump.log || true
I don't think this is robust, but I don't have an alternative idea, so LGTM.
Eventually we need to have a robust error reporting system.
This is just to print anything containing `Error` (with some context) in case the checkpointing failed. It's merely a way to debug criu failures, and does not affect the test itself (only its output in case of an error). The `|| true` part here is so that the test won't fail in case there's no error; IOW, we ignore the grep exit code.
Alternatively, we could create an after-test artefact containing all the files, but this works OK so far.
Thinking about it, we might work on making runc do something like what grep does here in case of an error. Currently, this is just criu writing the log file.
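As a minimal illustration of the `|| true` idiom discussed in this thread, here is a self-contained sketch. It does not run runc or criu: the log contents are made up, and the `show_errors` helper name is hypothetical (the test itself calls grep directly).

```shell
#!/bin/sh
# Hypothetical helper: print lines containing "Error", with up to 5 lines
# of leading context, without ever failing the caller.
show_errors() {
	# grep exits with 1 when nothing matches (and 2 if the file is missing);
	# "|| true" turns that into success, so a test that treats any non-zero
	# status as a failure keeps going.
	grep -B 5 Error "$1" || true
}

workdir=$(mktemp -d)
# Made-up log contents, standing in for criu's dump.log.
printf 'step 1 ok\nstep 2 ok\n' > "$workdir/clean.log"
printf 'step 1 ok\nError: made-up failure\n' > "$workdir/failed.log"

show_errors "$workdir/clean.log"    # prints nothing, still succeeds
show_errors "$workdir/failed.log"   # prints the Error line and its context
```

The output only matters when something went wrong; the exit status of the helper is always zero, which is exactly why it cannot affect the test result.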
Filed #3711 to not forget about it
Merge the logic of setPageServer, setManageCgroupsMode, and setEmptyNsMask into criuOptions.
This does three things:
1. Fixes ignoring --manage-cgroups-mode on restore;
2. Simplifies the code in checkpoint.go and restore.go;
3. Ensures issues like 1 won't happen again.
Signed-off-by: Kir Kolyshkin <[email protected]>
- add the new mode and document it;
- slightly improve the --help output;
- slightly simplify the parsing code.
Signed-off-by: Kir Kolyshkin <[email protected]>
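For readers unfamiliar with the option, the following is a rough sketch of what accepting a fixed set of mode names looks like. It is not runc's parsing code (runc is written in Go). The `normalize_cgroups_mode` function is hypothetical, the mode names other than `ignore` are the ones I believe runc already accepted before this change, and treating an empty value as `soft` is an assumption.

```shell
#!/bin/sh
# Illustrative only: validate a --manage-cgroups-mode value.
# Assumptions: the accepted names are soft, full, strict and ignore,
# and an empty value falls back to "soft".
normalize_cgroups_mode() {
	case "${1:-soft}" in
	soft | full | strict | ignore)
		echo "${1:-soft}"
		;;
	*)
		echo "invalid manage-cgroups-mode: $1" >&2
		return 1
		;;
	esac
}

normalize_cgroups_mode ignore   # prints "ignore"
```

Keeping all accepted names in a single `case` makes it harder for checkpoint and restore to disagree about which values are valid.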
When manage-cgroups-mode: ignore is used, criu still needs to know the cgroup path to work properly (see [1]).
Revert "libct/criuApplyCgroups: don't set cgroup paths for v2"
This reverts commit d5c57dc.
[1]: checkpoint-restore/criu#1793 (comment)
Signed-off-by: Kir Kolyshkin <[email protected]>
I don't want to implement it now, because this might result in some new issues, but this is definitely something that is worth implementing. Signed-off-by: Kir Kolyshkin <[email protected]>
This test checks that the container is restored into a different cgroup.
To do so, a user should
- use --manage-cgroups-mode ignore on both checkpoint and restore;
- change the cgroupsPath value in config.json before restoring.
The test does some checks to ensure that its logic is correct, and that after the restore the old (original) cgroup does not exist, the new one exists, and the container's init is in that new cgroup.
Signed-off-by: Kir Kolyshkin <[email protected]>
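The last check mentioned above (the container's init being in the new cgroup) can be sketched as follows. This is not the code of the test in this PR. It assumes cgroup v2, where `/proc/<pid>/cgroup` holds a single `0::<path>` entry; the helper names and the cgroup path are hypothetical, and a temporary file stands in for the real `/proc` entry so the sketch runs anywhere.

```shell
#!/bin/sh
# Print the cgroup v2 path recorded in a /proc/<pid>/cgroup-style file.
cgroup_v2_path() {
	# The unified-hierarchy entry has the form "0::<path>".
	sed -n 's/^0:://p' "$1"
}

# Succeed if the process described by file $1 is in cgroup $2.
in_cgroup() {
	[ "$(cgroup_v2_path "$1")" = "$2" ]
}

# Stand-in for /proc/$pid/cgroup of the restored container's init;
# the path below is made up for the example.
fake_proc=$(mktemp)
printf '0::/new-cgroup/test_busybox\n' > "$fake_proc"

if in_cgroup "$fake_proc" /new-cgroup/test_busybox; then
	echo "init is in the new cgroup"
fi
```

Against a live container, the first argument would be `/proc/$pid/cgroup` for the init PID; a cgroup v1 host would need a different check, since each controller has its own line there.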
Signed-off-by: Kir Kolyshkin <[email protected]>
When doing a lazy checkpoint/restore, we should not restore into the same cgroup, otherwise there is a race which results in occasional killing of the restored container (GH opencontainers#2760, opencontainers#2924).
The fix is to use --manage-cgroups-mode=ignore, which allows restoring into a different cgroup.
Note that since cgroupsPath is not set in config.json, the cgroup is derived from the container name, so calling set_cgroups_path is not needed.
For the previous (unsuccessful) attempt to fix this, as well as a detailed (and apparently correct) analysis, see commit 36fe3cc.
Signed-off-by: Kir Kolyshkin <[email protected]>
Rebased to resolve a conflict caused by #3655. @AkihiroSuda please re-LGTM
This patchset fixes a few issues and adds support for `--manage-cgroups-mode ignore`. This option allows restoring a container into a cgroup different from the original one. A test case is added to check that it works as expected. See individual commits for details.
Loosely based on #3447 and my earlier work from 2021.
Closes: #3447
Also, this uses the new mode in the lazy checkpoint/restore test, hopefully fixing that test's flakiness.
Fixes: #2760
Fixes: #2924
Fixes: #2475
PS I am wondering whether this (`--manage-cgroups-mode ignore`) should become the default in 1.2.0, because otherwise changing the `cgroupsPath` in the container's `config.json` before restore doesn't have any effect.