
runc run/create: refuse non-empty cgroup; runc exec: refuse frozen cgroup #3131

Closed
kolyshkin wants to merge 20 commits from the no-hang-on-frozen branch

Conversation

@kolyshkin (Contributor) commented Aug 5, 2021

OK, this is being split into a few more digestible PRs:

I. cgroup manager refactoring

Very high level overview:

  • allow cgroup manager instantiation (NewManager) to return errors;
  • initialize cgroup paths during instantiation, so paths are always available
    (and now a new instance can be used for e.g. Destroy or Exists,
    which was not possible before);
  • simplify cgroup manager instantiation, replacing the complicated
    logic of choosing a cgroup manager with a single manager.New() call;
  • fix some panics in the cgroup manager (with Resources == nil).

This paves the way for the commits that follow, and makes the cgroup manager
easier to use from e.g. Kubernetes.
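
As a rough illustration, here is a minimal sketch of how a caller might use the new entry point after this refactoring (field values and error messages are made up for the example; the field names follow the changelog below):

```go
package main

import (
	"fmt"

	"github.com/opencontainers/runc/libcontainer/cgroups/manager"
	"github.com/opencontainers/runc/libcontainer/configs"
)

func main() {
	cg := &configs.Cgroup{
		Name:      "mycontainer", // example values, not a real setup
		Parent:    "system.slice",
		Systemd:   true,  // new field: use the systemd cgroup driver
		Rootless:  false, // new field: rootless cgroups
		Resources: &configs.Resources{},
	}

	// manager.New picks the right implementation (fs, fs2, or systemd)
	// and initializes cgroup paths, so it can fail early.
	mgr, err := manager.New(cg)
	if err != nil {
		fmt.Println("cannot create cgroup manager:", err)
		return
	}

	// Since paths are set up at instantiation, methods like Exists
	// work right away, with no prior Apply call.
	fmt.Println("cgroup exists:", mgr.Exists())
}
```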

Fixes: #3177
Fixes: #3178

II. runc run/create: refuse non-empty cgroup

Currently runc allows multiple containers to share the same cgroup (for
example, by having the same cgroupPath in config.json). While such
shared configuration might be OK, there are some issues:

  • When each container has its own resource limits, the order in which
    the containers are started determines whose limits are actually applied.

  • When one of the containers is paused, all the others are paused, too.

  • When a container is paused, any attempt to do runc create/run/exec
    ends up with runc init stuck inside a frozen cgroup.

  • When the systemd cgroup manager is used, this gets even worse: stopping
    (or even a failed start of) any container results in a
    "stopTransientUnit" command being sent to systemd, so (depending
    on unit properties) other containers can receive SIGTERM, be killed
    after a timeout, etc.

All this may lead to various hard-to-debug situations in production
(runc init stuck, cgroup removal errors, wrong resource limits, container
init not reaping zombies, etc.).

One obvious solution is to require a non-existent or empty cgroup
for the new container, and fail otherwise.

This is exactly what this PR implements.
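
For illustration, the essence of the check can be sketched like this (a standalone approximation, not the actual runc code; on cgroup v1, every controller's hierarchy would have to be checked):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// checkCgroupEmpty fails if the cgroup directory already exists and has
// member processes listed in its cgroup.procs file.
func checkCgroupEmpty(dir string) error {
	data, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
	if os.IsNotExist(err) {
		return nil // the cgroup does not exist yet -- that's fine
	}
	if err != nil {
		return err
	}
	if pids := strings.Fields(string(data)); len(pids) > 0 {
		return fmt.Errorf("cgroup %s is not empty: %d process(es) inside", dir, len(pids))
	}
	return nil
}

func main() {
	// Hypothetical path, for illustration only.
	if err := checkCgroupEmpty("/sys/fs/cgroup/mycontainer"); err != nil {
		fmt.Println("refusing to create container:", err)
	}
}
```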

III. runc delete -f: don't hang on a paused container

runc delete -f used to hang if the container was paused (on cgroup v1).
Fix this; add a test case.

Fixes: #3134
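
The idea behind the fix, sketched as standalone code (names and paths are assumptions; the real change lives in runc's destroy path): thaw the container's v1 freezer cgroup so that the subsequent SIGKILL can actually be delivered.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// forceKill thaws the container's cgroup v1 freezer and then sends
// SIGKILL, which a frozen process would otherwise never receive.
func forceKill(freezerDir string, initPid int) error {
	state := filepath.Join(freezerDir, "freezer.state")
	// THAWED is the v1 freezer state meaning "running".
	if err := os.WriteFile(state, []byte("THAWED"), 0o644); err != nil && !os.IsNotExist(err) {
		return err
	}
	return syscall.Kill(initPid, syscall.SIGKILL)
}

func main() {
	// Hypothetical values, for illustration only.
	if err := forceKill("/sys/fs/cgroup/freezer/mycontainer", 12345); err != nil {
		fmt.Println("force delete failed:", err)
	}
}
```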

IV. runc exec: refuse paused container

This bugged me a few times during runc development. A new container is
run, and suddenly runc init is stuck, 🎶 and nothing ever happens, and
I wonder... 🎶

Figuring out that the cause is a (pre-created) frozen cgroup is far from
obvious (to say the least).

The fix is to add a check that refuses to exec in a paused container.
A (very bad) alternative to that would be to thaw the cgroup.

Implement the fix, add a test case.
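
The guard itself is tiny; here is a sketch with a hypothetical Status type (runc's real check sits in the exec code path and consults the container's actual status):

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a stand-in for the container status type runc tracks.
type Status int

const (
	Running Status = iota
	Paused
)

// checkExec refuses to exec into a paused container instead of letting
// runc init hang inside the frozen cgroup.
func checkExec(s Status) error {
	if s == Paused {
		return errors.New("cannot exec in a paused container (resume it first)")
	}
	return nil
}

func main() {
	fmt.Println(checkExec(Paused))
}
```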

V. runc run: refuse cgroup if frozen

Sometimes a container cgroup already exists but is frozen.
When this happens, runc init hangs, and it's not clear what is going on.

Refuse to run in a frozen cgroup; add a test case.


Fixes: #3132
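
For reference, detecting a pre-existing frozen cgroup boils down to reading one file; a sketch (assumed paths, not the exact runc code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cgroupFrozen reports whether the cgroup at dir is frozen, checking the
// v2 cgroup.freeze knob first, then falling back to the v1 freezer.state.
func cgroupFrozen(dir string) (bool, error) {
	if data, err := os.ReadFile(filepath.Join(dir, "cgroup.freeze")); err == nil {
		return strings.TrimSpace(string(data)) == "1", nil
	}
	data, err := os.ReadFile(filepath.Join(dir, "freezer.state"))
	if os.IsNotExist(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	state := strings.TrimSpace(string(data))
	return state == "FROZEN" || state == "FREEZING", nil
}

func main() {
	// Hypothetical path, for illustration only.
	if frozen, err := cgroupFrozen("/sys/fs/cgroup/mycontainer"); err == nil && frozen {
		fmt.Println("refusing to run: the container's cgroup is frozen")
	}
}
```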

Proposed changelog entry

libcontainer API changes:
* New configs.Cgroup structure fields (#3177):
  * Systemd (whether to use systemd cgroup manager);
  * Rootless (whether to use rootless cgroups).
* New cgroups/manager package aiming to simplify cgroup manager instantiation. (#3177)
* All cgroup managers' instantiation methods now initialize cgroup paths and can return errors.
  This allows using any cgroup manager method (e.g. Exists, Destroy, Set, GetStats) right after
  instantiation, which was not possible before (as paths were only initialized in Apply). (#3178)
Bugfixes:
* runc delete -f now succeeds (rather than timing out) on a paused container (#3134)
* runc exec now refuses to exec in a paused container (#3132)
* runc run/start now refuses to start a container in a non-empty cgroup (#3132)
* runc run/start now refuses to start a container if its cgroup is frozen (#3132)

@kolyshkin kolyshkin marked this pull request as draft August 6, 2021 01:56
@kolyshkin kolyshkin changed the title runc create/run/exec: refuse a frozen cgroup runc run/create: refuse existing cgroup; runc exec: refuse frozen cgroup Aug 6, 2021
@kolyshkin kolyshkin marked this pull request as ready for review August 24, 2021 21:28
It does not make sense to calculate slice and unit 10+ times.
Move those out of the loop.

Signed-off-by: Kir Kolyshkin <[email protected]>
As ExpandSlice("system.slice") returns "/system.slice", there is no need
to call it for such paths (and the slash will be added by path.Join
anyway).

The same optimization was already done for v2 as part of commit
bf15cc9.

Signed-off-by: Kir Kolyshkin <[email protected]>
1. Make the Rootless and Systemd flags part of config.Cgroups.

2. Make all cgroup managers (not just fs2) return an error (so that more
   initialization can be done there -- added by the following commits).

3. Replace the complicated cgroup manager instantiation in factory_linux
   with a single (and simple) libcontainer/cgroups/manager.New() function.

4. getUnifiedPath is simplified to check that only a single path is
   supplied (rather than checking that other paths, if supplied,
   are the same).

Signed-off-by: Kir Kolyshkin <[email protected]>
Currently fs.go is not very readable, as its public API functions are
intermixed with the internals of cgroup path handling.

Move that out to paths.go, without changing any code.

Same for the tests -- move paths-related tests to paths_test.go.

This commit is separate to make the review easier.

Signed-off-by: Kir Kolyshkin <[email protected]>
In case c.Path is set, c.Name and c.Parent are not used, so the
calls to utils.CleanPath are entirely unnecessary. Move them
inside the "if" statement body.

Get rid of the intermediate cgPath variable, it is not needed.

Signed-off-by: Kir Kolyshkin <[email protected]>
As this is called from the Apply() method, it's a natural name.

Signed-off-by: Kir Kolyshkin <[email protected]>
1. Dismantle and remove struct cgroupData. It contained three unrelated
   entities (cgroup paths, pid, and resources), and made the code
   harder to read. Most importantly, though, it is not needed.
   Now, subsystems' Apply methods take path, resources, and pid.

   To a reviewer -- the core of the changes is in fs.go and paths.go,
   the rest of it is adapting to the new signatures and related test
   changes.

2. Dismantle and remove struct cgroupTestUtil. This is a follow-up
   to the previous item -- since cgroupData is gone, there is nothing
   to hold in cgroupTestUtil. The change itself is very small (see
   util_test.go), but it makes this patch big, mostly because
   helper.cgroup.Resources had to be replaced with &config.Resources{}.

Signed-off-by: Kir Kolyshkin <[email protected]>
1. Separate path initialization logic from Apply to initPaths,
   and call initPaths from NewManager, so we:
   - can error out early (in NewManager rather than Apply);
   - always have m.paths available (e.g. in Destroy or Exists);
   - do not unnecessarily call subsysPath from Apply in case
     the paths were already provided.

2. Add a check for non-nil cgroups.Resources to NewManager,
   since initPaths, as well as some controller's Apply methods,
   need it.

3. Move cgroups.Resources.Unified check from Apply to NewManager,
   so we can error out early (same check exists in Set).

Signed-off-by: Kir Kolyshkin <[email protected]>
This is already documented but I guess more explanations won't hurt.

Signed-off-by: Kir Kolyshkin <[email protected]>
This way we
 - won't re-initialize the paths if they were provided;
 - will always have paths ready for every method.

Signed-off-by: Kir Kolyshkin <[email protected]>
cgName and cgParent are only used when cgPath is empty, so move
their cleaning to the body of the appropriate "if" statement.

Signed-off-by: Kir Kolyshkin <[email protected]>
This fixes the same issue as e.g. commit 4f8ccc5
but in a more universal way.

Signed-off-by: Kir Kolyshkin <[email protected]>
Many operations require fsMgr, so let's create it right in
NewUnifiedManager and reuse it.

Signed-off-by: Kir Kolyshkin <[email protected]>
Cgroup controllers should never panic, and yet sometimes they do.

Add a unit test to check that controllers never panic when called with
nil arguments and/or resources, and fix a few found cases.

Signed-off-by: Kir Kolyshkin <[email protected]>
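
A sketch of what the nil-safety test described above can look like, using a hypothetical controller interface (the real test iterates over the actual fs controllers):

```go
package fs_test

import (
	"testing"

	"github.com/opencontainers/runc/libcontainer/configs"
)

// setter is a hypothetical stand-in for a cgroup controller's Set method.
type setter interface {
	Set(path string, r *configs.Resources) error
}

// testSetDoesNotPanic calls Set with empty and nil resources and fails
// the test if the controller panics instead of returning an error.
func testSetDoesNotPanic(t *testing.T, name string, c setter) {
	defer func() {
		if r := recover(); r != nil {
			t.Errorf("controller %s panicked: %v", name, r)
		}
	}()
	_ = c.Set(t.TempDir(), &configs.Resources{})
	_ = c.Set(t.TempDir(), nil)
}
```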
runc delete -f does not work for a paused container, since in cgroup v1
SIGKILL does nothing if a process is frozen (unlike cgroup v2, where a
frozen process can be killed with a fatal signal).

Theoretically, we only need this for v1, but doing it for v2 as well is
OK.

Signed-off-by: Kir Kolyshkin <[email protected]>
Currently, if a container is paused (or its cgroup is frozen), runc exec
just hangs, and it is not obvious why.

Refuse to exec in a frozen container. Add a test case.

Signed-off-by: Kir Kolyshkin <[email protected]>
Currently runc allows multiple containers to share the same cgroup (for
example, by having the same cgroupPath in config.json). While such
shared configuration might be OK, there are some issues:

 - When each container has its own resource limits, the order in which
   the containers are started determines whose limits are actually applied.

 - When one of the containers is paused, all the others are paused, too.

 - When a container is paused, any attempt to do runc create/run/exec
   ends up with runc init stuck inside a frozen cgroup.

 - When the systemd cgroup manager is used, this gets even worse: stopping
   (or even a failed start of) any container results in a
   "stopTransientUnit" command being sent to systemd, so (depending on
   unit properties) other containers can receive SIGTERM, be killed after a
   timeout, etc.

Any of the above may lead to various hard-to-debug situations in production
(runc init stuck, cgroup removal error, wrong resource limits, init not
reaping zombies etc.).

One obvious solution is to refuse a non-empty cgroup when starting a new
container.

This is exactly what this commit implements.

Signed-off-by: Kir Kolyshkin <[email protected]>
Sometimes a container cgroup already exists but is frozen.
When this happens, runc init hangs, and it's not clear what is going on.

Refuse to run in a frozen cgroup; add a test case.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin (Contributor, Author) commented:

OK, it seems I have to split this one into a few smaller PRs. Alas, the biggest part is the cgroup manager refactoring, which seemingly can't be split.

@kolyshkin kolyshkin modified the milestones: 1.1.0, 1.2.0 Sep 14, 2021
@cyphar (Member) commented Sep 15, 2021

Regarding this part:

II. runc run/create: refuse non-empty cgroup

A long time ago there were discussions of adding docker run --cgroup=other-container. Was this feature never implemented / are we really sure nobody uses the shared-cgroup "feature" we have at the moment?

@kolyshkin (Contributor, Author) commented Sep 15, 2021

Regarding this part:

II. runc run/create: refuse non-empty cgroup

A long time ago there were discussions of adding docker run --cgroup=other-container. Was this feature never implemented / are we really sure nobody uses the shared-cgroup "feature" we have at the moment?

I found traces of this when working on #3136, and yes, it seems this was never implemented. Looking into moby/moby#18654, it seems a common cgroup parent for containers was suggested instead (ultimately, this is what the "pod" concept, introduced later, does).

Yet I don't know if anyone actually shares a cgroup between multiple containers in the wild, and if anyone does, this can be a breaking change. I still think the check is needed, as such setups introduce a number of issues, outlined above as well as in #3132, and are therefore broken.

I guess unless we try this out, we'll never know whether this is used in the wild, and how often. If it turns out to be used often, we can revert; if it's used rarely, we can add a mechanism to skip the check (an env var, a flag file, or something).

@cyphar (Member) commented Sep 15, 2021

@kolyshkin

Yet I don't know if anyone actually shares a cgroup for multiple containers in the wild, and if yes, it can be a breaking change. I still think it is needed as such setups introduce a number of issues, outlined above as well as in #3132, and are therefore broken.

Yeah, I agree -- the fact that we haven't gotten any bug reports about how broken it is leads me to believe nobody is using it.

@kolyshkin (Contributor, Author) commented:

This has finally been split into #3214, #3215, #3216, #3217 and #3223, so this one can be closed now :)

@kolyshkin kolyshkin closed this Sep 23, 2021