
runc run/create: refuse non-empty cgroup; runc exec: refuse frozen cgroup #3131

Closed
kolyshkin wants to merge 20 commits from the no-hang-on-frozen branch

Conversation

@kolyshkin (Contributor) commented Aug 5, 2021

OK, this is being split into a few more digestible PRs:

I. cgroup manager refactoring

Very high level overview:

  • allow cgroup manager instantiation (NewManager) to return errors;
  • initialize cgroup paths during instantiation, so paths are always available
    (and now a new instance can be used for e.g. Destroy or Exists,
    which was not possible before);
  • simplify cgroup manager instantiation, replacing the complicated
    logic of choosing a cgroup manager with a single manager.New() call;
  • fix some panics in the cgroup manager (with Resources == nil).

This paves the way for the commits that follow, and makes the cgroup manager
easier to use from e.g. Kubernetes.
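
As a rough illustration, here is a minimal sketch of how a caller might use the new entry point after this refactoring (field values and error messages are made up for the example; the field names follow the changelog below):

```go
package main

import (
	"fmt"

	"github.com/opencontainers/runc/libcontainer/cgroups/manager"
	"github.com/opencontainers/runc/libcontainer/configs"
)

func main() {
	cg := &configs.Cgroup{
		Name:      "mycontainer", // example values, not a real setup
		Parent:    "system.slice",
		Systemd:   true,  // new field: use the systemd cgroup driver
		Rootless:  false, // new field: rootless cgroups
		Resources: &configs.Resources{},
	}

	// manager.New picks the right implementation (fs, fs2, or systemd)
	// and initializes cgroup paths, so it can fail early.
	mgr, err := manager.New(cg)
	if err != nil {
		fmt.Println("cannot create cgroup manager:", err)
		return
	}

	// Since paths are set up at instantiation, methods like Exists
	// work right away, with no prior Apply call.
	fmt.Println("cgroup exists:", mgr.Exists())
}
```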

Fixes: #3177
Fixes: #3178

II. runc run/create: refuse non-empty cgroup

Currently runc allows multiple containers to share the same cgroup (for
example, by having the same cgroupPath in config.json). While such
shared configuration might be OK, there are some issues:

  • When each container has its own resource limits, the order in which
    the containers are started determines whose limits are actually applied.

  • When one of the containers is paused, all the others are paused, too.

  • When a container is paused, any attempt to do runc create/run/exec
    ends up with runc init stuck inside a frozen cgroup.

  • When the systemd cgroup manager is used, this gets even worse: stopping
    (or even a failed start of) any container results in a
    "stopTransientUnit" command being sent to systemd, so (depending
    on unit properties) other containers can receive SIGTERM, be killed
    after a timeout, etc.

All this may lead to various hard-to-debug situations in production
(runc init stuck, cgroup removal errors, wrong resource limits, container
init not reaping zombies, etc.).

One obvious solution is to require a non-existent or empty cgroup
for the new container, and fail otherwise.

This is exactly what this PR implements.
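
For illustration, the essence of the check can be sketched like this (a standalone approximation, not the actual runc code; on cgroup v1, every controller's hierarchy would have to be checked):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// checkCgroupEmpty fails if the cgroup directory already exists and has
// member processes listed in its cgroup.procs file.
func checkCgroupEmpty(dir string) error {
	data, err := os.ReadFile(filepath.Join(dir, "cgroup.procs"))
	if os.IsNotExist(err) {
		return nil // the cgroup does not exist yet -- that's fine
	}
	if err != nil {
		return err
	}
	if pids := strings.Fields(string(data)); len(pids) > 0 {
		return fmt.Errorf("cgroup %s is not empty: %d process(es) inside", dir, len(pids))
	}
	return nil
}

func main() {
	// Hypothetical path, for illustration only.
	if err := checkCgroupEmpty("/sys/fs/cgroup/mycontainer"); err != nil {
		fmt.Println("refusing to create container:", err)
	}
}
```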

III. runc delete -f: don't hang on a paused container

runc delete -f used to hang if the container was paused (on cgroup v1).
Fix this; add a test case.

Fixes: #3134
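
The idea behind the fix, sketched as standalone code (names and paths are assumptions; the real change lives in runc's destroy path): thaw the container's v1 freezer cgroup so that the subsequent SIGKILL can actually be delivered.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// forceKill thaws the container's cgroup v1 freezer and then sends
// SIGKILL, which a frozen process would otherwise never receive.
func forceKill(freezerDir string, initPid int) error {
	state := filepath.Join(freezerDir, "freezer.state")
	// THAWED is the v1 freezer state meaning "running".
	if err := os.WriteFile(state, []byte("THAWED"), 0o644); err != nil && !os.IsNotExist(err) {
		return err
	}
	return syscall.Kill(initPid, syscall.SIGKILL)
}

func main() {
	// Hypothetical values, for illustration only.
	if err := forceKill("/sys/fs/cgroup/freezer/mycontainer", 12345); err != nil {
		fmt.Println("force delete failed:", err)
	}
}
```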

IV. runc exec: refuse paused container

This bugged me a few times during runc development. A new container is
run, and suddenly runc init is stuck, 🎶 and nothing ever happens, and
I wonder... 🎶

Figuring out that the cause is a (pre-created) frozen cgroup is far from
obvious (to say the least).

The fix is to add a check that refuses to exec in a paused container.
A (very bad) alternative to that would be to thaw the cgroup.

Implement the fix, add a test case.
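
The guard itself is tiny; here is a sketch with a hypothetical Status type (runc's real check sits in the exec code path and consults the container's actual status):

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a stand-in for the container status type runc tracks.
type Status int

const (
	Running Status = iota
	Paused
)

// checkExec refuses to exec into a paused container instead of letting
// runc init hang inside the frozen cgroup.
func checkExec(s Status) error {
	if s == Paused {
		return errors.New("cannot exec in a paused container (resume it first)")
	}
	return nil
}

func main() {
	fmt.Println(checkExec(Paused))
}
```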

V. runc run: refuse cgroup if frozen

Sometimes a container cgroup already exists but is frozen.
When this happens, runc init hangs, and it's not clear what is going on.

Refuse to run in a frozen cgroup; add a test case.


Fixes: #3132
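
For reference, detecting a pre-existing frozen cgroup boils down to reading one file; a sketch (assumed paths, not the exact runc code):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// cgroupFrozen reports whether the cgroup at dir is frozen, checking the
// v2 cgroup.freeze knob first, then falling back to the v1 freezer.state.
func cgroupFrozen(dir string) (bool, error) {
	if data, err := os.ReadFile(filepath.Join(dir, "cgroup.freeze")); err == nil {
		return strings.TrimSpace(string(data)) == "1", nil
	}
	data, err := os.ReadFile(filepath.Join(dir, "freezer.state"))
	if os.IsNotExist(err) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	state := strings.TrimSpace(string(data))
	return state == "FROZEN" || state == "FREEZING", nil
}

func main() {
	// Hypothetical path, for illustration only.
	if frozen, err := cgroupFrozen("/sys/fs/cgroup/mycontainer"); err == nil && frozen {
		fmt.Println("refusing to run: the container's cgroup is frozen")
	}
}
```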

Proposed changelog entry

libcontainer API changes:
* New configs.Cgroup structure fields (#3177):
  * Systemd (whether to use systemd cgroup manager);
  * Rootless (whether to use rootless cgroups).
* New cgroups/manager package aiming to simplify cgroup manager instantiation. (#3177)
* All cgroup managers' instantiation methods now initialize cgroup paths and can return errors.
  This allows using any cgroup manager method (e.g. Exists, Destroy, Set, GetStats) right after
  instantiation, which was not possible before (as paths were only initialized in Apply). (#3178)
Bugfixes:
* runc delete -f now succeeds (rather than timing out) on a paused container (#3134)
* runc exec now refuses to exec in a paused container (#3132)
* runc run/start now refuses to start a container in a non-empty cgroup (#3132)
* runc run/start now refuses to start a container if its cgroup is frozen (#3132)

@kolyshkin kolyshkin marked this pull request as draft August 6, 2021 01:56
@kolyshkin kolyshkin changed the title runc create/run/exec: refuse a frozen cgroup runc run/create: refuse existing cgroup; runc exec: refuse frozen cgroup Aug 6, 2021
@kolyshkin kolyshkin marked this pull request as ready for review August 24, 2021 21:28
It does not make sense to calculate slice and unit 10+ times.
Move those out of the loop.

Signed-off-by: Kir Kolyshkin <[email protected]>
As ExpandSlice("system.slice") returns "/system.slice", there is no need
to call it for such paths (and the slash will be added by path.Join
anyway).

The same optimization was already done for v2 as part of commit
bf15cc9.

Signed-off-by: Kir Kolyshkin <[email protected]>
1. Make the Rootless and Systemd flags part of config.Cgroups.

2. Make all cgroup managers (not just fs2) return an error (so that more
   initialization can be done there -- added by the following commits).

3. Replace the complicated cgroup manager instantiation in factory_linux
   with a single (and simple) libcontainer/cgroups/manager.New() function.

4. getUnifiedPath is simplified to check that only a single path is
   supplied (rather than checking that other paths, if supplied,
   are the same).

Signed-off-by: Kir Kolyshkin <[email protected]>
Currently fs.go is not very readable, as its public API functions are
intermixed with the internals of cgroup path handling.

Move that out to paths.go, without changing any code.

Same for the tests -- move paths-related tests to paths_test.go.

This commit is separate to make the review easier.

Signed-off-by: Kir Kolyshkin <[email protected]>
In case c.Path is set, c.Name and c.Parent are not used, so the
calls to utils.CleanPath are entirely unnecessary. Move them
inside the "if" statement body.

Get rid of the intermediate cgPath variable, it is not needed.

Signed-off-by: Kir Kolyshkin <[email protected]>
As this is called from the Apply() method, it's a natural name.

Signed-off-by: Kir Kolyshkin <[email protected]>
1. Dismantle and remove struct cgroupData. It contained three unrelated
   entities (cgroup paths, pid, and resources), and made the code
   harder to read. Most importantly, though, it is not needed.
   Now, subsystems' Apply methods take path, resources, and pid.

   To a reviewer -- the core of the changes is in fs.go and paths.go,
   the rest of it is adapting to the new signatures and related test
   changes.

2. Dismantle and remove struct cgroupTestUtil. This is a follow-up
   to the previous item -- since cgroupData is gone, there is nothing
   to hold in cgroupTestUtil. The change itself is very small (see
   util_test.go), but it makes this patch big, mostly because
   helper.cgroup.Resources had to be replaced with &config.Resources{}.

Signed-off-by: Kir Kolyshkin <[email protected]>
1. Separate path initialization logic from Apply to initPaths,
   and call initPaths from NewManager, so we:
   - can error out early (in NewManager rather than Apply);
   - always have m.paths available (e.g. in Destroy or Exists);
   - do not unnecessarily call subsysPath from Apply in case
     the paths were already provided.

2. Add a check for non-nil cgroups.Resources to NewManager,
   since initPaths, as well as some controller's Apply methods,
   need it.

3. Move cgroups.Resources.Unified check from Apply to NewManager,
   so we can error out early (same check exists in Set).

Signed-off-by: Kir Kolyshkin <[email protected]>
This is already documented but I guess more explanations won't hurt.

Signed-off-by: Kir Kolyshkin <[email protected]>
This way we
 - won't re-initialize the paths if they were provided;
 - will always have paths ready for every method.

Signed-off-by: Kir Kolyshkin <[email protected]>
cgName and cgParent are only used when cgPath is empty, so move
their cleaning to the body of the appropriate "if" statement.

Signed-off-by: Kir Kolyshkin <[email protected]>
This fixes the same issue as e.g. commit 4f8ccc5
but in a more universal way.

Signed-off-by: Kir Kolyshkin <[email protected]>
Many operations require fsMgr, so let's create it right in
NewUnifiedManager and reuse it.

Signed-off-by: Kir Kolyshkin <[email protected]>
Cgroup controllers should never panic, and yet sometimes they do.

Add a unit test to check that controllers never panic when called with
nil arguments and/or resources, and fix a few found cases.

Signed-off-by: Kir Kolyshkin <[email protected]>
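
A sketch of what the nil-safety test described above can look like, using a hypothetical controller interface (the real test iterates over the actual fs controllers):

```go
package fs_test

import (
	"testing"

	"github.com/opencontainers/runc/libcontainer/configs"
)

// setter is a hypothetical stand-in for a cgroup controller's Set method.
type setter interface {
	Set(path string, r *configs.Resources) error
}

// testSetDoesNotPanic calls Set with empty and nil resources and fails
// the test if the controller panics instead of returning an error.
func testSetDoesNotPanic(t *testing.T, name string, c setter) {
	defer func() {
		if r := recover(); r != nil {
			t.Errorf("controller %s panicked: %v", name, r)
		}
	}()
	_ = c.Set(t.TempDir(), &configs.Resources{})
	_ = c.Set(t.TempDir(), nil)
}
```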
runc delete -f does not work for a paused container, since in cgroup v1
SIGKILL does nothing if a process is frozen (unlike cgroup v2, where a
frozen process can be killed with a fatal signal).

Theoretically, we only need this for v1, but doing it for v2 as well is
OK.

Signed-off-by: Kir Kolyshkin <[email protected]>
Currently, if a container is paused (or its cgroup is frozen), runc exec
just hangs, and it is not obvious why.

Refuse to exec in a frozen container. Add a test case.

Signed-off-by: Kir Kolyshkin <[email protected]>
Currently runc allows multiple containers to share the same cgroup (for
example, by having the same cgroupPath in config.json). While such
shared configuration might be OK, there are some issues:

 - When each container has its own resource limits, the order in which
   the containers are started determines whose limits are actually applied.

 - When one of the containers is paused, all the others are paused, too.

 - When a container is paused, any attempt to do runc create/run/exec
   ends up with runc init stuck inside a frozen cgroup.

 - When the systemd cgroup manager is used, this gets even worse: stopping
   (or even a failed start of) any container results in a
   "stopTransientUnit" command being sent to systemd, so (depending on
   unit properties) other containers can receive SIGTERM, be killed after a
   timeout, etc.

Any of the above may lead to various hard-to-debug situations in production
(runc init stuck, cgroup removal error, wrong resource limits, init not
reaping zombies etc.).

One obvious solution is to refuse a non-empty cgroup when starting a new
container.

This is exactly what this commit implements.

Signed-off-by: Kir Kolyshkin <[email protected]>
Sometimes a container cgroup already exists but is frozen.
When this happens, runc init hangs, and it's not clear what is going on.

Refuse to run in a frozen cgroup; add a test case.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin (Contributor, Author) commented:

OK, it seems I have to split this one into a few smaller PRs. Alas, the biggest part is the cgroup manager refactoring, which seemingly can't be split.

@kolyshkin kolyshkin modified the milestones: 1.1.0, 1.2.0 Sep 14, 2021
@cyphar (Member) commented Sep 15, 2021

Regarding this part:

II. runc run/create: refuse non-empty cgroup

A long time ago there were discussions of adding docker run --cgroup=other-container. Was this feature never implemented / are we really sure nobody uses the shared-cgroup "feature" we have at the moment?

@kolyshkin (Contributor, Author) commented Sep 15, 2021

Regarding this part:

II. runc run/create: refuse non-empty cgroup

A long time ago there were discussions of adding docker run --cgroup=other-container. Was this feature never implemented / are we really sure nobody uses the shared-cgroup "feature" we have at the moment?

I found traces of this when working on #3136, and yes, it seems this was never implemented. Looking into moby/moby#18654, it seems a common cgroup parent for containers was suggested instead (ultimately, this is what the "pod" concept, introduced later, does).

Yet I don't know if anyone actually shares a cgroup between multiple containers in the wild, and if anyone does, this can be a breaking change. I still think the check is needed, as such setups introduce a number of issues, outlined above as well as in #3132, and are therefore broken.

I guess unless we try this out, we'll never know whether this is used in the wild, and how often. If it turns out to be used often, we can revert; if it's used rarely, we can add a mechanism to skip the check (an env var, a flag file, or something).

@cyphar (Member) commented Sep 15, 2021

@kolyshkin

Yet I don't know if anyone actually shares a cgroup for multiple containers in the wild, and if yes, it can be a breaking change. I still think it is needed as such setups introduce a number of issues, outlined above as well as in #3132, and are therefore broken.

Yeah, I agree -- the fact that we haven't gotten any bug reports about how broken it is leads me to believe nobody is using it.

@kolyshkin (Contributor, Author) commented:

This has finally been split into #3214, #3215, #3216, #3217 and #3223, so this one can be closed now :)

@kolyshkin kolyshkin closed this Sep 23, 2021