Separate container sandbox lifecycle from that of the processes inside it #299

vishh · 2016-01-12T23:13:35Z

As of now, the runtime lifecycle states that there is a Start action which results in creation of a container sandbox and spawning the first process inside the container.
It would be better to separate lifecycle of the container sandbox from that of container processes.
To be more precise, I'm proposing introducing the following oci actions:

Create - This action will create a container sandbox. On Linux, this would be all cgroups and namespaces, excepting pid namespace (assuming all cgroups and namespaces were requested as part of the Spec).
Start - This action will spawn a new process in the existing container sandbox. On Linux, a new pid namespace will be created for the first process that is being spawned. Subsequent processes will run the same pid namespace.
Cleanup - This action will kill all processes in a container sandbox. In the case of Linux, this will result in loss of pid namespace.
Delete - This action will delete the container sandbox.

This split will obviate the need for supporting hooks. Users are free to run hooks outside of the runtime as they please.

The existing Spec needs to be split into a Sandbox configuration and a Process configuration.
The process configuration can be reused for the Exec use-case.

FYI: This was discussed during the OCI meeting on 1/12/16 and everyone present agreed to this proposal.

We need to prototype this separation in one or more Operating Systems before requiring it in the Spec.

Input from runtime authors on non-linux platforms will be helpful here.

The text was updated successfully, but these errors were encountered:

ashahab-altiscale · 2016-01-12T23:23:18Z

I'd suggest the prototype should test with usernamespaces, especially in light of bugs like this one:
opencontainers/runc#101
Related patch: https://github.com/opencontainers/runc/pull/105/files

The init process of the container needs to be able to join a pre-created user namespace.
If this does not work, it would not possible to create any namespaces, as user namespaces require all other namespaces to be children of the user namespace.

vishh · 2016-01-12T23:29:26Z

@ashahab-altiscale: As long as user namespaces don't require a process to pin it, I don't see any issue here.

wking · 2016-01-12T23:52:51Z

On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:

As of now, the runtime lifecycle states that there is a Start
action which results in creation of a container sandbox and spawning
the first process inside the container. It would be better to
separate lifecycle of the container sandbox from that of container
processes.

This seems like a high-level change. Should discussion happen on the
list 1? There are earlier notes on this split approach in 2, as
well as some previous discussion along these lines in #226. If we
decide to have this conversation here (instead of the list or earlier
issue), I can write up a summary of past points.

This split will obviate the need for supporting hooks. Users are
free to run hooks outside of the runtime as they please.

Yeah, if we have a separate create and exec, we can drop hooks and
close #265 3.

The existing Spec needs to be split into a Sandbox configuration and
a Process configuration.

Not necessarily. ccon-ced (discussed in #226) wraps a runtime that
uses a single spec to do this. To pull that off, you need a spec that
makes it easy to turn on only the functionality you want (e.g. “create
namespaces” vs. “join, [create a new PID namespace,] and exec a
process”).

 Subject: Documenting container lifecycles
 Date: Mon, 14 Sep 2015 20:47:17 -0700
 Message-ID: <[email protected]>

jonboulle · 2016-01-12T23:56:26Z

s/Support/Separate in issue title

It would be better to separate lifecycle of the container sandbox from that of container processes.

(emphasis mine)
more justification/use cases here would be useful, at the very least for posterity

This split will obviate the need for supporting hooks.

To be clear, are you suggesting removing hooks entirely from the spec?

Subsequent processes will run the same pid namespace.

It makes me pretty uneasy to have the start operation be nondeterministic - e.g. if I run two start commands simultaneously it's a race which one's process becomes PID1 (which has non-trivial implications).

wking · 2016-01-13T00:06:02Z

On Tue, Jan 12, 2016 at 03:56:28PM -0800, Jonathan Boulle wrote:

Subsequent processes will run the same pid namespace.

It makes me pretty uneasy to have the start operation be
nondeterministic - e.g. if I run two start commands simultaneously
it's a race which one's process becomes PID1 (which has non-trivial
implications).

Yeah, -1 to having PID namespace be implicit. If you run starts
(parallel or not) that both create PID namespaces, then they both get
to be PID 1 in their own PID namespace. If you run a start (parallel
or not) that does not create a PID namespace, it does not get to be
PID 1. So I don't see a problem with racing if we have explicit
PID-namespace creation.

On naming, see ‘exec’ vs. ‘start’ here in opencontainers/runc#210.

vbatts · 2016-01-13T00:51:35Z

On Tue, Jan 12, 2016 at 6:52 PM W. Trevor King [email protected]
wrote:

On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:

As of now, the runtime lifecycle states that there is a Start
action which results in creation of a container sandbox and spawning
the first process inside the container. It would be better to
separate lifecycle of the container sandbox from that of container
processes.

This seems like a high-level change. Should discussion happen on the
list [1]?

This is because we're sitting in the face-2-face and working through this
presently.

wking · 2016-01-13T01:00:05Z

On Tue, Jan 12, 2016 at 04:51:36PM -0800, Vincent Batts wrote:

On Tue, Jan 12, 2016 at 6:52 PM W. Trevor King wrote:

On Tue, Jan 12, 2016 at 03:13:37PM -0800, Vish Kannan wrote:

As of now, the runtime lifecycle states that there is a Start
action which results in creation of a container sandbox and
spawning the first process inside the container. It would be
better to separate lifecycle of the container sandbox from that
of container processes.

This seems like a high-level change. Should discussion happen on
the list [1]?

This is because we're sitting in the face-2-face and working through
this presently.

Yeah, that's fine. But when dumping from the face-to-face into the
archives, I think you should dump into a list post, not into a new
issue (like this one).

wking · 2016-01-13T16:32:36Z

On Tue, Jan 12, 2016 at 03:29:28PM -0800, Vish Kannan wrote:

@ashahab-altiscale: As long as user namespaces don't require a
process to pin it, I don't see any issue here.

The issue is the order in which you join or create namespaces, which
impacts permissions. For example, an unprivileged user can join a
user namespace they created earlier, and then create a new PID
namespace underneath. But they can't create a new PID namespace
before joining a user namespace they created (or creating a new user
namespace). There are a number of comments to this effect in
opencontainers/runc#105, and also some notes in the “Interaction of
user namespaces and other types of namespaces” section of
user_namespaces(7). For a command line example:

Make a new user namespace:
—
$ unshare --user --map-root-user sh
sh-4.3# echo $$
21171

In a different terminal, trying to create a new PID namespace and join
the user namespace fails:
—
$ unshare --pid nsenter --user=/proc/21171/ns/user --preserve-credentials sh
unshare: unshare failed: Operation not permitted

But trying to join the user namespace and then create a new PID
namespace succeeds:
—
$ nsenter --user=/proc/21171/ns/user --preserve-credentials unshare --pid sh
sh-4.3#

I think @ashahab-altiscale is just suggesting we avoid that issue by
proper namespace join/create ordering when implementing create/exec
tooling.

ashahab-altiscale · 2016-01-13T16:42:13Z

@wking Exactly. The order of namespace creation is important.

duglin · 2016-01-22T15:01:50Z

s/Cleanup/Stop/ but aside from that I like the idea of the split.

duglin · 2016-01-22T15:11:40Z

re: multiple "starts": based on what I've heard I don't think multiple concurrent "starts" would be allowed. I think I'm hearing that people still want a single main process in the container, which means once you've started it your only choice is to use "exec" to get a 2nd process. Once the main proc ended then you can issue "start" again tho.

wking · 2016-01-22T16:13:47Z

On Fri, Jan 22, 2016 at 07:11:42AM -0800, Doug Davis wrote:

re: multiple "starts": based on what I've heard I don't think
multiple concurrent "starts" would be allowed. I think I'm hearing
that people still want a single main process in the container, which
means once you've started it your only choice is to use "exec" to
get a 2nd process. Once the main proc ended then you can issue
"start" again tho.

Is the only difference between ‘start’ and ‘exec’ whether you're
creating a new PID namespace (and optionally mounting a new /proc?) or
inheriting/joining an existing one? It sounds like it might be, and
in that case, I'd rather just use our existing namespace create/join
schema and have a single ‘exec’ command. Trying to bake the
distinction into two separate commands seems like more trouble than
the usability gain is worth.

Or are we just splitting commands for portability (e.g. to allow
hypervisor and microkernel systems where it's difficult/impossible for
the host to execute a new process in an existing
PID-namespace/container)?

duglin · 2016-01-22T16:21:43Z

Changing what 'exec' will do based on whether its the first call is a bit non-deterministic. Imagine I do "runc exec" thinking it'll generate a secondary proc and as such I have different management logic for it. If it turns out it ends up being the main proc (because I didn't know the main proc just died) I might end up with very different (incorrect) results than what I expected.

wking · 2016-01-22T17:52:23Z

On Fri, Jan 22, 2016 at 08:21:45AM -0800, Doug Davis wrote:

Changing what 'exec' will do based on whether its the first call is
a bit non-deterministic.

Yeah, I think we're all on board with that being a bad idea. I'm
suggesting changing what ‘exec’ will do based on explicit JSON
configuration (e.g. if linux.namespaces.pid.path is true or
false/unset). That's how we handle create-or-join currently with the
“just ‘start’” approach.

vishh · 2016-01-22T20:03:26Z

+1 on splitting the init process from secondary process.

On Fri, Jan 22, 2016 at 9:52 AM W. Trevor King [email protected]
wrote:

On Fri, Jan 22, 2016 at 08:21:45AM -0800, Doug Davis wrote:

Changing what 'exec' will do based on whether its the first call is
a bit non-deterministic.

Yeah, I think we're all on board with that being a bad idea. I'm
suggesting changing what ‘exec’ will do based on explicit JSON
configuration (e.g. if linux.namespaces.pid.path is true or
false/unset). That's how we handle create-or-join currently with the
“just ‘start’” approach.

—
Reply to this email directly or view it on GitHub
#299 (comment)
.

crosbymichael · 2016-01-22T20:06:49Z

What should the full UI be for this change?

cd redis-bundle/

# create the container
container create redis

# start the init process defined in the bundle
container start redis

# start additional processes
contianer exec redis -- bash

# kill the container

container kill redis

# delete the container
container delete redis

??

vishh · 2016-01-22T20:09:21Z

Yeah. That's the UI I'm thinking as well.

On Fri, Jan 22, 2016 at 12:06 PM Michael Crosby [email protected]
wrote:

What should the full UI be for this change?

cd redis-bundle/

create the container

container create redis

start the init process defined in the bundle

container start redis

start additional processes

contianer exec redis -- bash

kill the container

container kill redis

delete the container

container delete redis

??

—
Reply to this email directly or view it on GitHub
#299 (comment)
.

duglin · 2016-01-22T20:09:22Z

yes but I'd prefer "container stop redis" since "stop" feels like a better opposite/book-end to "start" than "kill". Yes I know under the covers we send a kill signal, but that's a detail the end user doesn't need to know or care about.

crosbymichael · 2016-01-22T20:10:04Z

@vishh great minds think alike ;)

wking · 2016-01-22T20:11:42Z

On Fri, Jan 22, 2016 at 12:09:24PM -0800, Doug Davis wrote:

Yes I know under the covers we send a kill signal, but that's a
detail the end user doesn't need to know or care about.

Unless they're putting in a SIGTERM handler for a graceful exit, and
are wondering why ‘stop’ isn't triggering that ;). So I think we do
want to be explicit about which singnals the runtime MUST use to
achieve a given action.

duglin · 2016-01-22T20:22:15Z

I'm distinguishing between runc stop <id> and runc kill <id> <signal #>

wking · 2016-01-22T20:29:47Z

On Fri, Jan 22, 2016 at 12:06:51PM -0800, Michael Crosby wrote:

create the container

container create redis

start the init process defined in the bundle

container start redis

start additional processes

contianer exec redis -- bash

Because start/exec will use some currently-configured settings that
aren't needed at create time (e.g. config.json's process,
runtime.json's linux.namespaces.pid, and any config around mounting a
PID-specific /proc), I recommend splitting our config entries between
a create.json (only the stuff needed by ‘create’) and exec.json (only
the stuff needed by start/exec). To give an explicit example of my
suggestion in 1 for the above, I'd have something like:

$ cat exec.json
{
"process": {
"terminal": false,
"user": {
"uid": 1000,
"gid": 1000
},
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
],
"cwd": "/",
"args": ["redis-server", "--bind", "0.0.0.0"]
},
"linux": {
"namespaces": [
{
"type": "pid"
}
]
},
"hooks": {
"prestart": [
{
"args": ["cgroup-manager", "add"]
}
]
}
}

to handle and initial:

$ container exec redis

And a:

$ cat bash.json
{
"process": {
"terminal": true,
"user": {
"uid": 1000,
"gid": 1000
},
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
],
"cwd": "/",
"args": ["bash"]
},
"linux": {
"namespaces": [
{
"type": "pid",
"path": "/proc/1234/ns/pid",
}
]
}
}

to handle a subsequent:
—
$ container exec --config bash.json redis

You can always write separate ‘start’ and ‘exec’ wrappers on top of an
interface like that (which would work with any OCI-compliant runtime).
Of course, with sufficient opt-out flexibility, you'd be able to write
wrappers for create, start, and exec all based on a single underlying
runtime command (which is how ccon-ced works 2).

If we go with the separate, runtime-level start/exec approach, I think
we'll have trouble adding flags for all the options like “do you want
this to go in the container cgroups?”, which you might want in some
cases, but not in others.

wking · 2016-01-22T20:30:38Z

On Fri, Jan 22, 2016 at 12:22:17PM -0800, Doug Davis wrote:

I'm distinguishing between runc stop <id> and runc kill <id> <signal #>

If you have the latter (which I prefer), what good is the former?

ashahab-altiscale · 2016-01-22T21:40:52Z

What is the state of the system after a kill/stop with SIGTERM? I imagine the init process and it's children will die. What about the namespaces? Which fds will remain mounted vs being cleaned up? Can I start another init process with the same namespace fds(after a stop/kill)?

wking · 2016-01-22T22:13:24Z

On Fri, Jan 22, 2016 at 01:40:54PM -0800, Abin Shahab wrote:

What is the state of the system after a kill/stop with SIGTERM? I
imagine the init process and it's children will die.

After a SIGTERM they should all exit cleanly. But SIGTERM is
catchable, and one person's “exiting cleanly” is another's “not dying
fast enough”. After SIGKILLing a container's PID 1 (assuming it has a
container-specific PID namespace), they will all be dead.

What about the namespaces? Which fds will remain mounted vs being
cleaned up? Can I start another init process with the same namespace
fds?

Everything except the PID namespace should have been created by the
‘create’ step, so you should be able to start another
PID-namespace-creating process that still uses those non-PID
namespaces (and presumably also reuses cgroups, etc.).

vishh · 2016-01-22T22:34:32Z

Re-using the sandbox should be possible, except that the sandbox might
contain some remnants from the previous run, for example filesystem.

On Fri, Jan 22, 2016 at 2:13 PM, W. Trevor King [email protected]
wrote:

On Fri, Jan 22, 2016 at 01:40:54PM -0800, Abin Shahab wrote:

What is the state of the system after a kill/stop with SIGTERM? I
imagine the init process and it's children will die.

After a SIGTERM they should all exit cleanly. But SIGTERM is
catchable, and one person's “exitting cleanly” is anothers “not dying
fast enough”. After SIGKILLing a container's PID 1 (assuming it has a
container-specific PID namespace), they will all be dead.

What about the namespaces? Which fds will remain mounted vs being
cleaned up? Can I start another init process with the same namespace
fds?

Everything except the PID namespace should have been created by the
‘create’ step, so you should be able to start another
PID-namespace-creating process that still uses those non-PID
namespaces (and presumably also reuses cgroups, etc.).

—
Reply to this email directly or view it on GitHub
#299 (comment)
.

duglin · 2016-01-22T22:44:02Z

I sure hope so :-)

hqhq · 2016-01-25T09:31:18Z

Create - This action will create a container sandbox. On Linux, this would be all cgroups and namespaces, excepting pid namespace (assuming all cgroups and namespaces were requested as part of the Spec).

AFAIK, there are only two APIs which can be used to create new namespaces:

clone
unshare

They either create a new process or called by a process which will be a member of the new namespaces. So how can we create new namespaces without creating new processes?
And create empty cgroups dirs without putting any processes in doesn't make much sense to me, they should be attached to start stage IMHO.

duglin · 2016-01-25T14:24:36Z

See opencontainers/runc#506 we do create a new process during create(), but we hold on to the namespaces (except PID) so that subsequent processes can reuse those other namespaces

wking · 2016-01-25T18:23:03Z

On Mon, Jan 25, 2016 at 06:24:39AM -0800, Doug Davis wrote:

See opencontainers/runc#506

For a more compact reference, see 1 and references to bind mounting
in “The /proc/[pid]/ns/ directory” section of 2.

crosbymichael · 2017-02-28T00:52:22Z

@vishh do you think there are any actionable items left in this issue? We did the create/start/delete split and you have an option to never actually start the container's process but keep adding exec processes after it is created. Because of pid1 and the pid namespace issues on linux I don't think we will ever have a fully robust persistent sandbox on linux without higher level tooling adding to the functionality ( bind mounting namespaces and sharing between containers )

vishh added the features/lifecycle label Jan 12, 2016

vishh added this to the by-1.0.0 milestone Jan 12, 2016

vishh changed the title ~~Support container sandbox lifecycle from that of the processes inside it~~ Separate container sandbox lifecycle from that of the processes inside it Jan 12, 2016

wking mentioned this issue Jan 18, 2016

Support Lifecycle Hooks #20

Closed

vbatts modified the milestones: v0.5.0, by-1.0.0 Mar 16, 2016

wking mentioned this issue Mar 16, 2016

Exec semantics #345

Closed

philips modified the milestones: v0.6.0, v0.5.0 Apr 6, 2016

crosbymichael modified the milestones: v0.6.0, 1.0.0 May 25, 2016

crosbymichael modified the milestones: 1.0.0, 1.1.0 Jun 3, 2016

Separate container sandbox lifecycle from that of the processes inside it #299

Separate container sandbox lifecycle from that of the processes inside it #299

Comments

vishh commented Jan 12, 2016

ashahab-altiscale commented Jan 12, 2016

vishh commented Jan 12, 2016

wking commented Jan 12, 2016

jonboulle commented Jan 12, 2016

wking commented Jan 13, 2016

vbatts commented Jan 13, 2016

wking commented Jan 13, 2016

wking commented Jan 13, 2016

ashahab-altiscale commented Jan 13, 2016

duglin commented Jan 22, 2016

duglin commented Jan 22, 2016

wking commented Jan 22, 2016

duglin commented Jan 22, 2016

wking commented Jan 22, 2016

vishh commented Jan 22, 2016

crosbymichael commented Jan 22, 2016

vishh commented Jan 22, 2016

create the container

start the init process defined in the bundle

start additional processes

kill the container

delete the container

duglin commented Jan 22, 2016

crosbymichael commented Jan 22, 2016

wking commented Jan 22, 2016

duglin commented Jan 22, 2016

wking commented Jan 22, 2016

create the container

start the init process defined in the bundle

start additional processes

wking commented Jan 22, 2016

ashahab-altiscale commented Jan 22, 2016

wking commented Jan 22, 2016

vishh commented Jan 22, 2016

duglin commented Jan 22, 2016

hqhq commented Jan 25, 2016

duglin commented Jan 25, 2016

wking commented Jan 25, 2016

crosbymichael commented Feb 28, 2017