Implement create and start #827
Nice.
@julz thanks, please take the time to play with it; runc is fully operational. Let's make sure that it will fit and work with all of our needs, but I think this implementation is simple and clean and will work fine.
status, err := startContainer(context, spec)
if err != nil {
if err := container.Signal(syscall.SIGCONT); err != nil {
This probably needs at least a container.Status() guard like you added to delete in 572055d, since you only want to send SIGCONT to runtime code (and not send it after user-specified code has been executed). That's not quite enough though, since “the pending signal set is preserved across an execve(2)”. So “check the state and send SIGCONT if it was ‘created’” is going to be racy, and that race may result in user code seeing the extra SIGCONTs. A more robust solution would lock a resource (setting a flag in the state registry? Hold a Unix socket open?) when triggering code execution to avoid racing between two ‘start’ calls.
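One way to picture the "lock a resource" suggestion is an atomically created marker file, so that only one of two racing 'start' calls goes on to send SIGCONT. This is a minimal sketch under assumptions, not what runc does: the `claimStart` and `demoClaims` helpers and the `started` file name are hypothetical, and a real runtime might keep the flag in its own state registry instead.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// claimStart atomically creates a marker file in stateDir. O_EXCL makes the
// create fail if the file already exists, so at most one caller wins.
// The "started" file name is a hypothetical choice for this sketch.
func claimStart(stateDir string) (bool, error) {
	path := filepath.Join(stateDir, "started")
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if os.IsExist(err) {
		return false, nil // another 'start' already claimed this container
	}
	if err != nil {
		return false, err
	}
	return true, f.Close()
}

// demoClaims simulates two 'start' calls against a fresh state directory and
// reports which of them would be allowed to send SIGCONT.
func demoClaims() (first, second bool, err error) {
	dir, err := os.MkdirTemp("", "state")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	if first, err = claimStart(dir); err != nil {
		return false, false, err
	}
	second, err = claimStart(dir)
	return first, second, err
}

func main() {
	first, second, err := demoClaims()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("first start may send SIGCONT:", first)
	fmt.Println("second start may send SIGCONT:", second)
}
```

Only the caller that gets `true` would signal the init process; the loser can report that the container was already started.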
We can also simply reset the signal handler before doing the execve.
On Mon, May 16, 2016 at 09:00:22PM -0700, Kenfe-Mickaël Laventure wrote:
> We can also simply reset the signal handler before doing the execve.

That happens automatically, no? Also from signal(7):

“During an execve(2), the dispositions of handled signals are reset to the default; the dispositions of ignored signals are left unchanged.”

What needs to happen is that after a SIGCONT is received, you block (somehow) ‘create’ from sending further SIGCONTs, then consume any SIGCONTs from the pending queue, and then execve the user code.
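The "wait, stop listening, drain" part of that sequence can be sketched in Go roughly as follows. This is an illustration under assumptions rather than the PR's code: `waitThenDrain` and `demoStart` are hypothetical names, and the process signals itself only so the sketch runs standalone. It does not close the race described above, because a SIGCONT sent after the drain could still reach user code; that is why senders also need to be blocked.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// waitThenDrain blocks until one SIGCONT arrives on ch or the timeout expires.
// It then stops delivery to ch and discards anything still queued, reporting
// whether a SIGCONT arrived and how many extra signals were discarded.
func waitThenDrain(ch chan os.Signal, timeout time.Duration) (started bool, extra int) {
	select {
	case <-ch:
		started = true
	case <-time.After(timeout):
		signal.Stop(ch)
		return false, 0
	}
	signal.Stop(ch) // once Stop returns, ch receives no more signals
	for {
		select {
		case <-ch:
			extra++
		default:
			return started, extra
		}
	}
}

// demoStart plays both roles in one process: it registers for SIGCONT like a
// waiting init process, then signals itself the way 'start' would.
func demoStart() (bool, error) {
	ch := make(chan os.Signal, 8)
	signal.Notify(ch, syscall.SIGCONT)
	if err := syscall.Kill(os.Getpid(), syscall.SIGCONT); err != nil {
		signal.Stop(ch)
		return false, err
	}
	started, _ := waitThenDrain(ch, 5*time.Second)
	return started, nil
}

func main() {
	started, err := demoStart()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("received SIGCONT:", started)
	// A real init process would call syscall.Exec on the user's program here.
}
```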
So if the host rebooted, all created containers would be gone? Is that acceptable?
On Fri, May 20, 2016 at 01:54:26AM -0700, Qiang Huang wrote:
That's fine for me (it's how all my other processes work ;). If you |
572055d to 627c980
@hqhq I think that is okay as this isn't the same as docker create. As long as we are clear in the spec, it should be fine.
627c980 to 285afd3
One of the use-cases for hooks was to customize mounts on demand. Would it be possible to not pivot root as part of create and switch root only on start?
On Fri, May 20, 2016 at 03:31:07PM -0700, Vish Kannan wrote:
I don't think that's a good idea, because you may be using the I'm still not clear on why the pre-pivot mounts need to be dynamic, |
285afd3 to 72b4127
FYI: this change makes us require Go 1.6, because previous versions of Go would not let you handle SIGCONT; they just ignore it and block forever.
Device: "mqueue",
Flags: defaultMountFlags,
},
/*
Was this temporarily commented out?
bug
This is from the issue of mqueue not working on Debian kernels on the CI in userns.
// ContainerDestroyed - Container no longer exists,
// ConfigInvalid - config is invalid,
// ContainerPaused - Container is paused,
// Systemerror - System error.
SystemError
The comment for Start should probably be changed to describe why it's waiting on a signal.
I see that all tests were changed so they use the old version. Maybe a couple of tests for create/start are needed.
would it be bad if we had the |
@duglin No, because not all containers use a TTY.
@crosbymichael how do I do a |
@duglin does your spec have |
ah that was it - thanks
:-) we could still have it default to that value when terminal is true
@crosbymichael I see a timeout on CI and somehow it hangs now :/
41a5d5a to 1f7eef6
@cyphar thanks!
Tested. LGTM.
One of the biggest issues I can see right off the bat is that As an aside, it's quite annoying that you can't even specify /cc @crosbymichael |
@cyphar that is unrelated to this PR. I really don't think you all understand what |
@crosbymichael explained on IRC that |
This gives us a more portable way to discover the container exit code (vs. requiring callers to use subreapers [1] or other platform-specific approaches which require knowledge of the runtime implementation). [1]: opencontainers/runc#827 (comment) Signed-off-by: W. Trevor King <[email protected]>
I added this as an option in 5033c59 (Add an --id option to 'start', 2015-09-15), because some callers might want to leave ID generation to the runtime. When there is a long-running host process waiting on the container process to perform cleanup, the runtime-caller may not need to know the container ID. However, runC has been requiring a user-specified ID since [1], and the coming create/start split will follow the early-exit 'create' from [2], so require an ID here. We can revisit this if we regain a long-running 'create' process. You can create a config that adds no isolation vs. the runtime namespace or completely joins another set of existing namespaces. It seems odd to call that a new "container", but the ID is really more of a process ID, and less of a container ID. The "container" phrasing is just a useful hint that there might be some isolation going on. And we're always creating a new "container process" with 'start' (which will become 'create'). [1]: opencontainers/runc#541 opencontainers/runc@a7278cad (Require container id as arg1, 2016-02-08, opencontainers/runc#541) [2]: opencontainers/runc#827 Summary: Implement create and start Signed-off-by: W. Trevor King <[email protected]>
Catch up with opencontainers/runtime-spec@be594153 (Split create and start, 2016-04-01, opencontainers/runtime-spec#384). One benefit of the early-exit 'create' is that the exit code does not conflate container process exits with "failed to setup the sandbox" exits. We can take advantage of that and use non-zero 'create' exits to allow stderr writing (so the runtime can log errors while dying without having to successfully connect to syslog or some such). I still like the long-running 'create' API because it makes collecting the exit code easier. I've proposed an 'event' operation [1] which would provide a convenient created trigger. With 'event' in place, we don't need the 'create' process exit to serve as that trigger, and could have a long-running 'create' that collects the container process exit code using the portable waitid() family. But the consensus after the 2016-07-13 meeting was to table that while we land docs for the runC API [2], and runC has an early-exit create [3]. The "Callers MAY block..." wording is going to be hard to enforce, but with the runC model, clients rely on the command exits to trigger post-create and post-start activity. The longer the runtime hangs around after completing its action, the laggier those triggers will be. The "MUST NOT attempt to read from its stdin" means a generic caller can safely exec the command with a closed or null stdin, and not have to worry about the command blocking or crashing because of that. The stdout spec for start/delete is more lenient, because runtimes are unlikely to change their behavior because they are unable to write to stdout. If this assumption proves troublesome, we may have to tighten it up later. The ptrace idea in this commit is from Mrunal [4].
[1]: opencontainers/runtime-spec#508 Subject: runtime: Add an 'event' operation for subscribing to pushes [2]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2016/opencontainers.2016-07-13-17.03.log.html#l-15 [3]: opencontainers/runc#827 Summary: Implement create and start [4]: http://ircbot.wl.linuxfoundation.org/eavesdrop/%23opencontainers/%23opencontainers.2016-07-13.log.html#t2016-07-13T18:58:54 Signed-off-by: W. Trevor King <[email protected]>
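As a rough illustration of the caller side described in that commit message, the sketch below runs a command with a null stdin, captures stderr, and reads the exit code when the command returns. It is a sketch under assumptions: `runQuiet` is a hypothetical helper, and `sh -c` stands in for a runtime 'create' that fails during setup, so no real runtime is invoked.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runQuiet runs a command with no stdin and returns its exit code and whatever
// it wrote to stderr. A non-nil error means the command could not be run at all.
func runQuiet(name string, args ...string) (int, string, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin = nil // os/exec connects a nil Stdin to the null device
	var stderr bytes.Buffer
	cmd.Stderr = &stderr

	err := cmd.Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The command ran and exited non-zero; stderr carries its error message.
		return exitErr.ExitCode(), stderr.String(), nil
	}
	if err != nil {
		return -1, "", err
	}
	return 0, stderr.String(), nil
}

func main() {
	// Stand-in for a 'create' call that fails while setting up the sandbox.
	code, msg, err := runQuiet("sh", "-c", "echo 'setup failed' >&2; exit 3")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("exit=%d stderr=%q\n", code, msg)
}
```

Because the command's exit is the trigger, the caller learns about a setup failure as soon as the process returns and can surface the stderr text directly.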
config-linux: Use the implicit link name shortcut
Address a previous TODO. And now that we are using --bundle, we no longer need to set cmd.Dir. The TODO mentions a lack of runc support, but runc supports --bundle since opencontainers/runc@3fe7d7f3 (Add create and start command for container lifecycle, 2016-05-13, opencontainers/runc#827).
Address a previous TODO. And now that we are using --bundle, we no longer need to set cmd.Dir. The TODO mentions a lack of runc support, but runc supports --bundle since opencontainers/runc@3fe7d7f3 (Add create and start command for container lifecycle, 2016-05-13, opencontainers/runc#827). Signed-off-by: W. Trevor King <[email protected]>
This avoids a panic for containers that do not set Process. And even if Process was set, there is no reason to require the executable to be available *at create time* [1]. Subsequent activity could be scheduled to get a binary in place at the configured location before 'start' is called. [1]: opencontainers#827 (comment) Signed-off-by: W. Trevor King <[email protected]>
This implements create and start in runc without the need for a unix socket or other complexity. It implements it by blocking the init process waiting on a SIGCONT before the user's process is started.

This does not remove hooks; that can be done separately. This also updates the libcontainer stats to correctly report if the container is created vs running (user code).

It does not bind mount namespaces; if you want namespaces to be bind mounted, you can write code to bind them after create returns and before calling start.

It retains the current functionality of start today by adding a runc run command that does the same workflow as today.

Closes #506
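The lifecycle this description implies (create leaves the container in a created state, start moves it to running user code, and run does both in one step) can be modelled with a small state sketch. All names here (`Status`, `Container`, `Create`, `Start`, `Run`, `ErrNotCreated`) are hypothetical and chosen for illustration; they are not libcontainer's API.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a hypothetical two-state model of the lifecycle described above.
type Status string

const (
	Created Status = "created" // init is set up and waiting; no user code yet
	Running Status = "running" // user code has been started
)

// ErrNotCreated is returned when Start is called outside the created state.
var ErrNotCreated = errors.New("container is not in the created state")

// Container is a stand-in record holding only an ID and a status.
type Container struct {
	ID     string
	Status Status
}

// Create returns a container that is set up but not yet running user code.
func Create(id string) *Container {
	return &Container{ID: id, Status: Created}
}

// Start moves a created container to running and rejects any other state.
func (c *Container) Start() error {
	if c.Status != Created {
		return ErrNotCreated
	}
	c.Status = Running
	return nil
}

// Run mirrors the one-step workflow: create immediately followed by start.
func Run(id string) (*Container, error) {
	c := Create(id)
	if err := c.Start(); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	c := Create("demo")
	fmt.Println("after create:", c.Status)
	// Work such as bind mounting namespaces would fit in this window.
	if err := c.Start(); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println("after start:", c.Status)
	fmt.Println("second start:", c.Start())
}
```

The window between `Create` and `Start` is where a caller could do the extra setup the description mentions before user code begins.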