
✨ CAPD: set container restartPolicy to unless-stopped #5021

Merged
merged 1 commit on Jul 28, 2021

Conversation

stmcginnis
Contributor

What this PR does / why we need it:

If the docker engine is restarted, either through a service restart or a
full system reboot, the containers created by CAPD are left in a stopped
state.

This adds the restartPolicy setting to the Docker container creation so
that containers will be automatically restarted on service restart
unless the user has explicitly stopped the container prior to the
restart.

Which issue(s) this PR fixes:
Fixes #5020
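
Assuming a recent Docker Go SDK (one where ContainerCreate also takes a platform argument), here is a minimal standalone sketch of setting this policy at container creation; the client setup, image, and container name are illustrative placeholders, not CAPD's actual code:

package main

import (
    "context"
    "fmt"

    dockercontainer "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
)

func main() {
    ctx := context.Background()

    // Connect to the local Docker daemon using environment defaults.
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        panic(err)
    }

    // Create (but do not start) a container that will come back after a
    // daemon restart unless it was manually stopped beforehand.
    resp, err := cli.ContainerCreate(ctx,
        &dockercontainer.Config{Image: "kindest/node:v1.21.2"}, // illustrative image
        &dockercontainer.HostConfig{
            RestartPolicy: dockercontainer.RestartPolicy{Name: "unless-stopped"},
        },
        nil, // networking config
        nil, // platform
        "capd-restart-example") // illustrative name
    if err != nil {
        panic(err)
    }
    fmt.Println("created container", resp.ID)
}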

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 27, 2021
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 27, 2021
Member

@fabriziopandini fabriziopandini left a comment


@stmcginnis, thanks for this PR, getting CAPD clusters to survive restarts is super nice!

NetworkMode:   dockercontainer.NetworkMode(runConfig.Network),
Tmpfs:         runConfig.Tmpfs,
PortBindings:  nat.PortMap{},
RestartPolicy: dockercontainer.RestartPolicy{Name: "unless-stopped"},
Member

Q: what is the CAPD behaviour when a container is stopped? Is it going to fail (and possibly surface an error), or is it going to recreate the container?

Contributor Author

This may depend on when and how the stop is done.

If the entire service is stopped, then everything goes down and is happy when it starts back up.

Testing locally, once everything is in a steady state it looks like things are still happy if just the MachineDeployment containers are stopped while others are left running. I think this may be because we don't actually look at the Status of a container once it's been found, even though we do set the actual status here:

for _, cntr := range containers {
    name := cntr.Name
    cluster := clusterLabelKey
    image := cntr.Image
    status := cntr.Status
    visit(cluster, types.NewNode(name, image, "undetermined").WithStatus(status))
}

Stopping the controlplane container does result in these errors in the logs:

capi-kubeadm… │ E0727 19:51:03.476496       8 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: context deadline exceeded

But again we just keep going.

I would imagine if a container is stopped at some point during initial deployment that it could cause an issue.

It's a really good question that we should probably think about. We may want to try to reconcile state if we find one of the containers has an "exited" status. Let me know if you think I should file an issue for that. I think it's a separate concern as it's a behavior we have today even without this change.
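
As a rough sketch of that idea (a hypothetical helper, not existing CAPD code; it assumes the same Name/Status fields used in the loop above, and that Docker's human-readable status string starts with "Exited" for stopped containers, e.g. "Exited (0) 5 minutes ago"):

package capd // hypothetical package, for illustration only

import (
    "fmt"
    "strings"
)

// containerInfo mirrors just the fields used in the loop above (hypothetical).
type containerInfo struct {
    Name   string
    Status string
}

// reportStopped flags containers that exist but have exited, so the condition
// can be surfaced to the user instead of being silently ignored.
func reportStopped(containers []containerInfo) error {
    for _, cntr := range containers {
        if strings.HasPrefix(cntr.Status, "Exited") {
            return fmt.Errorf("container %q already exists but is stopped: %s", cntr.Name, cntr.Status)
        }
    }
    return nil
}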

So at least with this, if someone has not manually stopped a container, a restart will get it all back. The other two restartPolicy options are always and on-failure. on-failure would only restart the container if it exited with a non-zero exit code, so it would not change the behavior if someone were to manually stop the container. And always would restart the container on a Docker Engine restart even if it had been manually stopped.
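
For reference, the three non-default policies as they would be expressed with the SDK types used above, with behavior as described in the Docker documentation:

package policies // illustration only

import dockercontainer "github.com/docker/docker/api/types/container"

var (
    always        = dockercontainer.RestartPolicy{Name: "always"}         // comes back after a daemon restart even if it was manually stopped
    onFailure     = dockercontainer.RestartPolicy{Name: "on-failure"}     // restarted only when the container exits with a non-zero code
    unlessStopped = dockercontainer.RestartPolicy{Name: "unless-stopped"} // comes back after a daemon restart unless it was manually stopped
)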

@sbueringer
Member

/lgtm

I also think unless-stopped is the ideal policy for our case:

[Screenshot of the Docker documentation describing the available restart policies]

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 28, 2021
@fabriziopandini
Member

Thanks for the detailed explanation
/lgtm
/approve

Let's file an issue to properly surface "a container already exists, but it is stopped"; I don't think we should automatically remediate, but let's make sure the user is not required to look at the controller logs to find out what is going on.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Successfully merging this pull request may close these issues.

CAPD containers are stopped after a Docker Engine restart
4 participants