
✨ CAPD: set container restartPolicy to unless-stopped #5021

Merged
merged 1 commit on Jul 28, 2021

Conversation

stmcginnis
Contributor

What this PR does / why we need it:

If the docker engine is restarted, either through a service restart or a
full system reboot, the containers created by CAPD are left in a stopped
state.

This adds the restartPolicy setting to the Docker container creation so
that containers will be automatically restarted on service restart
unless the user has explicitly stopped the container prior to the
restart.

Which issue(s) this PR fixes:
Fixes #5020
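
Assuming a recent Docker Go SDK (one where ContainerCreate also takes a platform argument), here is a minimal standalone sketch of setting this policy at container creation; the client setup, image, and container name are illustrative placeholders, not CAPD's actual code:

package main

import (
    "context"
    "fmt"

    dockercontainer "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
)

func main() {
    ctx := context.Background()

    // Connect to the local Docker daemon using environment defaults.
    cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
    if err != nil {
        panic(err)
    }

    // Create (but do not start) a container that will come back after a
    // daemon restart unless it was manually stopped beforehand.
    resp, err := cli.ContainerCreate(ctx,
        &dockercontainer.Config{Image: "kindest/node:v1.21.2"}, // illustrative image
        &dockercontainer.HostConfig{
            RestartPolicy: dockercontainer.RestartPolicy{Name: "unless-stopped"},
        },
        nil, // networking config
        nil, // platform
        "capd-restart-example") // illustrative name
    if err != nil {
        panic(err)
    }
    fmt.Println("created container", resp.ID)
}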

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 27, 2021
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jul 27, 2021
Member

@fabriziopandini fabriziopandini left a comment


@stmcginnis, thanks for this PR, getting CAPD clusters to survive restarts is super nice!

NetworkMode:   dockercontainer.NetworkMode(runConfig.Network),
Tmpfs:         runConfig.Tmpfs,
PortBindings:  nat.PortMap{},
RestartPolicy: dockercontainer.RestartPolicy{Name: "unless-stopped"},
Member

Q: what is the CAPD behaviour when a container is stopped? Is it going to fail (and possibly surface an error), or is it going to recreate the container?

Contributor Author

This may depend on when and how the stop is done.

If the entire service is stopped, then everything goes down and is happy when it starts back up.

Testing locally, once everything is in a steady state it looks like things are still happy if just the MachineDeployment containers are stopped while others are left running. I think this may be because we don't actually look at the Status of a container once it's been found, even though we do set the actual status here:

for _, cntr := range containers {
    name := cntr.Name
    cluster := clusterLabelKey
    image := cntr.Image
    status := cntr.Status
    visit(cluster, types.NewNode(name, image, "undetermined").WithStatus(status))
}

Stopping the controlplane container does result in these errors in the logs:

capi-kubeadm… │ E0727 19:51:03.476496       8 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1.ClusterRole: failed to list *v1.ClusterRole: context deadline exceeded

But again we just keep going.

I would imagine if a container is stopped at some point during initial deployment that it could cause an issue.

It's a really good question that we should probably think about. We may want to try to reconcile state if we find one of the containers has an "exited" status. Let me know if you think I should file an issue for that. I think it's a separate concern as it's a behavior we have today even without this change.
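
As a rough sketch of that idea (a hypothetical helper, not existing CAPD code; it assumes the same Name/Status fields used in the loop above, and that Docker's human-readable status string starts with "Exited" for stopped containers, e.g. "Exited (0) 5 minutes ago"):

package capd // hypothetical package, for illustration only

import (
    "fmt"
    "strings"
)

// containerInfo mirrors just the fields used in the loop above (hypothetical).
type containerInfo struct {
    Name   string
    Status string
}

// reportStopped flags containers that exist but have exited, so the condition
// can be surfaced to the user instead of being silently ignored.
func reportStopped(containers []containerInfo) error {
    for _, cntr := range containers {
        if strings.HasPrefix(cntr.Status, "Exited") {
            return fmt.Errorf("container %q already exists but is stopped: %s", cntr.Name, cntr.Status)
        }
    }
    return nil
}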

So at least with this, if someone has not manually stopped a container, a restart will get it all back. The other two restartPolicy options are always and on-failure. on-failure would only restart the container if it exited with a non-zero exit code, so it would not change the behavior if someone were to manually stop the container. And always would restart the container on a Docker Engine restart even if it had been manually stopped.
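
For reference, the three non-default policies as they would be expressed with the SDK types used above, with behavior as described in the Docker documentation:

package policies // illustration only

import dockercontainer "github.com/docker/docker/api/types/container"

var (
    always        = dockercontainer.RestartPolicy{Name: "always"}         // comes back after a daemon restart even if it was manually stopped
    onFailure     = dockercontainer.RestartPolicy{Name: "on-failure"}     // restarted only when the container exits with a non-zero code
    unlessStopped = dockercontainer.RestartPolicy{Name: "unless-stopped"} // comes back after a daemon restart unless it was manually stopped
)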

@sbueringer
Member

/lgtm

I also think unless-stopped is the ideal policy for our case:

[Screenshot of the Docker documentation describing the available restart policies]

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 28, 2021
@fabriziopandini
Member

Thanks for the detailed explanation
/lgtm
/approve

Let's file an issue to properly surface "a container already exists, but it is stopped"; I don't think we should automatically remediate, but let's make sure the user is not required to look at the controller logs to find out what is going on.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Successfully merging this pull request may close these issues.

CAPD containers are stopped after a Docker Engine restart
4 participants