This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Pods Terminating forever due to Docker 17.09-ce Bug #1135

Closed
jorge07 opened this issue Feb 15, 2018 · 18 comments
Labels
documentation, kind/bug, lifecycle/rotten

Comments

@jorge07
Contributor

jorge07 commented Feb 15, 2018

We have a cluster created with v0.9.9, k8s 1.8.4, and Docker 17.09.1-ce.

Steps to Reproduce

Log into a host and run:

docker run -it ubuntu /bin/bash
root@943b8935e38e:/# exit
exit

[nothing happens]

docker ps still shows the ubuntu container
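
For reference, a quick way to confirm which Docker the node is running and to see the stuck container (a sketch, run on the affected host; the image is just the ubuntu one from the repro above):

# Confirm the Docker version actually in use on the node
docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'

# The exited container should be gone, but on an affected node it is still listed here
docker ps --format 'table {{.ID}}\t{{.Image}}\t{{.Status}}'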

Related

moby/moby#33820

It looks like this Docker version is not compatible with k8s yet.

How can I find the correct CoreOS AMI with the validated Docker version, 17.03.2?

Thanks in advance
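
(One way to answer this, as a sketch: list the Container Linux AMIs published in your region with the AWS CLI and cross-check each release's Docker version against the Container Linux release notes. The owner ID below is CoreOS's public AMI account and the name filter is an assumption about their naming scheme; adjust both if they don't match.)

aws ec2 describe-images \
  --owners 595879546273 \
  --filters "Name=name,Values=CoreOS-stable-*" \
  --query 'Images[].[Name,ImageId,CreationDate]' \
  --output table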

@cknowles
Contributor

Glad someone else reproduced this. I've also tried k8s 1.9.x and have so far been unable to get a stable dev cluster, partly due to this problem.

@jorge07
Contributor Author

jorge07 commented Feb 15, 2018

CoreOS releases jump the Docker version from 1.12 to 17.09, so 17.03, the only version validated for k8s 1.8, is not available.

@jorge07
Contributor Author

jorge07 commented Feb 16, 2018

Downgrading to 1.12 fixes the issue:

-#amiId: ""
+amiId: "ami-1a7de360"
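
(A sketch of rolling the pinned AMI out after editing cluster.yaml; the exact kube-aws subcommands and flags depend on the kube-aws version in use, so verify against your setup.)

kube-aws validate   # check that the edited cluster.yaml still renders cleanly
kube-aws update     # replace nodes so new instances boot from the pinned AMI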

@jorge07
Contributor Author

jorge07 commented Feb 20, 2018

Related to #1106

@cknowles
Contributor

I was looking at this further for some time but didn't get far in diagnosing the issue. I know @mumoshu has been running at least a dev cluster on the latest Container Linux AMI with Docker 17.09, and that's been stable. For the last month or so I've been unable to get a healthy cluster in dev that I can promote to other environments. I'm considering also downgrading for now.

Related comments:
#728 (comment)
#941 (comment)

@mumoshu
Contributor

mumoshu commented Feb 23, 2018

So I basically believe that:

  • We should recommend that users stick to older AMIs, or use a custom AMI with the old Docker, for K8s 1.8.x
  • While recommending that users just move to K8s 1.9.x

And for @c-knowles's case, we should just make sure K8s 1.9.x is compatible with the newer Docker?

@c-knowles Would you mind sharing a detailed configuration of your cluster, so that we can diagnose the root cause(s) together with folks from upstream?

@mumoshu mumoshu added the kind/bug and documentation labels Feb 23, 2018
@cknowles
Contributor

@mumoshu sure, I'd be happy to diagnose with some help. My dev setup is here; the only bit I simplified was etcd, to remove some custom units which install Datadog (etcd seems stable anyway).

If anyone is interested in the Packer setup, this is the provisioner you need:

{
    "type": "shell",
    "inline": [
      "sudo mkdir -p /etc/coreos",
      "echo yes | sudo tee /etc/coreos/docker-1.12",
      "sudo systemctl stop docker.service",
      "sudo rm -rf /var/lib/docker"
    ]
}
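
On Container Linux releases of that era, writing yes to /etc/coreos/docker-1.12 is what selects the bundled Docker 1.12 instead of 17.09, which is what this provisioner relies on. A quick sanity check after baking and booting the image (a sketch, assuming SSH access to the node):

cat /etc/coreos/docker-1.12                      # expect: yes
docker version --format '{{.Server.Version}}'    # expect a 1.12.x version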

@mumoshu
Contributor

mumoshu commented Feb 25, 2018

@c-knowles Thx for sharing! I did read through your config. I suspect your t2.medium nodes are loaded with too many workloads.
Can you take note of the instance type of any node that hits the Docker hang from now on?
Also, would you mind trying one with more consistent CPU, c4.large or m4.large at minimum? t2.medium goes crazy as soon as you do anything more than running toy pods. I use them only for testing kube-aws's cluster provisioning process.

@cknowles
Contributor

@mumoshu I can try to roll another cluster side by side with this one, as I've already changed it back to Docker 1.12. The same node config works fine on Docker 1.12, by the way. For workloads, the only things this cluster is running are the pods that kube-aws creates, plus a few nginx/traefik containers that aren't receiving much if any traffic (I'm acceptance testing some basic deploys/helm charts).

@mumoshu
Contributor

mumoshu commented Feb 25, 2018

Thx! Then, my best guess at the moment is that the newer Docker either consumes a little more CPU than before or has some race-related issue that is triggered when CPU is low.

Anyway, that was the only thing I could read from the config.

If it isn't a t2 issue at all, I guess asking the Docker devs for more assistance would be the only way.

@cknowles
Contributor

@mumoshu not sure if you spotted the minSize: 0 on two worker pools? I keep the cluster config aligned between environments as much as I can, meaning the same number of node pools. This dev cluster has a spot fleet, but the workers stay pretty consistent at 1 x t2.medium on demand and 1 x c4.xlarge spot. Both the single controller and the etcd nodes are m3.medium.

@mumoshu
Contributor

mumoshu commented Feb 25, 2018

Thx! I didn't realize that, but it probably doesn't affect my guess? Your only t2.medium node seems likely to go crazy once its CPU credits hit zero.
Have you watched the remaining CPU credit count when you saw the Docker issue, btw?
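
(One way to check, as a sketch: pull the CPUCreditBalance metric from CloudWatch for the t2 instance around the time of the hang; the instance ID and time window below are placeholders.)

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2018-02-25T00:00:00Z --end-time 2018-02-25T06:00:00Z \
  --period 300 --statistics Average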

@thaJeztah

This looks related to moby/moby#36048 / moby/moby#36010, which was a bug in runc (see opencontainers/runc#1698); that bug had been around for a long time, but wasn't triggered until the Meltdown/Spectre patches began to roll out.

@jorge07
Contributor Author

jorge07 commented Mar 25, 2018

I made a PR to make amiId required in cluster.yaml: #1201. It will not fix the issue, but it will prevent random AMI updates during cluster updates.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Apr 23, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 23, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
