This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Pods Terminating forever due to Docker 17.09-ce Bug #1135

Closed
jorge07 opened this issue Feb 15, 2018 · 18 comments
Labels
documentation, kind/bug, lifecycle/rotten

Comments

@jorge07
Contributor

jorge07 commented Feb 15, 2018

We have a cluster created with v0.9.9, k8s 1.8.4, and Docker 17.09.1-ce.

Steps to Reproduce

Log into a host and run:

docker run -it ubuntu /bin/bash
root@943b8935e38e:/# exit
exit

[nothing happens]

docker ps still shows the ubuntu container
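
For reference, a quick way to confirm which Docker the node is running and to see the stuck container (a sketch, run on the affected host; the image is just the ubuntu one from the repro above):

# Confirm the Docker version actually in use on the node
docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'

# The exited container should be gone, but on an affected node it is still listed here
docker ps --format 'table {{.ID}}\t{{.Image}}\t{{.Status}}'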

Related

moby/moby#33820

It looks like this Docker version is not compatible with k8s yet.

How can I find the correct CoreOS AMI with the validated Docker version, 17.03.2?

Thanks in advance
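
(One way to answer this, as a sketch: list the Container Linux AMIs published in your region with the AWS CLI and cross-check each release's Docker version against the Container Linux release notes. The owner ID below is CoreOS's public AMI account and the name filter is an assumption about their naming scheme; adjust both if they don't match.)

aws ec2 describe-images \
  --owners 595879546273 \
  --filters "Name=name,Values=CoreOS-stable-*" \
  --query 'Images[].[Name,ImageId,CreationDate]' \
  --output table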

@cknowles
Contributor

Glad someone else reproduced this. I've also tried k8s 1.9.x and have so far been unable to get a stable dev cluster, partly due to this problem.

@jorge07
Contributor Author

jorge07 commented Feb 15, 2018

CoreOS releases jump the Docker version from 1.12 to 17.09, so 17.03, the only version validated for k8s 1.8, is not available.

@jorge07
Contributor Author

jorge07 commented Feb 16, 2018

Downgrading to 1.12 fixes the issue:

-#amiId: ""
+amiId: "ami-1a7de360"
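
(A sketch of rolling the pinned AMI out after editing cluster.yaml; the exact kube-aws subcommands and flags depend on the kube-aws version in use, so verify against your setup.)

kube-aws validate   # check that the edited cluster.yaml still renders cleanly
kube-aws update     # replace nodes so new instances boot from the pinned AMI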

@jorge07
Contributor Author

jorge07 commented Feb 20, 2018

Related to #1106

@cknowles
Contributor

I was looking at this further for some time but didn't get far in diagnosing the issue. I know @mumoshu has been running at least a dev cluster on the latest Container Linux AMI with Docker 17.09, and that's been stable. For the last month or so I've been unable to get a healthy cluster in dev that I can promote to other environments. I'm considering also downgrading for now.

Related comments:
#728 (comment)
#941 (comment)

@mumoshu
Contributor

mumoshu commented Feb 23, 2018

So I basically believe that:

  • We should recommend that users stick to older AMIs, or use a custom AMI with the old Docker, for K8s 1.8.x
  • While recommending that users just move to K8s 1.9.x

And for @c-knowles's case, we should just make sure K8s 1.9.x is compatible with the newer Docker?

@c-knowles Would you mind sharing a detailed configuration of your cluster, so that we can diagnose the root cause(s) together with folks from upstream?

@mumoshu mumoshu added the kind/bug and documentation labels Feb 23, 2018
@cknowles
Contributor

@mumoshu sure, I'd be happy to diagnose with some help. My dev setup is here; the only bit I simplified was etcd, to remove some custom units which install Datadog (etcd seems stable anyway).

If anyone is interested in the Packer setup, this is the provisioner you need:

{
    "type": "shell",
    "inline": [
      "sudo mkdir -p /etc/coreos",
      "echo yes | sudo tee /etc/coreos/docker-1.12",
      "sudo systemctl stop docker.service",
      "sudo rm -rf /var/lib/docker"
    ]
}
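
On Container Linux releases of that era, writing yes to /etc/coreos/docker-1.12 is what selects the bundled Docker 1.12 instead of 17.09, which is what this provisioner relies on. A quick sanity check after baking and booting the image (a sketch, assuming SSH access to the node):

cat /etc/coreos/docker-1.12                      # expect: yes
docker version --format '{{.Server.Version}}'    # expect a 1.12.x version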

@mumoshu
Contributor

mumoshu commented Feb 25, 2018

@c-knowles Thx for sharing! I did read through your config. I suspect your t2.medium nodes are loaded with too many workloads.
Can you take note of the instance type of any node that hits the Docker hang from now on?
Also, would you mind trying one with more consistent CPU, c4.large or m4.large at minimum? t2.medium goes crazy as soon as you do anything more than running toy pods. I use them only for testing kube-aws's cluster provisioning process.

@cknowles
Contributor

@mumoshu I can try to roll another cluster side by side with this one, as I've already changed it back to Docker 1.12. The same node config works fine on Docker 1.12, by the way. For workloads, the only things this cluster is running are the pods that kube-aws creates, plus a few nginx/traefik containers that aren't receiving much if any traffic (I'm acceptance testing some basic deploys/helm charts).

@mumoshu
Contributor

mumoshu commented Feb 25, 2018

Thx! Then, my best guess at the moment is that the newer Docker either consumes a little more CPU than before or has some race-related issue that is triggered when CPU is low.

Anyway, that was the only thing I could read from the config.

If it isn't a t2 issue at all, I guess asking the Docker devs for more assistance would be the only way.

@cknowles
Contributor

@mumoshu not sure if you spotted the minSize: 0 on two worker pools? I keep the cluster config aligned between environments as much as I can, meaning the same number of node pools. This dev cluster has a spot fleet, but the workers stay pretty consistent at 1 x t2.medium on demand and 1 x c4.xlarge spot. Both the single controller and the etcd nodes are m3.medium.

@mumoshu
Contributor

mumoshu commented Feb 25, 2018

Thx! I didn't realize that, but it probably doesn't affect my guess? Your only t2.medium node seems likely to go crazy once its CPU credits hit zero.
Have you watched the remaining CPU credit count when you saw the Docker issue, btw?
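
(One way to check, as a sketch: pull the CPUCreditBalance metric from CloudWatch for the t2 instance around the time of the hang; the instance ID and time window below are placeholders.)

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2018-02-25T00:00:00Z --end-time 2018-02-25T06:00:00Z \
  --period 300 --statistics Average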

@thaJeztah

This looks related to moby/moby#36048 / moby/moby#36010, which was a bug in runc (see opencontainers/runc#1698); that bug had been around for a long time, but wasn't triggered until the Meltdown/Spectre patches began to roll out.

@jorge07
Contributor Author

jorge07 commented Mar 25, 2018

I made a PR to make amiId required in cluster.yaml: #1201. It will not fix the issue, but it will prevent random AMI updates during cluster updates.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Apr 23, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 23, 2019
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
