This repository has been archived by the owner on Oct 16, 2020. It is now read-only.

docker ps hangs #1654

Closed
crawford opened this issue Nov 9, 2016 · 42 comments

Comments

@crawford (Contributor) commented Nov 9, 2016

Issue Report

Bug

CoreOS Version

1185.3.0

Environment

AWS and GCE confirmed. Likely all environments.

Expected Behavior

docker ps properly returns the list of running containers.

Actual Behavior

docker ps eventually hangs (not sure which conditions cause it yet).

Reproduction Steps

In @matti's case:

$ docker ps -n 17
# works

$ docker ps -n 18
# hangs

Other Information

strace:

$ strace -s 128 docker ps -n 18
clock_gettime(CLOCK_MONOTONIC, {484793, 504192514}) = 0
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
read(5, 0xc820400000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json?limit=18 HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 98) = 98
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL

@crawford (Contributor, Author) commented Nov 9, 2016

@chancez pointed out that moby/moby#13885 and moby/libnetwork#1507 might be related.

@matti commented Nov 10, 2016

We run 5 machines and get this hang ~weekly (for the last 3 months). Is there anything we could run on the machines when the hang happens again?

@matti commented Nov 10, 2016

Machines are 8 CPU / 30 GB.

During the last hang we were able to start new containers while docker ps was hanging.

@Jwpe commented Nov 10, 2016

I am also experiencing this issue with the following setup:

CoreOS Version: 1185.3.0
Docker Server Version: 1.11.2

Not only does docker ps hang, but docker-compose commands fail with the following error message:

Nov 10 10:59:20 <aws_dns>.eu-west-1.compute.internal docker-compose[5872]: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
Nov 10 10:59:20 <aws_dns>.eu-west-1.compute.internal docker-compose[5872]: If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

In the Docker service logs these are the only errors I can see:

Nov 10 09:38:36 <aws_dns>.eu-west-1.compute.internal dockerd[972]: time="2016-11-10T09:38:36.701843769Z" level=error msg="attach: stdout: write unix /var/run/docker.sock->@: write: broken pipe"
Nov 10 09:38:36 <aws_dns>.eu-west-1.compute.internal dockerd[972]: time="2016-11-10T09:38:36.704819504Z" level=error msg="attach: stderr: write unix /var/run/docker.sock->@: write: broken pipe"

I am able to temporarily resolve the issue by manually restarting docker.service.

@matti commented Nov 10, 2016

@Jwpe when this happens again, try docker ps -n 1, docker ps -n 2, ... until it hangs, just to verify that the symptoms are the same.

docker-compose uses the same HTTP API, so it will hang as well. To be clear, the title of this bug is somewhat misleading -- docker ps is just a good command for detecting it, because it iterates through all the containers in the API.
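
To automate that probing, a rough sketch along these lines works (the 30-container upper bound and the 5-second timeout are arbitrary):

for n in $(seq 1 30); do
  # timeout exits non-zero if docker ps takes longer than 5 seconds (or fails outright)
  if ! timeout 5 docker ps -n "$n" > /dev/null; then
    echo "docker ps -n $n hung or failed"
    break
  fi
done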

@Jwpe commented Nov 10, 2016

@matti thanks for the tip, will do!

@Jwpe commented Nov 10, 2016

@matti for me docker ps -n 3 works correctly, but docker ps -n 4 hangs...

@matti commented Nov 10, 2016

@Jwpe (or anyone else reading this) -- please send SIGUSR1 to the docker daemon process and paste the logs.

@Jwpe commented Nov 10, 2016

@matti I get a huge goroutine stack dump on sending that signal:

$ ps -C docker
  PID TTY          TIME CMD
20791 ?        00:00:01 docker
$ sudo kill -s USR1 20791
$ journalctl -u docker -n 100
Nov 10 12:32:08 <aws_dns>.eu-west-1.compute.internal dockerd[20791]: time="2016-11-10T12:32:08.436895136Z" level=info msg="=== BEGIN goroutine stack dump ===\ngoroutine 20 [running]:\n
...

@matti commented Nov 10, 2016

@Jwpe yes... please paste the rest of it...

@juhazi formatted our stack trace here: https://gist.github.com/juhazi/fbf22602561b719e9480f8be4f8a4740

@matti commented Nov 10, 2016

And for the record: this time all of the daemon's API endpoints were jammed:

getpeername(5, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc820024d08, FUTEX_WAKE, 1)      = 1
read(5, 0xc8203d6000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 89) = 89
futex(0xc820024d08, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL

@basvdlei commented Nov 10, 2016

We are also seeing this behavior on some 1185.3.0 nodes (on VMware ESXi) where we are using systemd units with docker kill, docker rm, and docker pull ExecStartPre steps, as described here.
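
For context, the pre-start sequence such a unit runs boils down to something like the following (container and image names are placeholders; in the real units each line is an ExecStartPre=- entry, so individual failures are ignored):

# hypothetical service container, cleaned up and refreshed before ExecStart runs docker run
docker kill myservice
docker rm myservice
docker pull quay.io/example/myservice:latest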

Here are some goroutine stack dumps from 3 different machines that had the issue where docker ps didn't respond unless given a lower -n value.

In our case the units also created the following error messages:

Handler for POST /v1.23/containers/create returned error: Conflict. The name \"/xxxxxxxxxxx\" is already in use by container 0bf826dafe929711c98d65fa812ed75c4086dc3075e1c148fd1ebfd5c28b0544. You have to remove (or rename) that container to be able to reuse that name."
Handler for POST /v1.23/containers/xxxxxxxxxxx/stop returned error: No such container: xxxxxxxxxxx"
Handler for POST /v1.23/containers/xxxxxxxxxxx/kill returned error: Cannot kill container xxxxxxxxxxx: No such container: xxxxxxxxxxx"
Handler for DELETE /v1.23/containers/xxxxxxxxxxx returned error: No such container: xxxxxxxxxxx"

Even though the stop/kill/rm said the container was not there. This looks a lot like moby/moby#21198, which might be related.

Just to be complete, we are also running cadvisor and a custom container that lists containers about every 5 minutes.

@matti commented Nov 10, 2016

We are also running cadvisor (as the cluster runs http://www.kontena.io)

@tom-pryor commented Nov 11, 2016

Having the same issue since upgrading to 1185.3.0; here are the logs: https://gist.github.com/anonymous/0ad15cacc028c38ff7759abba7ace198

@bkleef commented Nov 14, 2016

Same here on DigitalOcean. Let me know if you need more logs; happy to help.

$ docker version
Client:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.6.3
 Git commit:   bac3bae
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.11.2
 API version:  1.23
 Go version:   go1.6.3
 Git commit:   bac3bae
 Built:
 OS/Arch:      linux/amd64
$ uname -a
Linux 4.7.3-coreos-r2 #1 SMP Tue Nov 1 01:38:43 UTC 2016 x86_64 Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00GHz GenuineIntel GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=1185.3.0
DISTRIB_CODENAME="MoreOS"
DISTRIB_DESCRIPTION="CoreOS 1185.3.0 (MoreOS)"

@bkleef commented Nov 14, 2016

I've updated CoreOS 1185.3.0 to 1192.2.0 (Docker 1.11.2 to 1.12.1). Will keep you guys posted!

@Jwpe commented Nov 16, 2016

Has anyone had any joy here? I'm seeing this every two or three restarts of my containers, and it's currently limiting my ability to repeatably deploy.

@victorgp commented:

We had to roll back to 1122.3.0 and it is working.

Version 1185.3.0 is not usable; every ~24 hours the Docker daemon becomes unresponsive.

@matti commented Nov 21, 2016

Daemon jammed again, this time not even docker ps -n 1 worked.

@Jwpe commented Nov 23, 2016

@bkleef are you seeing this issue still on 1192.2.0? Thanks!

@bkleef commented Nov 23, 2016

@Jwpe yes, it is fixed in 1192.2.0!

@Raffo commented Nov 29, 2016

@Jwpe It's not fixed for us in 1192.2.0.

@victorgp commented:

I don't think that bug is related; it was opened in June 2015, and this only started happening recently, with the latest CoreOS versions. If that were the bug, it should have happened in older CoreOS versions too.

@wallneradam commented Nov 30, 2016

@victorgp It is almost certainly related if you read all the comments, or at least the latest ones. It just happens more often under Docker v1.11.0 and up, which is only in the latest CoreOS stable, not the previous one. Not only does docker ps hang, you also cannot start or stop containers.

Anyway, my suggestion has been working for me for about a week. It was not too hard to rewrite my services that need to run processes or new containers to use ctr, which is integrated into CoreOS...
(The ctr utility is the CLI interface for containerd, which Docker itself uses behind the scenes since v1.11.0.)
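
For anyone curious, the kind of call I mean looks roughly like this (a sketch only: the socket path is what Docker 1.11/1.12 used on our hosts, and ctr's subcommands differ between containerd versions, so adjust to your setup):

# list running containers via Docker's embedded containerd, bypassing the hung Docker API
ctr --address unix:///var/run/docker/libcontainerd/docker-containerd.sock containers list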

@Jwpe commented Nov 30, 2016

@Raffo I haven't encountered the issue since upgrading to 1192.2.0, but I might just not have recreated the scenario where it occurs yet.

@philk commented Dec 1, 2016

We're also seeing this on 1185.3.0. At least on the one system where it's currently happening, -n 9 works but -n 10 fails. The ctr command can still communicate with containerd, but the failing Docker daemon breaks the kubelet, making this a completely failed node. Specifically, pods get stuck in Pending/ContainerCreating/Terminating.

@crawford added this to the CoreOS Alpha 1263.0.0 milestone on Dec 2, 2016
@hjacobs commented Dec 5, 2016

We had a hanging docker ps as well (moby/moby#28889); we could reproduce it by heavily using docker exec. We fixed the issue for ourselves by manually upgrading to Docker 1.13 (zalando-incubator/kubernetes-on-aws#164).
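
For reference, a minimal sketch of the kind of docker exec churn that reproduced it for us -- the container name and the loop counts are made up for illustration:

# hammer one long-running container with many concurrent short-lived execs
for i in $(seq 1 1000); do
  for j in $(seq 1 10); do
    docker exec some-running-container true &
  done
  wait
done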

We are now waiting for the Docker 1.13 release and its inclusion in CoreOS: zalando-incubator/kubernetes-on-aws#167

@seanknox commented Dec 6, 2016

I was also able to reproduce this issue using https://github.com/crosbymichael/docker-stress on a Kubernetes worker node running CoreOS Stable 1185.3.0.

Running docker_stress_linux_amd64 -k 3s -c 5 --containers 1000 (5 concurrent workers creating/deleting containers, a maximum container lifetime of 3s, creating up to 1000 containers) on an m4.large instance on AWS would leave the Docker daemon unresponsive after about three minutes.

Upgraded to CoreOS Beta 1235.1.0 and I haven't been able to reproduce it. Whereas running 5 concurrent docker_stress workers would kill CoreOS Stable after a few minutes, I was able to run with 10 and 15 concurrent workers until test completion on CoreOS Beta.
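
For anyone wanting to repeat this, the invocation is just the one described above (binary from the crosbymichael/docker-stress repo); the watch loop is only my own way of spotting when the daemon stops answering, with an arbitrary 10-second cutoff:

# 5 workers, 3s max container lifetime, up to 1000 containers
./docker_stress_linux_amd64 -k 3s -c 5 --containers 1000 &

# in another shell: report when docker ps stops returning within 10 seconds
while timeout 10 docker ps > /dev/null; do sleep 5; done
echo "docker ps stopped responding"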

CoreOS Stable 1185.3.0
kernel: 4.7.3
docker: 1.11.2

CoreOS Beta 1235.1.0
kernel: 4.8.6
docker: 1.12.3

@mward29 commented Dec 7, 2016

Just as an FYI, I'm not entirely sure this is CoreOS-specific. We are running into the same issues on CentOS. It feels more related to Docker at the moment, but I've been unable to get a dump out of Docker on failure to validate that.

@hjacobs commented Dec 7, 2016

@mward29 yes, it's a Docker bug, fixed in 1.13 (and it will be backported to 1.12 AFAIK, see moby/moby#28889) -- still, it should be fixed in CoreOS (by upgrading to the fixed Docker as soon as it's released).

@dm0- commented Dec 14, 2016

We are updating to Docker 1.12.4, which contains the upstream fixes for the docker daemon deadlocks (moby/moby#29095, moby/moby#29141). It will be available in the alpha later this week. You can reopen this if problems persist.

@Bekt commented Jan 10, 2017

This kept happening to me about 10% of the time. For non-alpha releases, I worked around it by adding a unit that constantly checks docker ps and restarts Docker if it hangs.

- name: ping-docker.service
  command: start
  content: |
    [Unit]
    Description=Make sure docker is always running.
    [Service]
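    # Probe docker ps with a 2-second timeout and restart docker if the probe hangs or fails;
    # the command exits after each probe, so Restart=always with RestartSec=10 re-runs it every ~10 seconds.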
    ExecStart=/usr/bin/bash -c "(timeout 2s docker ps > /dev/null) || (systemctl restart docker)"
    Restart=always
    RestartSec=10


@MartinPyka commented Mar 6, 2017

How do you stop the stdout of docker logs? I would love to have a reliable method to reproduce the error.

@krmayankk commented Jul 28, 2017

@hjacobs I am seeing this on 1.12.6. Is this fixed, and if so, in which version?
@dm0- you seem to suggest that it's fixed in 1.12.4; is that right? I am still seeing it in 1.12.6.

docker version
Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:20:01 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:20:01 2017
 OS/Arch:      linux/amd64

strace for docker ps

futex(0xc820062908, FUTEX_WAKE, 1)      = 1
futex(0xc820062908, FUTEX_WAKE, 1)      = 1
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 4
setsockopt(4, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(4, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, 23) = 0
epoll_create1(EPOLL_CLOEXEC)            = 5
epoll_ctl(5, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3597498464, u64=140156970301536}}) = 0
getsockname(4, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(4, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc820062908, FUTEX_WAKE, 1)      = 1
read(4, 0xc820349000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(4, "GET /v1.24/containers/json HTTP/"..., 95) = 95
futex(0xc820062d08, FUTEX_WAKE, 1)      = 1
futex(0x132ccc8, FUTEX_WAIT, 0, NULL^CProcess 119740 detached

@piyushkv1 commented:

I'm also seeing this with Docker server version 1.12.6.

@deryadorian commented Aug 21, 2017

sudo systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/run/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: inactive (dead)
Docs:

after docker ps:

systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/run/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2017-08-21 13:20:50 UTC; 2s ago
Docs: http://docs.docker.com
Main PID: 25815 (dockerd)

@squeed commented Aug 21, 2017

That seems reasonable; docker is a socket-activated service.
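
A quick way to see that mechanism (assuming the stock Container Linux units, where docker.socket owns /var/run/docker.sock and the first client connection starts docker.service):

systemctl status docker.socket   # listening on /var/run/docker.sock
docker ps                        # any API call through the socket activates the daemon
systemctl status docker.service  # now active (running)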


@lucab commented Dec 13, 2019

This ticket is unfortunately starting to attract spurious follow-ups not related to Container Linux. I'm thus going to lock it to prevent further unrelated comments.

@coreos locked this as off-topic and limited conversation to collaborators on Dec 13, 2019