docker ps hangs #965

rwky · 2015-11-04T19:56:47Z

After a while of running various containers for no apparent reason docker ps hangs. All other docker commands work

Docker info:

Server:
Version: 1.8.3
API version: 1.20
Go version: go1.5.1
Git commit: cedd534-dirty
Built: Fri Oct 16 04:20:25 UTC 2015
OS/Arch: linux/amd64

Containers: 23
Images: 520
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.2-coreos
Operating System: CoreOS 835.1.0
CPUs: 8
Total Memory: 31.46 GiB
ID: KOW4:OCYI:PTET:G2F7:BPHH:Y5GU:XMXT:U2D3:5SKA:MO63:SE2Y:I42C

I ran strace docker ps via toolbox to see if there's a clue of what's going on and it produced

socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/journal/socket"}, 30) = -1 ENOENT (No such file or directory)
close(3)                                = 0
ioctl(0, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
stat("/root/.docker/config.json", 0xc82008a240) = -1 ENOENT (No such file or directory)
stat("/root/.dockercfg", 0xc82008a2d0)  = -1 ENOENT (No such file or directory)
futex(0xc8201a2110, FUTEX_WAKE, 1)      = 1
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
setsockopt(3, SOL_SOCKET, SO_BROADCAST, [1], 4) = 0
connect(3, {sa_family=AF_LOCAL, sun_path="/media/root/var/run/docker.sock"}, 34) = 0
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=172969648, u64=140205085380272}}) = 0
getsockname(3, {sa_family=AF_LOCAL, NULL}, [2]) = 0
getpeername(3, {sa_family=AF_LOCAL, sun_path="/var/run/docker.sock"}, [23]) = 0
futex(0xc8201a2110, FUTEX_WAKE, 1)      = 1
read(3, 0xc8200ab000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(3, "GET /v1.20/containers/json HTTP/"..., 93) = 93
epoll_wait(4, {{EPOLLOUT, {u32=172969648, u64=140205085380272}}}, 128, 0) = 1
futex(0x181f250, FUTEX_WAIT, 0, NULL

So it looks like it's writing the get command but the server isn't responding.

The text was updated successfully, but these errors were encountered:

crawford · 2015-11-05T16:07:26Z

@yifan-gu you were showing me this behavior the other day, but I forget what was causing it. Can you refresh my memory?

yifan-gu · 2015-11-05T19:15:58Z

@crawford @rwky
I reported in this docker issue.
moby/moby#13885 (comment)

spiffxp · 2015-11-23T17:37:32Z

Running into this as well, see kubernetes/kubernetes#17525

mischief · 2015-11-23T18:27:28Z

@rwky have you experienced this after your initial report?

rwky · 2015-11-23T19:20:02Z

@mischief yep at least once a week.

mischief · 2015-11-24T00:48:28Z

@rwky any tips on reproducing it? is it intermittent, or are you doing something to work around it?

rwky · 2015-11-24T09:26:53Z

I don't know how to predictably replicate it, however the server in question spawns and destroys (using the --rm flag) several docker containers every minute. I suspect that just running a CoreOS server that does that for a few days will trigger the hang.

The only way to resolve is to reboot the server (you could probably kill and restart docker but rebooting does the job).

pierreozoux · 2015-11-26T13:09:28Z

I'm seeing the same problem now (I'm on IRC, and will post the content of IRC here), I'll have to reboot the server in 2 hours, if in the mean time, you want me to run commands, let me know!
(I run CoreOS stable)

Ok, I'm rebooting now, here is the IRC discussion: https://framabin.org/?3b8e73977c597b06#7wIlBFnWRaGO0+o34QkfauXwhm2kSvbk+NInpxe6kes=

leonty · 2016-02-19T13:46:31Z

Same for me on docker 1.10.1 after intensive using of dockerui web app (https://hub.docker.com/r/crosbymichael/dockerui/).

vpal · 2016-02-25T14:59:34Z

I experienced similar hang with CoreOS stable (835.12.0).
I could not kill -9 the docker daemon or any of its processes.
Had to reboot the machine.

Okay, I have no idea if this is the same issue, but it is reproducible with this command:
/usr/bin/docker run --pid=host --privileged -v /etc/kubernetes/manifests:/etc/kubernetes/manifests -v /var/run/docker.sock:/var/run/docker.sock --rm --name kubelet gcr.io/google_containers/hyperkube:v1.1.8 /hyperkube kubelet --v=2 --config=/etc/kubernetes/manifests

gopinatht · 2016-03-02T08:13:09Z

+1 for a rapid resolution of this issue on CoreOS 970.1.0 with Docker 1.10.1 with kubernetes 1.1.7. Is there an active investigation of the problem? There was a suggestion that it might be a kernel issue (moby/moby#13885 (comment)) but I am not sure how it is addressed or should be addressed in CoreOS.

crawford · 2016-03-02T14:17:21Z

@gopinatht can you give the Beta channel a shot? That has the upstream fix for device mapper.

gopinatht · 2016-03-02T14:49:55Z

@crawford Thanks for the info. I will give it a try.

Just to confirm, the alpha channel does not have the fix (I am using the absolute latest version of Alpha)? Also, docker info output says the storage driver is overlay:
Containers: 26 Running: 10 Paused: 0 Stopped: 16 Images: 12 Server Version: 1.10.1 Storage Driver: overlay Backing Filesystem: extfs Execution Driver: native-0.2 Logging Driver: json-file Plugins: Volume: local Network: host bridge null Kernel Version: 4.4.1-coreos Operating System: CoreOS 970.1.0 (Coeur Rouge) OSType: linux Architecture: x86_64 CPUs: 2 Total Memory: 3.862 GiB Name: ip-1x-xx-xx-xx.ec2.internal ID: UVON:7XXW:5AZD:QJTG:2DAW:6Y52:CKN3:EVZZ:WESO:COSQ:4UGC:VKNO

crawford · 2016-03-02T15:12:32Z

Sorry, I missed the CoreOS version before. That kernel also has the patch. I doubt it will behave any differently.

gopinatht · 2016-03-02T16:32:44Z

@crawford No worries. Any other insights on where I can investigate? I am kind of stuck on what I should do next. This issue has become a blocker for my team using a CoreOS/kubernetes/docker solution.

gopinatht · 2016-03-02T20:00:39Z

Guys, I logged an issue with Docker team and they are indicating that the docker build on CoreOS is different enough (different Go version and custom build on CoreOS) (moby/moby#20871 (comment)) for this to be more of a CoreOS issue.

Help or pointers from CoreOS team would be appreciated.

crawford · 2016-03-03T01:37:00Z

I really don't know where to start on this one. We are going to bump to Docker 1.10.2 in the next Alpha and then bump to Go 1.5.3 in the following one. So, by this time next week, we will have a mostly vanilla Docker shipping in CoreOS. Given how long this bug has been open, I don't expect the version bump to change anything, but at least we'll be able to debug with the Docker folks a bit easier. Let's reevaluate the situation next week.

vpal · 2016-03-03T09:51:42Z

To help debugging. If I add the -v /sys:/sys:ro option to the command I mentioned above the kubelet container and docker daemon does not hang.
So the full command that does not seem to hang:
/usr/bin/docker run --pid=host --privileged -v /sys:/sys:ro -v /etc/kubernetes/manifests:/etc/kubernetes/manifests -v /var/run/docker.sock:/var/run/docker.sock --rm --name kubelet gcr.igoogle_containers/hyperkube:v1.1.8 /hyperkube kubelet --v=2 --config=/etc/kubernetes/manifests

gopinatht · 2016-03-03T19:21:00Z

@vpal Thanks for this info. My system is a little different in that I am not using hyperkube on my AWS cluster. But I wonder why mounting /sys as read only mount should solve the docker hang issue. I am not able to make the connection.

gopinatht · 2016-03-04T06:37:16Z

@crawford I investigated the issue further and posted the results of my investigation there: moby/moby#20871 (comment)

FYI.

gopinatht · 2016-03-04T19:19:19Z

I asked this question on the docker bug page but thought it would be relevant here as well: Could this issue be related in any way to the hairpin NAT setting? Is is possible there is some sort of thread deadlocks on a device? When I ran systemctl on my CoreOS distribution, it did not show any device with name vethbfe8f8f. Could that be the source of the issue?

crawford · 2016-03-13T17:10:24Z

@gopinatht as of 983.0.0, we are shipping Docker built with Go 1.5.3, so that should eliminate a few variables.

gopinatht · 2016-03-14T15:18:53Z

@crawford Thanks for the heads-up. I will check it out.

abhisheyke · 2016-06-10T15:52:09Z

I came across same issue with docker 1.8.3. I noticed docker built with Go Version 1.4.2 .
is it a known bug in docker 1.8.3 ?
@gopinatht any luck using docker built with higher go version? Thanks

[root@abhi ]# docker version
Client:
Version: 1.8.3
API version: 1.20
Go version: go1.4.2
Git commit: f4bf5c7
Built: Mon Oct 12 06:06:01 UTC 2015
OS/Arch: linux/amd64

Server:
Version: 1.8.3
API version: 1.20
Go version: go1.4.2
Git commit: f4bf5c7
Built: Mon Oct 12 06:06:01 UTC 2015
OS/Arch: linux/amd64
[root@localhost abhi]# docker info
Containers: 20
Images: 324
Storage Driver: devicemapper
Pool Name: docker-8:7-268591717-pool
Pool Blocksize: 65.54 kB
Backing Filesystem: xfs
Data file: /dev/vg-abhi/data
Metadata file: /dev/vg-abhi/metadata
Data Space Used: 16.12 GB
Data Space Total: 107.4 GB
Data Space Available: 91.25 GB
Metadata Space Used: 26.04 MB
Metadata Space Total: 5.369 GB
Metadata Space Available: 5.343 GB
Udev Sync Supported: true
Deferred Removal Enabled: false
Library Version: 1.02.93-RHEL7 (2015-01-28)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.10.0-229.el7.x86_64
Operating System: CentOS Linux 7 (Core)
CPUs: 48
Total Memory: 125.6 GiB
Name: localhost.localdomain
ID: K76J:2RP6:HKQX:NMJS:AA5D:EXY2:VGGD:5YB4:KRFI:3HW6:WWG4:6QDY
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

crawford · 2016-06-12T19:30:27Z

Is this still an issue with 1.10.3? I haven't seen this one in a while (in fact, I forgot about this bug).

rwky · 2016-06-13T14:56:52Z

@crawford I've not seen this on 1.10.3 so far as I can remember, the server I normally see it on runs the beta channel so has been on 1.10.3 for a while.

Just checked my logs, not seen it since May 3rd.

crawford · 2016-06-13T20:17:56Z

@rwky Thank you for checking. Let's go ahead and close this one. We can reopen it if it turns out to still be a problem.

feelobot · 2016-07-13T12:48:55Z

seeing this still:

core@ip-172-20-13-165 ~ $ docker info
Containers: 50
 Running: 42
 Paused: 0
 Stopped: 8
Images: 241
Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.5.7-coreos
Operating System: CoreOS 1010.6.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 29.44 GiB
Name: ip-172-20-13-165.ec2.internal
ID: SS63:BOHR:326B:NZUQ:BOMB:43U2:XOLL:UCT7:BDOS:LBD3:DQJK:O52K

zihaoyu · 2016-07-30T02:54:55Z

We see this quite often on 1068.6.0 and Kubernetes 1.3.3. Please consider reopening this issue.

core@ip-10-72-16-158 ~ $ docker info
Containers: 135
 Running: 77
 Paused: 0
 Stopped: 58
Images: 59
Server Version: 1.10.3
Storage Driver: overlay
 Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.6.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 157.4 GiB
Name: ip-10-72-16-158.us-west-2.compute.internal
ID: F6BA:V7OB:IYIQ:QGDZ:GQCP:2ZUW:A7S7:GE73:AG6Z:SLND:CFKU:A7R3

Also seeing these a lot in journal -u docker.service:

Jul 30 00:40:29 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:40:29.198751686Z" level=info msg="Container bb8b577470e1ec6be491414d7d39df767a113e4fb6efbc3178c7b73ea14ff15e failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470564310Z" level=info msg="Container d5677abe86893968f12f1d3db15ef0097dbd218d59c52e4c06d809d8c0205664 failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470599085Z" level=info msg="Container 58b3ac90b2fef33b7cb76731c2df1d56d5459be0e7f0312169e6d47b599f6d1d failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:09 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:09.470564635Z" level=info msg="Container 85d9338bddc0553f079651c737fa3c165df53ffe35a852e579a11eb1732316c0 failed to exit within 30 seconds of SIGTERM - using the force"
Jul 30 00:41:10 ip-10-72-16-158.us-west-2.compute.internal dockerd[43231]: time="2016-07-30T00:41:10.627024947Z" level=error msg="Handler for POST /containers/36b70ef81bd78b765f58245a49883b4e11b3ab291ce4e67535c4e9314921af10/stop returned error: Container already stopped"

rwky · 2016-08-04T20:37:13Z

This just happened to me again, however at the same time I'm experiencing moby/moby#5618 which I suspect is the cause @feelobot @zihaoyu perhaps you are experiencing the same thing?

zihaoyu · 2016-08-04T20:55:07Z

@rwky Not sure if it is related, but I did see

Aug 04 11:10:00 ip-10-72-16-158.us-west-2.compute.internal kernel: unregister_netdevice: waiting for veth9c879be to become free. Usage count = 1```

rwky · 2016-08-04T20:58:06Z

Yep that looks like it's the problem.

nizar-m · 2016-08-09T15:43:36Z

This is happening for me too.

Docker version 1.10.3, build 3cd164c

CoreOS stable (1068.8.0)

Kernel Version: 4.6.3-coreos
OS Image: CoreOS 1068.8.0 (MoreOS)
Container Runtime Version: docker://1.10.3
Kubelet Version: v1.2.4.7+cb08580860a02a-dirty
Kube-Proxy Version: v1.2.4.7+cb08580860a02a-dirty

docker info

Containers: 131
Running: 90
Paused: 0
Stopped: 41
Images: 17
Server Version: 1.10.3
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
Volume: local
Network: null host bridge
Kernel Version: 4.6.3-coreos
Operating System: CoreOS 1068.8.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 13.7 GiB

dmesg shows the folowing

[163194.652062] unregister_netdevice: waiting for vethdd22a32 to become free. Usage count = 1

We are using Flannal driver

alph486 · 2016-08-31T17:57:58Z

Echo @rwky: reported details here: moby/moby#5618 (comment) . I'm probably going to try @sercand 's suggestion (moby/moby#5618 (comment))

It should probably also be mentioned that this of course affects, by default, clusters build by the kube-aws project.

mischief · 2016-08-31T18:05:14Z

sounds like a dup of #254

liquid-sky · 2016-11-08T11:00:28Z

@mischief, doesn't look like a dup, because we run container the same way as @rwky (every minute with --rm) and I can see that docker hangs exactly on the container, long before unregistered lo message happens. On some machines there's no unregistered lo message at all, but docker ps is hanging.

docker -D info
Containers: 6
 Running: 2
 Paused: 0
 Stopped: 4
Images: 6
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: host bridge null
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.8 GiB
Name: <hostname>
ID: GFHQ:Y6IV:XCQI:Y2NA:PCQN:DEII:OSPZ:LENL:OCFU:6CBI:MDFV:EMD7
Docker Root Dir: /var/lib/docker
Debug mode (client): true
Debug mode (server): false
Username: <username>
Registry: https://index.docker.io/v1/

matti · 2016-11-09T10:05:35Z

I don't have unregister_netdevice in dmesg, but docker ps hangs like this:

$ docker ps -n 17
# works

$ docker ps -n 18
# hangs

strace:

$ strace -s  128 docker ps -n 18
clock_gettime(CLOCK_MONOTONIC, {484793, 504192514}) = 0
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
read(5, 0xc820400000, 4096)             = -1 EAGAIN (Resource temporarily unavailable)
write(5, "GET /v1.23/containers/json?limit=18 HTTP/1.1\r\nHost: \r\nUser-Agent: Docker-Client/1.11.2 (linux)\r\n\r\n", 98) = 98
futex(0xc820064108, FUTEX_WAKE, 1)      = 1
futex(0x22509c8, FUTEX_WAIT, 0, NULL

so $ curl --unix-socket /var/run/docker.sock http://localhost/v1.23/containers/json?limit=17 works, but curl --unix-socket /var/run/docker.sock http://dolantrump/v1.23/containers/json?limit=18 hangs.

Coreos: VERSION=1185.3.0 BUILD_ID=2016-11-01-0605

$ docker info
Containers: 52
 Running: 28
 Paused: 0
 Stopped: 24
Images: 110
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.7.3-coreos-r2
Operating System: CoreOS 1185.3.0 (MoreOS)
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 29.46 GiB
Name: gcp-1-5.c.kontena-1162.internal
ID: CTTM:L5IQ:K24M:GNOZ:JOMO:CZEG:Z756:EI3M:ZGPS:HYPO:SXVO:GIO5
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/

running in google cloud platform.

crawford · 2016-11-09T18:52:16Z

Since this is a new bug, let's move this conversation over to #1654.

klawed · 2017-01-06T15:42:27Z

systemctl restart docker
totally worked for me (which is definitely nicer than a reboot).

crawford added the component/docker label Nov 5, 2015

crawford added kind/bug priority/P1 team/os area/usability kind/friction labels Nov 23, 2015

gopinatht mentioned this issue Mar 2, 2016

docker ps command hangs on CoreOS with Docker version 1.10.1 and overlay storage driver moby/moby#20871

Closed

alexkappa mentioned this issue Mar 12, 2016

Kubernetes workers fail after a few hours coreos/coreos-kubernetes#306

Closed

crawford closed this as completed Jun 13, 2016

docker ps hangs #965

docker ps hangs #965

Comments

rwky commented Nov 4, 2015

crawford commented Nov 5, 2015

yifan-gu commented Nov 5, 2015

spiffxp commented Nov 23, 2015

mischief commented Nov 23, 2015

rwky commented Nov 23, 2015

mischief commented Nov 24, 2015

rwky commented Nov 24, 2015

pierreozoux commented Nov 26, 2015

leonty commented Feb 19, 2016

vpal commented Feb 25, 2016

gopinatht commented Mar 2, 2016

crawford commented Mar 2, 2016

gopinatht commented Mar 2, 2016

crawford commented Mar 2, 2016

gopinatht commented Mar 2, 2016

gopinatht commented Mar 2, 2016

crawford commented Mar 3, 2016

vpal commented Mar 3, 2016

gopinatht commented Mar 3, 2016

gopinatht commented Mar 4, 2016

gopinatht commented Mar 4, 2016

crawford commented Mar 13, 2016

gopinatht commented Mar 14, 2016

abhisheyke commented Jun 10, 2016

crawford commented Jun 12, 2016

rwky commented Jun 13, 2016 • edited Loading

crawford commented Jun 13, 2016

feelobot commented Jul 13, 2016

zihaoyu commented Jul 30, 2016 • edited Loading

rwky commented Aug 4, 2016

zihaoyu commented Aug 4, 2016

rwky commented Aug 4, 2016

nizar-m commented Aug 9, 2016

alph486 commented Aug 31, 2016 • edited Loading

mischief commented Aug 31, 2016

liquid-sky commented Nov 8, 2016

matti commented Nov 9, 2016

crawford commented Nov 9, 2016

klawed commented Jan 6, 2017

rwky commented Jun 13, 2016 •

edited

Loading

zihaoyu commented Jul 30, 2016 •

edited

Loading

alph486 commented Aug 31, 2016 •

edited

Loading