Docker 1.12.2 docker ps hangs #27323

Closed
saithala opened this issue Oct 12, 2016 · 13 comments

Labels
area/daemon area/networking area/swarm priority/P1 Important: P1 issues are a top priority and a must-have for the next release. version/1.12
Comments

@saithala

Description

After upgrading to the latest version of Docker Engine with Swarm mode (1.12.2), executing the command docker ps hangs every now and then. Restarting the Docker daemon resolves the issue.

@LK4D4
Contributor

LK4D4 commented Oct 12, 2016

@saithala Can you try sending SIGUSR1 to the dockerd process when it happens? That should write a goroutine stack trace to the logs. Please send us that stack trace.
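
(For context: as noted above, the daemon dumps a goroutine stack trace to its logs when it receives SIGUSR1. Below is a minimal sketch of the general Go pattern that produces such a dump, written as a hypothetical standalone program rather than dockerd's actual handler, just to show where the trace in the logs comes from.)

```go
// Minimal sketch (assumption: this mirrors the general pattern, not dockerd's code):
// install a SIGUSR1 handler and write all goroutine stacks to the log.
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"syscall"
)

// dumpStacksOnSignal is a hypothetical helper name used only for this sketch.
func dumpStacksOnSignal() {
	c := make(chan os.Signal, 1)
	signal.Notify(c, syscall.SIGUSR1)
	go func() {
		for range c {
			buf := make([]byte, 1<<20)    // 1 MiB buffer; arbitrary size for the sketch
			n := runtime.Stack(buf, true) // true = include every goroutine
			log.Printf("=== goroutine stack dump ===\n%s=== end ===", buf[:n])
		}
	}()
}

func main() {
	dumpStacksOnSignal()
	log.Printf("pid %d: send SIGUSR1 (kill -USR1 <pid>) to dump goroutine stacks", os.Getpid())
	select {} // block forever, like a long-running daemon
}
```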

@aluzzardi
Member

/cc @tonistiigi

@aluzzardi
Member

This looks pretty awful

@alexvranceanu

alexvranceanu commented Oct 14, 2016

I'm experiencing the same issue with 1.12.2-rc2 running on CentOS 7.2.1511, kernel 3.10.0-327.36.1.el7.x86_64 #1 SMP, with Swarm active (this node is a manager):

The stack from Docker engine is attached.
stack.txt

```
Client:
 Version:      1.12.2-rc2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   ad9538a
 Built:
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.2-rc2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   ad9538a
 Built:
 OS/Arch:      linux/amd64
```

@alexvranceanu

alexvranceanu commented Oct 14, 2016

As an update, apparently the remote API is not responding to calls:

```
curl -A "Docker-Client/1.12.2-rc2 (linux)" -v http://localhost:2375/v1.24/info
* About to connect() to localhost port 2375 (#0)
*   Trying ::1...
* Connected to localhost (::1) port 2375 (#0)
> GET /v1.24/info HTTP/1.1
> User-Agent: Docker-Client/1.12.2-rc2 (linux)
> Host: localhost:2375
> Accept: */*
>
```

@alexvranceanu

Sounds like it might be related to #27272

@cpuguy83
Member

Looks like a deadlock on the network controller.
Ping @mrjana

This seems like the relevant code holding the controller lock:

```
goroutine 450 [semacquire, 98 minutes]:
sync.runtime_Semacquire(0xc823b2525c)
    /usr/local/go/src/runtime/sema.go:47 +0x26
sync.(*Mutex).Lock(0xc823b25258)
    /usr/local/go/src/sync/mutex.go:83 +0x1c4
github.com/docker/libnetwork.(*controller).cleanupServiceBindings(0xc8201c04b0, 0xc821cbe720, 0x19)
    /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/service_linux.go:46 +0xf3
github.com/docker/libnetwork.(*network).delete(0xc822dc4c80, 0xc822fb6700, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/network.go:770 +0xd14
github.com/docker/libnetwork.(*network).Delete(0xc822fb6640, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/network.go:722 +0x32
github.com/docker/docker/daemon.(*Daemon).deleteNetwork(0xc820159860, 0xc820d82800, 0x8, 0x8000101, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/network.go:371 +0x4d4
github.com/docker/docker/daemon.(*Daemon).DeleteManagedNetwork(0xc820159860, 0xc820d82800, 0x8, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/network.go:352 +0x46
github.com/docker/docker/daemon/cluster/executor/container.(*containerAdapter).removeNetworks(0xc820d81780, 0x7f555f0d45d0, 0xc820ce2c00, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/cluster/executor/container/adapter.go:107 +0x10c
github.com/docker/docker/daemon/cluster/executor/container.(*controller).Remove(0xc820d7a9c0, 0x7f555f0d45d0, 0xc820ce2c00, 0x0, 0x0)
    /root/rpmbuild/BUILD/docker-engine/.gopath/src/github.com/docker/docker/daemon/cluster/executor/container/controller.go:348 +0x280
github.com/docker/swarmkit/agent.(*taskManager).run(0xc820d86280, 0x7f555f0d45d0, 0xc820ce2c00)
    /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/task.go:225 +0x13b6
created by github.com/docker/swarmkit/agent.newTaskManager
    /root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/swarmkit/agent/task.go:35 +0x1a6
```

While holding the controller lock, it's stuck trying to acquire the lock on a "serviceBinding".
I'm not sure what that is.
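
To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the lock-ordering (AB/BA) problem the trace suggests. The names mirror the stack trace, but the types and methods are hypothetical simplifications for illustration, not libnetwork's actual code:

```go
// Hypothetical simplification of the suspected AB/BA problem: one path locks the
// controller and then a service, while another path locks the service and then
// the controller. With unlucky timing each goroutine ends up waiting on the lock
// the other already holds.
package main

import "sync"

type service struct{ mu sync.Mutex }

type controller struct {
	mu       sync.Mutex
	services map[string]*service
}

// cleanupServiceBindings mirrors the shape of goroutine 450 above: it holds the
// controller lock and then tries to take a service lock.
func (c *controller) cleanupServiceBindings(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	s := c.services[name]
	s.mu.Lock() // blocks if another goroutine holds the service lock and wants c.mu
	defer s.mu.Unlock()
	// ... remove bindings ...
}

// addServiceBinding is a hypothetical opposite path: service lock first, then
// the controller lock, i.e. the reverse acquisition order.
func (c *controller) addServiceBinding(s *service) {
	s.mu.Lock()
	defer s.mu.Unlock()
	c.mu.Lock() // blocks if cleanupServiceBindings already holds c.mu
	defer c.mu.Unlock()
	// ... register binding ...
}

func main() {
	s := &service{}
	c := &controller{services: map[string]*service{"web": s}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.cleanupServiceBindings("web") }() // controller -> service
	go func() { defer wg.Done(); c.addServiceBinding(s) }()          // service -> controller
	wg.Wait() // with unlucky interleaving this never returns and the runtime reports a deadlock
}
```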

@cpuguy83
Member

By the way, because the network controller is deadlocked, actions like container start/stop hang while holding the container lock, which in turn makes commands like docker ps and docker inspect hang.

@aluzzardi aluzzardi added the priority/P1 Important: P1 issues are a top priority and a must-have for the next release. label Oct 14, 2016
@aluzzardi
Member

Bumping to P1

@mrjana
Contributor

mrjana commented Oct 14, 2016

It is an AB/BA deadlock between the controller lock and the service lock. Pushed a PR to fix it: moby/libnetwork#1507
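
For readers following along, the standard way out of an AB/BA deadlock is to make sure no path holds one of the two locks while waiting for the other (for example, by enforcing a single acquisition order). The sketch below applies that idea to the hypothetical types from the earlier comment; it is not necessarily what moby/libnetwork#1507 actually changes, just an illustration of the general remedy:

```go
// General remedy sketch (assumption: illustrative only, not the actual change in
// moby/libnetwork#1507): the cleanup path copies what it needs under the
// controller lock, releases that lock, and only then takes the service lock, so
// the two locks are never held at the same time on this path.
package main

import (
	"fmt"
	"sync"
)

type service struct{ mu sync.Mutex }

type controller struct {
	mu       sync.Mutex
	services map[string]*service
}

func (c *controller) cleanupServiceBindings(name string) {
	c.mu.Lock()
	s := c.services[name]
	c.mu.Unlock() // drop the controller lock before touching the service lock

	if s == nil {
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	// ... remove bindings ...
}

func (c *controller) addServiceBinding(s *service) {
	s.mu.Lock()
	defer s.mu.Unlock()
	c.mu.Lock() // safe now: nobody waits on the service lock while holding c.mu
	defer c.mu.Unlock()
	// ... register binding ...
}

func main() {
	s := &service{}
	c := &controller{services: map[string]*service{"web": s}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.cleanupServiceBindings("web") }()
	go func() { defer wg.Done(); c.addServiceBinding(s) }()
	wg.Wait() // completes under any interleaving of the two goroutines
	fmt.Println("no deadlock")
}
```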

mrjana added a commit to mrjana/docker that referenced this issue Oct 14, 2016
Fixes moby#27323

Signed-off-by: Jana Radhakrishnan <[email protected]>
@aluzzardi
Member

@mrjana Can you share more details on the likelihood of the deadlock triggering?

Does this warrant a 1.12.3? /cc @vieux @thaJeztah

@mrjana
Contributor

mrjana commented Oct 14, 2016

@aluzzardi It is indeed surprising that nobody hit this problem during the 1.12.2-rc phase, but the likelihood depends entirely on timing: two goroutines each have to have acquired one of the locks while waiting on the other. It is more likely to happen when there are many task failures, which trigger cleanup concurrently while another task of the same service is trying to start. Maybe that is why we haven't hit it: we probably haven't had many task failures in the kind of testing we have done.

@thaJeztah
Member

I created a 1.12.3 milestone for tracking
