Node downed through repeated promote/demotes #33580

Closed · dperny opened this issue Jun 7, 2017 · 7 comments

Labels: area/swarm, kind/bug, version/17.06

Comments


dperny commented Jun 7, 2017

Description

Repeatedly promoting and demoting a node has put it into a Down and Unreachable state. Running docker info hangs, but docker ps works, which likely indicates a deadlock in the Cluster subcomponent.
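
A quick way to confirm that state from the shell (a small probe sketch using the coreutils timeout wrapper; nothing here is from the original report):

# docker ps answers normally while docker info blocks; bound the hang.
timeout 10 docker info >/dev/null || echo "docker info hung"
docker ps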

Steps to reproduce the issue:

  1. Rapidly promote 2 workers on a 3-node cluster.
  2. Demote those same 2 nodes.
  3. Repeat until one of the nodes becomes Down/Unreachable (a reproduction sketch follows below).
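
For reference, a minimal reproduction sketch, assuming the node names from the docker node ls output later in this thread (dperny-linux-0 and dperny-linux-1 as the two workers, with the loop run from the leader); adjust the hostnames to your own cluster:

#!/bin/sh
# Hammer two workers with back-to-back promote/demote cycles until one
# of them wedges (node names are assumptions, not part of the report).
while true; do
    docker node promote dperny-linux-0 dperny-linux-1
    docker node demote dperny-linux-0 dperny-linux-1
    docker node ls    # stop once a node shows Down / Unreachable
done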

Output of docker version:

root@dperny-linux-0:/home/ubuntu# docker version
Client:
 Version:      17.06.0-ce-rc2
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   402dd4a
 Built:        Wed Jun  7 10:04:47 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce-rc2
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   402dd4a
 Built:        Wed Jun  7 10:03:40 2017
 OS/Arch:      linux/amd64

Output of docker info:

N/A, docker info hangs.

Additional environment details (AWS, VirtualBox, physical, etc.):
3 Node cluster on AWS, t2.micro instances.


dperny commented Jun 7, 2017

I kicked over the docker daemon, and this happened:

root@dperny-linux-0:/home/ubuntu# docker info
Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 2
Server Version: 17.06.0-ce-rc2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 18
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: error
 NodeID:
 Error: manager stopped: can't initialize raft node: rpc error: code = 6 desc = a raft member with this node ID already exists
 Is Manager: false
 Node Address: 172.31.39.17
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 992a5be178a62e026f4069f443c6164912adbf09
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1013-aws
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 990.7MiB
Name: dperny-linux-0
ID: PJFG:CLMD:GOUY:M7FT:M46F:OEWQ:CYV3:YT4C:DB32:NMKK:GN6A:HCAV
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 18
 Goroutines: 31
 System Time: 2017-06-07T23:23:37.247023606Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

And on the "primary" manager:

root@dperny-linux-2:/home/ubuntu# docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
1m1fzcfli4dqyoadzyfkxz27f     dperny-linux-1      Ready               Active              Reachable
v3epwq0v2i5qyym2u0d0jkouo     dperny-linux-0      Down                Active              Unreachable
yxhx57tfvj5ycgb6xfw82ndvi *   dperny-linux-2      Ready               Active              Leader

thaJeztah added the area/swarm and kind/bug labels on Jun 8, 2017
thaJeztah commented:

/cc @aaronlehmann @aluzzardi


aaronlehmann commented Jun 8, 2017 via email


dperny commented Jun 8, 2017

Yes, it's a promotion that's failing. Sorry that I was unclear.

aaronlehmann commented:

I think the issue here is that we made demote async recently (it sounds weird, but it actually makes things way more solid). However, if you demote/promote rapidly, it's possible to promote a node before the demotion has actually gone through, and you would get the "a raft member with this node ID already exists" error that you're seeing in docker info.
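
A hedged workaround sketch that follows from this explanation: wait for the demotion to actually take effect before promoting again, instead of issuing the two commands back to back. This assumes ManagerStatus is only cleared once the node has really left the raft member list (dperny-linux-0 is just an example name):

# Demote, then poll until the node stops reporting a ManagerStatus
# before promoting it again.
docker node demote dperny-linux-0
until [ -z "$(docker node inspect -f '{{ if .ManagerStatus }}manager{{ end }}' dperny-linux-0)" ]; do
    sleep 1
done
docker node promote dperny-linux-0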

Apparently we aren't handling this situation gracefully. The fact that docker info is hanging indeed suggests a deadlock. If you can reproduce this, it would be useful to grab a stack trace with SIGUSR1.
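
For reference, one way to capture that trace (a sketch assuming a systemd host; dockerd dumps all goroutine stacks to the daemon log when it receives SIGUSR1):

# Ask the daemon for a goroutine dump, then read it from the journal.
kill -SIGUSR1 "$(pidof dockerd)"
journalctl -u docker.service --no-pager | tail -n 200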

BTW, I believe this would no longer be an issue with moby/swarmkit#2198. All the more reason to get that merged :)

aaronlehmann commented:

> Apparently we aren't handling this situation gracefully. The fact that docker info is hanging indeed suggests a deadlock. If you can reproduce this, it would be useful to grab a stack trace with SIGUSR1.

Did some debugging on this, and it looks like the issue is the one fixed by moby/swarmkit#2203.

However, moby/swarmkit#2198 will probably be necessary to completely resolve the issue.

sam-thibault commented:

The mentioned fixes were merged. I will mark this complete.
