Node downed through repeated promote/demotes #33580

Closed · dperny opened this issue Jun 7, 2017 · 7 comments

Labels: area/swarm, kind/bug, version/17.06

Comments


dperny commented Jun 7, 2017

Description

Repeatedly promoting and demoting a node has put it into a Down and Unreachable state. Running docker info hangs, but docker ps works, which likely indicates a deadlock in the Cluster subcomponent.
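
A quick way to confirm that state from the shell (a small probe sketch using the coreutils timeout wrapper; nothing here is from the original report):

# docker ps answers normally while docker info blocks; bound the hang.
timeout 10 docker info >/dev/null || echo "docker info hung"
docker ps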

Steps to reproduce the issue:

  1. Rapidly promote 2 workers on a 3-node cluster.
  2. Demote those same 2 nodes.
  3. Repeat until one of the nodes becomes Down/Unreachable (a reproduction sketch follows below).
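
For reference, a minimal reproduction sketch, assuming the node names from the docker node ls output later in this thread (dperny-linux-0 and dperny-linux-1 as the two workers, with the loop run from the leader); adjust the hostnames to your own cluster:

#!/bin/sh
# Hammer two workers with back-to-back promote/demote cycles until one
# of them wedges (node names are assumptions, not part of the report).
while true; do
    docker node promote dperny-linux-0 dperny-linux-1
    docker node demote dperny-linux-0 dperny-linux-1
    docker node ls    # stop once a node shows Down / Unreachable
done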

Output of docker version:

root@dperny-linux-0:/home/ubuntu# docker version
Client:
 Version:      17.06.0-ce-rc2
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   402dd4a
 Built:        Wed Jun  7 10:04:47 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce-rc2
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   402dd4a
 Built:        Wed Jun  7 10:03:40 2017
 OS/Arch:      linux/amd64

Output of docker info:

N/A, docker info hangs.

Additional environment details (AWS, VirtualBox, physical, etc.):
3 Node cluster on AWS, t2.micro instances.


dperny commented Jun 7, 2017

I kicked over the docker daemon, and this happened:

root@dperny-linux-0:/home/ubuntu# docker info
Containers: 1
 Running: 0
 Paused: 0
 Stopped: 1
Images: 2
Server Version: 17.06.0-ce-rc2
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 18
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: error
 NodeID:
 Error: manager stopped: can't initialize raft node: rpc error: code = 6 desc = a raft member with this node ID already exists
 Is Manager: false
 Node Address: 172.31.39.17
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 3addd840653146c90a254301d6c3a663c7fd6429
runc version: 992a5be178a62e026f4069f443c6164912adbf09
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-1013-aws
Operating System: Ubuntu 16.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 990.7MiB
Name: dperny-linux-0
ID: PJFG:CLMD:GOUY:M7FT:M46F:OEWQ:CYV3:YT4C:DB32:NMKK:GN6A:HCAV
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 18
 Goroutines: 31
 System Time: 2017-06-07T23:23:37.247023606Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

And on the "primary" manager:

root@dperny-linux-2:/home/ubuntu# docker node ls
ID                            HOSTNAME            STATUS              AVAILABILITY        MANAGER STATUS
1m1fzcfli4dqyoadzyfkxz27f     dperny-linux-1      Ready               Active              Reachable
v3epwq0v2i5qyym2u0d0jkouo     dperny-linux-0      Down                Active              Unreachable
yxhx57tfvj5ycgb6xfw82ndvi *   dperny-linux-2      Ready               Active              Leader

thaJeztah added the area/swarm and kind/bug labels on Jun 8, 2017
thaJeztah commented:

/cc @aaronlehmann @aluzzardi


aaronlehmann commented Jun 8, 2017 via email


dperny commented Jun 8, 2017

Yes, it's a promotion that's failing. Sorry that I was unclear.

aaronlehmann commented:

I think the issue here is that we made demote async recently (it sounds weird, but it actually makes things way more solid). However, if you demote/promote rapidly, it's possible to promote a node before the demotion has actually gone through, and you would get the "a raft member with this node ID already exists" error that you're seeing in docker info.
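
A hedged workaround sketch that follows from this explanation: wait for the demotion to actually take effect before promoting again, instead of issuing the two commands back to back. This assumes ManagerStatus is only cleared once the node has really left the raft member list (dperny-linux-0 is just an example name):

# Demote, then poll until the node stops reporting a ManagerStatus
# before promoting it again.
docker node demote dperny-linux-0
until [ -z "$(docker node inspect -f '{{ if .ManagerStatus }}manager{{ end }}' dperny-linux-0)" ]; do
    sleep 1
done
docker node promote dperny-linux-0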

Apparently we aren't handling this situation gracefully. The fact that docker info is hanging indeed suggests a deadlock. If you can reproduce this, it would be useful to grab a stack trace with SIGUSR1.
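
For reference, one way to capture that trace (a sketch assuming a systemd host; dockerd dumps all goroutine stacks to the daemon log when it receives SIGUSR1):

# Ask the daemon for a goroutine dump, then read it from the journal.
kill -SIGUSR1 "$(pidof dockerd)"
journalctl -u docker.service --no-pager | tail -n 200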

BTW, I believe this would no longer be an issue with moby/swarmkit#2198. All the more reason to get that merged :)

aaronlehmann commented:

> Apparently we aren't handling this situation gracefully. The fact that docker info is hanging indeed suggests a deadlock. If you can reproduce this, it would be useful to grab a stack trace with SIGUSR1.

Did some debugging on this, and it looks like the issue is the one fixed by moby/swarmkit#2203.

However, moby/swarmkit#2198 will probably be necessary to completely resolve the issue.

sam-thibault commented:

The mentioned fixes were merged. I will mark this complete.
