
Overloaded swarm node breaks the rest of the cluster #786

Closed
zstarer opened this issue May 14, 2015 · 3 comments

zstarer commented May 14, 2015

Hi guys -

We've been experimenting (A LOT) with swarm lately. One issue we've just discovered is that overloading any particular node in the cluster can break certain functionality expected from the master.

On one node we had ~150 containers running, and we can no longer hit the Docker daemon on that node to see what's running (I think there are just no resources left). However, we can still hit the swarm master to return some data:

ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 169
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 2.734 GiB / 15.69 GiB
ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 35
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 800 MiB / 15.69 GiB
ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 35
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 1.562 GiB / 15.69 GiB
ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 35
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 3.125 GiB / 15.69 GiB
ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 65
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 2.344 GiB / 15.69 GiB

Running a 'docker ps -a' from the master also returns data. However, we cannot run commands that then impact -any- slave in our pool.

Imagine container1 was running on the node with 169 containers, and container2 is elsewhere.
I could not run 'docker rm -f container1' or 'docker rm -f container2'.
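
Concretely, the commands look roughly like this (a sketch; the addresses below are placeholders for our actual hosts):

# read-only queries through the swarm master still work
docker -H tcp://<master-ip>:2375 info
docker -H tcp://<master-ip>:2375 ps -a

# hitting the overloaded node's daemon directly hangs
docker -H tcp://<overloaded-node-ip>:2375 ps

# but removals through the master hang too, even for container2,
# which lives on a healthy node
docker -H tcp://<master-ip>:2375 rm -f container1
docker -H tcp://<master-ip>:2375 rm -f container2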

Please let me know if there are already open items for anything similar; I'd like to read more about this.

aluzzardi (Contributor) commented:

Thanks for reporting this, @zstarer.

There must be a deadlock somewhere making the Cluster object freeze.

We are going to reproduce the issue in our integration tests so we can fix it and make sure it stays fixed.
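
A reproduction along those lines might look something like this sketch (the manager address, node name, and container count are placeholders, not the final test):

# overload a single node using a scheduling constraint
for i in $(seq 1 170); do
  docker -H tcp://<manager-ip>:2375 run -d -e constraint:node==<overloaded-node> busybox sleep 3600
done

# the bug: a command touching any node hangs instead of completing quickly
timeout 30 docker -H tcp://<manager-ip>:2375 rm -f <container-on-healthy-node> \
  || echo "manager blocked by one overloaded node"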

@aluzzardi aluzzardi added this to the 0.3.0 milestone May 17, 2015
@aluzzardi aluzzardi self-assigned this May 27, 2015
@aluzzardi aluzzardi assigned abronan and unassigned aluzzardi Jun 10, 2015
@aluzzardi aluzzardi modified the milestones: 0.3.0, 0.4.0 Jun 29, 2015
aluzzardi (Contributor) commented:

So this is related to parallel scheduling, which we are going to address in the next release.

abronan (Contributor) commented Oct 13, 2015

This issue should be fixed now with #1261. Closing this one. Thanks, and let us know if you encounter any more issues with master.

@abronan abronan closed this as completed Oct 13, 2015