
Overloaded swarm node breaks the rest of the cluster #786

Closed
zstarer opened this issue May 14, 2015 · 3 comments

zstarer commented May 14, 2015

Hi guys -

We've been experimenting (A LOT) with swarm lately. One issue we've just discovered is that overloading any particular node in the cluster can break certain functionality expected from the master.

On one node we had ~150 containers running, and we can no longer hit the Docker daemon on that node to see what's running (I think there are just no resources left). However, we can still hit the swarm master to return some data:

ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 169
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 2.734 GiB / 15.69 GiB
ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 35
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 800 MiB / 15.69 GiB
ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 35
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 1.562 GiB / 15.69 GiB
ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 35
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 3.125 GiB / 15.69 GiB
ip-10-0-.ec2.internal: 10.0.:2375
└ Containers: 65
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 2.344 GiB / 15.69 GiB

Running a 'docker ps -a' from the master also returns data. However, we cannot run commands that then impact -any- slave in our pool.

Imagine container1 was running on the node with 169 containers, and container2 is elsewhere.
I could not run 'docker rm -f container1' or 'docker rm -f container2'.
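
Concretely, the commands look roughly like this (a sketch; the addresses below are placeholders for our actual hosts):

# read-only queries through the swarm master still work
docker -H tcp://<master-ip>:2375 info
docker -H tcp://<master-ip>:2375 ps -a

# hitting the overloaded node's daemon directly hangs
docker -H tcp://<overloaded-node-ip>:2375 ps

# but removals through the master hang too, even for container2,
# which lives on a healthy node
docker -H tcp://<master-ip>:2375 rm -f container1
docker -H tcp://<master-ip>:2375 rm -f container2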

Please let me know if there are already open items for anything similar; I'd like to read more about this.

aluzzardi (Contributor) commented:

Thanks for reporting this, @zstarer.

There must be a deadlock somewhere making the Cluster object freeze.

We are going to reproduce the issue in our integration tests so we can fix it and make sure it stays fixed.
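
A reproduction along those lines might look something like this sketch (the manager address, node name, and container count are placeholders, not the final test):

# overload a single node using a scheduling constraint
for i in $(seq 1 170); do
  docker -H tcp://<manager-ip>:2375 run -d -e constraint:node==<overloaded-node> busybox sleep 3600
done

# the bug: a command touching any node hangs instead of completing quickly
timeout 30 docker -H tcp://<manager-ip>:2375 rm -f <container-on-healthy-node> \
  || echo "manager blocked by one overloaded node"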

@aluzzardi aluzzardi added this to the 0.3.0 milestone May 17, 2015
@aluzzardi aluzzardi self-assigned this May 27, 2015
@aluzzardi aluzzardi assigned abronan and unassigned aluzzardi Jun 10, 2015
@aluzzardi aluzzardi modified the milestones: 0.3.0, 0.4.0 Jun 29, 2015
aluzzardi (Contributor) commented:

So this is related to parallel scheduling, which we are going to address in the next release.

abronan (Contributor) commented Oct 13, 2015

This issue should be fixed now with #1261. Closing this one. Thanks, and let us know if you encounter any more issues with master.

@abronan abronan closed this as completed Oct 13, 2015