This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

WIP: Remove the lock at the scheduler level #1212

Closed
wants to merge 1 commit

Conversation

@abronan (Contributor) commented Sep 15, 2015

fixes #1194 #786

This PR removes the Lock at the scheduler level and moves it to node.Node for better granularity. cluster now maintains a list of Nodes whose resource accounting is protected by a Lock, which avoids races on resource reservation. Processes that fail to get a resource slice on a machine that fills up in the meantime retry the scheduling process until they get the resources they need (scheduling fails only if no more resources are available cluster-wide).
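To make the idea concrete, here is a minimal sketch in Go of per-node accounting plus a retry loop; Node, reserve, and scheduleWithRetry are illustrative names with simplified fields, not the actual code in this branch:

    package scheduler

    import (
        "errors"
        "sync"
    )

    // Node carries its own lock and resource counters, so reservations on
    // different nodes no longer contend on a single scheduler-wide lock.
    type Node struct {
        sync.Mutex
        TotalMemory int64
        UsedMemory  int64
        TotalCpus   int64
        UsedCpus    int64
    }

    // reserve atomically reserves memory/cpus on this node, refusing the
    // request if the node would be over-committed.
    func (n *Node) reserve(memory, cpus int64) bool {
        n.Lock()
        defer n.Unlock()
        if n.UsedMemory+memory > n.TotalMemory || n.UsedCpus+cpus > n.TotalCpus {
            return false
        }
        n.UsedMemory += memory
        n.UsedCpus += cpus
        return true
    }

    // scheduleWithRetry keeps trying candidate nodes: a process that loses the
    // race on one node simply retries, and the call only fails once no node in
    // the cluster can hold the requested resources.
    func scheduleWithRetry(nodes []*Node, memory, cpus int64) (*Node, error) {
        for {
            anyFits := false
            for _, n := range nodes {
                n.Lock()
                fits := n.UsedMemory+memory <= n.TotalMemory && n.UsedCpus+cpus <= n.TotalCpus
                n.Unlock()
                if !fits {
                    continue
                }
                anyFits = true
                if n.reserve(memory, cpus) {
                    return n, nil
                }
            }
            if !anyFits {
                return nil, errors.New("no node with enough resources available cluster-wide")
            }
        }
    }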

I think this is the simplest approach to remove the lock at the scheduler level for now and avoid the races. Note that it does not solve the naming issue (we should settle whether we allow duplicate names and resolve the ambiguity through the node-name prefix, or whether each container must have a unique name).

Downside: the blocked-pull problem still persists, so when it happens some resources stay reserved for a while, because we have no way to tell whether it is that issue or just a very slow pull.

TODO items:

  • Integration tests
  • Take a look at the implications on strategies and constraints, and parallel create/remove/build

Signed-off-by: Alexandre Beslic [email protected]

…e' level, resource reservation now goes through a maintained list of 'Nodes' with their associated resource status

Signed-off-by: Alexandre Beslic <[email protected]>
@abronan (Contributor, Author) commented Sep 15, 2015

CI is failing on all the constraints/affinity tests, which kind of makes sense. This solves the simple parallel-create issue on resource constraints but impacts the rest. I should look into it.

@abronan (Contributor, Author) commented Sep 16, 2015

The failure is obvious: I pass a list of Nodes stripped of their containers and images to SelectNodeForContainer, so it fails on constraints and affinities. It works fine otherwise. I will fix listNodes to statically account for used resources while including the actual state of existing Containers and Images.
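As a rough illustration of that fix (the types and fields below are simplified stand-ins, not the branch's actual structures), listNodes would hand the filters a snapshot of each node that keeps its real containers and images while UsedMemory/UsedCpus carry the statically reserved amounts:

    package scheduler

    import "sync"

    // Stand-in node type; the real one carries richer Container/Image structs.
    type Node struct {
        sync.Mutex
        Name        string
        Containers  []string
        Images      []string
        TotalMemory int64
        UsedMemory  int64
        TotalCpus   int64
        UsedCpus    int64
    }

    type Cluster struct {
        sync.RWMutex
        nodes map[string]*Node
    }

    // listNodes returns a snapshot of every node that keeps the actual
    // containers and images (so constraint and affinity filters still see the
    // real state) while UsedMemory/UsedCpus reflect static reservations.
    func (c *Cluster) listNodes() []*Node {
        c.RLock()
        defer c.RUnlock()

        out := make([]*Node, 0, len(c.nodes))
        for _, n := range c.nodes {
            n.Lock()
            out = append(out, &Node{
                Name:        n.Name,
                Containers:  append([]string(nil), n.Containers...),
                Images:      append([]string(nil), n.Images...),
                TotalMemory: n.TotalMemory,
                UsedMemory:  n.UsedMemory,
                TotalCpus:   n.TotalCpus,
                UsedCpus:    n.UsedCpus,
            })
            n.Unlock()
        }
        return out
    }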

    if config != nil {
        memory := config.Memory
        cpus := config.CpuShares
        n.UsedMemory = n.UsedMemory - memory
Contributor:

I think you should check that this is >= 0, otherwise something really wrong happened.

Contributor Author (@abronan):

Agreed 👍
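For illustration, the guarded release could look roughly like this; Node, ContainerConfig, and the logging call are simplified stand-ins rather than the actual code:

    package scheduler

    import "log"

    type ContainerConfig struct {
        Memory    int64
        CpuShares int64
    }

    type Node struct {
        Name       string
        UsedMemory int64
        UsedCpus   int64
    }

    // releaseResources gives a container's reservation back to the node and
    // refuses to let the counters go negative, as suggested above: an
    // underflow means the accounting is already broken, so surface it loudly.
    func (n *Node) releaseResources(config *ContainerConfig) {
        if config == nil {
            return
        }
        if n.UsedMemory < config.Memory || n.UsedCpus < config.CpuShares {
            log.Printf("resource accounting underflow on node %s", n.Name)
            return
        }
        n.UsedMemory -= config.Memory
        n.UsedCpus -= config.CpuShares
    }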

@abronan (Contributor, Author) commented Oct 9, 2015

Closing this one in favor of #1261. We may revisit it for 0.6, but until then the solution in #1261 plays nicer with scheduler plugins, while this one could have ended up with some side effects.

@abronan abronan closed this Oct 9, 2015
@abronan abronan deleted the fix_scheduler branch August 19, 2016 21:32

Successfully merging this pull request may close these issues.

Swarm has deadlock with the scheduler
3 participants