This repository has been archived by the owner on Feb 1, 2021. It is now read-only.

WIP: Remove the lock at the scheduler level #1212

Closed
wants to merge 1 commit

Conversation

@abronan (Contributor) commented Sep 15, 2015

fixes #1194 #786

This PR removes the Lock at the scheduler level and moves it to node.Node for better granularity. cluster now maintains a list of Nodes whose resource accounting is protected by a Lock, which avoids races on resource reservation. Processes that fail to get a resource slice on a machine that fills up in the meantime retry the scheduling process until they get the resources they need (scheduling fails only if no more resources are available cluster-wide).
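To make the idea concrete, here is a minimal sketch in Go of per-node accounting plus a retry loop; Node, reserve, and scheduleWithRetry are illustrative names with simplified fields, not the actual code in this branch:

    package scheduler

    import (
        "errors"
        "sync"
    )

    // Node carries its own lock and resource counters, so reservations on
    // different nodes no longer contend on a single scheduler-wide lock.
    type Node struct {
        sync.Mutex
        TotalMemory int64
        UsedMemory  int64
        TotalCpus   int64
        UsedCpus    int64
    }

    // reserve atomically reserves memory/cpus on this node, refusing the
    // request if the node would be over-committed.
    func (n *Node) reserve(memory, cpus int64) bool {
        n.Lock()
        defer n.Unlock()
        if n.UsedMemory+memory > n.TotalMemory || n.UsedCpus+cpus > n.TotalCpus {
            return false
        }
        n.UsedMemory += memory
        n.UsedCpus += cpus
        return true
    }

    // scheduleWithRetry keeps trying candidate nodes: a process that loses the
    // race on one node simply retries, and the call only fails once no node in
    // the cluster can hold the requested resources.
    func scheduleWithRetry(nodes []*Node, memory, cpus int64) (*Node, error) {
        for {
            anyFits := false
            for _, n := range nodes {
                n.Lock()
                fits := n.UsedMemory+memory <= n.TotalMemory && n.UsedCpus+cpus <= n.TotalCpus
                n.Unlock()
                if !fits {
                    continue
                }
                anyFits = true
                if n.reserve(memory, cpus) {
                    return n, nil
                }
            }
            if !anyFits {
                return nil, errors.New("no node with enough resources available cluster-wide")
            }
        }
    }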

I think this is the simplest approach to remove the lock at the scheduler level for now and avoid the races. Note that it does not solve the naming issue (we should settle whether we allow duplicate names and resolve the ambiguity through the node-name prefix, or whether each container must have a unique name).

Downside: the blocked-pull problem still persists, so when it happens some resources stay reserved for a while, because we have no way to tell whether it is that issue or just a very slow pull.

TODO items:

  • Integration tests
  • Take a look at the implications on strategies and constraints, and parallel create/remove/build

Signed-off-by: Alexandre Beslic [email protected]

…e' level, resource reservation now goes through a maintained list of 'Nodes' with their associated resource status

Signed-off-by: Alexandre Beslic <[email protected]>
@abronan (Contributor, Author) commented Sep 15, 2015

CI is failing on all the constraints/affinity tests, which kind of makes sense. This solves the simple parallel-create issue on resource constraints but impacts the rest. I should look into it.

@abronan (Contributor, Author) commented Sep 16, 2015

The failure is obvious: I pass a list of Nodes stripped of their containers and images to SelectNodeForContainer, so it fails on constraints and affinities. It works fine otherwise. I will fix listNodes to statically account for used resources while including the actual state of existing Containers and Images.
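As a rough illustration of that fix (the types and fields below are simplified stand-ins, not the branch's actual structures), listNodes would hand the filters a snapshot of each node that keeps its real containers and images while UsedMemory/UsedCpus carry the statically reserved amounts:

    package scheduler

    import "sync"

    // Stand-in node type; the real one carries richer Container/Image structs.
    type Node struct {
        sync.Mutex
        Name        string
        Containers  []string
        Images      []string
        TotalMemory int64
        UsedMemory  int64
        TotalCpus   int64
        UsedCpus    int64
    }

    type Cluster struct {
        sync.RWMutex
        nodes map[string]*Node
    }

    // listNodes returns a snapshot of every node that keeps the actual
    // containers and images (so constraint and affinity filters still see the
    // real state) while UsedMemory/UsedCpus reflect static reservations.
    func (c *Cluster) listNodes() []*Node {
        c.RLock()
        defer c.RUnlock()

        out := make([]*Node, 0, len(c.nodes))
        for _, n := range c.nodes {
            n.Lock()
            out = append(out, &Node{
                Name:        n.Name,
                Containers:  append([]string(nil), n.Containers...),
                Images:      append([]string(nil), n.Images...),
                TotalMemory: n.TotalMemory,
                UsedMemory:  n.UsedMemory,
                TotalCpus:   n.TotalCpus,
                UsedCpus:    n.UsedCpus,
            })
            n.Unlock()
        }
        return out
    }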

    if config != nil {
        memory := config.Memory
        cpus := config.CpuShares
        n.UsedMemory = n.UsedMemory - memory
Contributor:

I think you should check that this is >= 0, otherwise something really wrong happened.

Contributor Author (@abronan):

Agreed 👍
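For illustration, the guarded release could look roughly like this; Node, ContainerConfig, and the logging call are simplified stand-ins rather than the actual code:

    package scheduler

    import "log"

    type ContainerConfig struct {
        Memory    int64
        CpuShares int64
    }

    type Node struct {
        Name       string
        UsedMemory int64
        UsedCpus   int64
    }

    // releaseResources gives a container's reservation back to the node and
    // refuses to let the counters go negative, as suggested above: an
    // underflow means the accounting is already broken, so surface it loudly.
    func (n *Node) releaseResources(config *ContainerConfig) {
        if config == nil {
            return
        }
        if n.UsedMemory < config.Memory || n.UsedCpus < config.CpuShares {
            log.Printf("resource accounting underflow on node %s", n.Name)
            return
        }
        n.UsedMemory -= config.Memory
        n.UsedCpus -= config.CpuShares
    }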

@abronan (Contributor, Author) commented Oct 9, 2015

Closing this one in favor of #1261. We may revisit it for 0.6, but until then the solution in #1261 plays nicer with scheduler plugins, while this one could have ended up with some side effects.

@abronan abronan closed this Oct 9, 2015
@abronan abronan deleted the fix_scheduler branch August 19, 2016 21:32

Successfully merging this pull request may close these issues.

Swarm has deadlock with the scheduler
3 participants