[question] Unbalanced job scheduling #1990
Hey @cleiter, Nomad uses a bin-packing algorithm. This causes it to fill nodes before moving on to others, which is the behavior you are observing. This is to allow more optimal placements of mixed workloads. Currently we do not have controls to allow spread across DCs, but that is something we would like to add in the future. If you must have your tasks spread across the two datacenters you have two options:
Let me know if you have any other questions, and if this answered your question, would you mind closing the issue? Thanks! |
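To make the difference between bin-packing and spreading concrete, here is a minimal sketch of two node-scoring functions. Both formulas and the node values are illustrative assumptions, not taken from this thread and not necessarily what Nomad or Kubernetes compute: `binpack_score` rewards nodes that would be left fuller after a placement, while `spread_score` rewards nodes that would be left emptier.

```python
def binpack_score(free_cpu_frac, free_mem_frac):
    """Illustrative best-fit score: higher when less capacity would remain free."""
    total = 10 ** free_cpu_frac + 10 ** free_mem_frac
    return max(0.0, min(18.0, 20.0 - total))


def spread_score(free_cpu_frac, free_mem_frac):
    """Illustrative least-requested score: higher when more capacity would remain free."""
    return 10.0 * (free_cpu_frac + free_mem_frac) / 2


# Hypothetical nodes: fraction of (CPU, memory) that would remain free
# after placing the new task on that node.
nodes = {"busy": (0.2, 0.3), "idle": (0.9, 0.9)}

best_binpack = max(nodes, key=lambda name: binpack_score(*nodes[name]))
best_spread = max(nodes, key=lambda name: spread_score(*nodes[name]))

print("bin-packing picks:", best_binpack)
print("spread picks:", best_spread)
```

With these assumed numbers the bin-packing score prefers the already busy node and the spread score prefers the idle one, which matches the behavior being discussed.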
Hi @dadgar, thanks for the explanation! But I have to say this behavior is still quite surprising to me. I just don't see how not using the resources available in the cluster, leaving machines totally idle while others run multiple jobs, might be a desired scheduling choice. It seems bad for performance and robustness. The only thing I can think of is that if I ever have a job that needs 100% of the available memory of a node, then it might be good to have a spare one for that. Compare that to Kubernetes (which I've actually never used): https://github.com/kubernetes/kubernetes/blob/master/docs/devel/scheduler_algorithm.md#ranking-the-nodes The priority functions LeastRequestedPriority, BalancedResourceAllocation, and SelectorSpreadPriority seem to be what I'd have expected. What's the rationale for Nomad's behavior? Are there any plans to make the Nomad scheduling algorithm configurable, e.g. by supplying a number of priority functions? |
I've just increased the cpu resource requirements which leads to a more even distribution of jobs. The disadvantage of that is that if one datacenter goes down Nomad might not be able to move jobs to other nodes because of these constraints. 😕 |
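The trade-off described here can be checked with simple arithmetic. The sketch below is a simplification under stated assumptions: all tasks are identical, only CPU is considered, constraints such as distinct_hosts are ignored, and the 4000 MHz node size and the task count of 12 are hypothetical values chosen for illustration. Only the layout of 3 datacenters with 2 clients each comes from the thread.

```python
def fits_after_dc_loss(num_dcs, nodes_per_dc, node_cpu_mhz, task_cpu_mhz, task_count):
    """Return True if all tasks still fit on the surviving nodes after one DC is lost.

    Tasks are identical, so each surviving node can hold
    floor(node_cpu_mhz / task_cpu_mhz) of them.
    """
    surviving_nodes = (num_dcs - 1) * nodes_per_dc
    tasks_per_node = node_cpu_mhz // task_cpu_mhz
    return surviving_nodes * tasks_per_node >= task_count


# 3 datacenters with 2 clients each (from the thread); node size and task count are assumed.
small_ok = fits_after_dc_loss(3, 2, 4000, 500, 12)   # modest CPU reservation
large_ok = fits_after_dc_loss(3, 2, 4000, 1500, 12)  # inflated CPU reservation

print("500 MHz reservations survive a DC outage:", small_ok)
print("1500 MHz reservations survive a DC outage:", large_ok)
```

In this example the inflated reservation allows only 2 tasks per node, so the 12 tasks spread evenly across all 6 nodes, but the 4 surviving nodes cannot hold them once a datacenter is gone.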
@cleiter I think there should be a spread between DCs; I am not disagreeing there! In terms of the link you posted, we do SelectorSpreadPriority and bin-packing. This is a well-studied part of schedulers, and spread is among the worst choices for utilization. The Borg paper and its references are great reads if you are interested: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf |
@dadgar Thanks for the link, I read the parts regarding scheduling. I think it depends on the point of view. If you have a more local view, it makes sense to run as many services as possible on a single node. It's efficient regarding the total number of nodes required, and that's obviously better utilization of a single machine. The other perspective is when you have a fixed number of nodes and want to spread the jobs across all those nodes. This seems more efficient performance-wise: services can use more CPU power than they are guaranteed and there's less concurrent I/O. Another advantage is that if one node goes down, there are fewer services to move to other nodes. I would say that's better utilization of the whole cluster. The second scenario is what I have, and I imagine it is not that uncommon. To exaggerate a bit, let's say you have 10 nodes which are capable of running 10 services each. When you deploy 10 services, would you rather have them all on a single node or a single service running on each node? :) Don't get me wrong, I really like Nomad and I think the current approach makes sense, but the second scenario seems to be a plausible use case as well. I'm not a scheduling theory expert, this is just what I intuitively expected, so feel free to convince me otherwise. :) I think I can use datacenter constraints so that each service is deployed exactly once to each datacenter to get the behavior I imagine (performance, resilience, utilization), but I would be really happy if Nomad offered an option to prioritize scheduling to the nodes with the lowest current workload. |
Again, I want to be clear that I agree the DC behavior is unexpected and should be addressed. The reason you don't want lowest-current-workload scheduling is what happens when you have 1 service on each of the ten nodes and then a new job is submitted that requires all the resources of one of the nodes! You can't schedule it, whereas with any amount of bin-packing you would have many free nodes. You should reserve resources for your tasks so they meet their SLO independent of how many other tasks are on the same machine. |
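The ten-node scenario from the last two comments can be reproduced with a small simulation. This is a toy model, not Nomad's scheduler: each node has a single abstract capacity of 10 units, every service needs 1 unit, and the `place` and `simulate` helpers are hypothetical names introduced only for this sketch.

```python
def place(nodes_free, demand, strategy):
    """Place one task and return the chosen node index, or None if nothing fits."""
    candidates = [i for i, free in enumerate(nodes_free) if free >= demand]
    if not candidates:
        return None
    if strategy == "binpack":
        # Pick the fullest node that still fits the task.
        chosen = min(candidates, key=lambda i: nodes_free[i])
    else:
        # "spread": pick the emptiest node.
        chosen = max(candidates, key=lambda i: nodes_free[i])
    nodes_free[chosen] -= demand
    return chosen


def simulate(strategy):
    """Deploy 10 small services on 10 nodes, then try to place one node-sized job."""
    nodes_free = [10] * 10
    for _ in range(10):
        place(nodes_free, 1, strategy)
    big_job_node = place(nodes_free, 10, strategy)
    return nodes_free, big_job_node


_, big_binpack = simulate("binpack")
_, big_spread = simulate("spread")

print("bin-packing places the node-sized job on node:", big_binpack)
print("spread places the node-sized job on node:", big_spread)
```

Under bin-packing the ten services share one node and the large job lands on a free node; under spread every node already runs one service, so the large job cannot be placed without moving or preempting something.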
This depends on the usage. In a microservice architecture I know I won't have jobs which need 100% of the resources of a single node and I would be happier if 1/3 of the machines I'm paying for aren't idle all the time. What should happen if you schedule a job which needs more resources than are currently available on a single machine is described in the paper you linked above:
I would still argue for a configuration option so that every user can tweak the node prioritization to their needs. Feel free to close this issue, I opted for the DC constraints and now the usage is more balanced. :) |
Sure! Hope this didn't come off as an argument! I was simply discussing it with you :) Yeah, once Nomad has preemption we can reassess some of the scheduling decisions! I am looking forward to that :) Thanks! |
Yeah, sure. :) Overall I'm really happy with Nomad and I appreciate the work you put into it! It's still a young project so I'm sure there will be many improvements in the future. |
For workloads like databases, migration is very costly, and I do not know at first how many resources they will use. |
I'm running Nomad 0.5.0-rc1 (same behavior was with 0.4.1) with Consul 0.7.0 and I have deployed Nomad/Consul in 3 datacenters, with 1 server and 2 clients in each datacenter. All client and server nodes are of the same instance type and are configured identically.
My jobs are docker jobs with 2 instances and have exactly 1 group and 1 task with distinct_hosts = true and datacenters = [ "eu-west-1a", "eu-west-1b", "eu-west-1c" ].
The current allocations look like this:
No matter how often I redeploy jobs Nomad seems to dislike datacenter B for some reason. When I set the instance count to say 5 it will actually deploy jobs to nodes in datacenter B which means they are working correctly. When I switch back to 2 instances it will remove them from B and those nodes are idle again. So these machines are idle all the time while other machines run 5-6 services.
What's also a bit concerning is that now jobs are actually running twice in the same datacenter (on worker 0 and 1) while I'd like Nomad to distribute them across datacenters, and as far as I understood the documentation it should.
I wasn't able to find anything interesting in the logs and I don't know a way to see Nomad's reasoning for job allocation. Is this expected behavior or is there something I could do to have the jobs more evenly distributed? Any help is appreciated.