Currently, the scheduler records resource consumption in one place when a task is assigned to a worker, but checks a different place when deciding whether a task can be assigned to a worker. As a result, current resource consumption levels are not considered in task scheduling.
The current scheduling appears to just consider which workers can run a task in theory: do they have enough of the resource to be able to run this task ever (even if none of it is available right now)?
For resources like GPUs, I suppose this makes sense: queuing extra tasks onto workers keeps them from going idle. Still, it's a little surprising. And the fact that worker_objective doesn't take current resource consumption into account seems likely to cause bad scheduling, since we could easily assign a task to a worker whose resource is currently used up while other workers have the resource available.
When a task gets assigned to a worker, consume_resources only adjusts the count in WorkerState.used_resources (distributed/scheduler.py, lines 2674 to 2675 at e0ea5df).
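As far as I can tell, the bookkeeping at those lines amounts to something like this (a paraphrased sketch, not the verbatim source):

```python
# Paraphrased sketch of SchedulerState.consume_resources (not the verbatim source):
# when task `ts` is assigned to worker `ws`, only the worker's tally of used
# resources is incremented; nothing else is updated.
def consume_resources(self, ts, ws):
    for resource, required in ts.resource_restrictions.items():
        ws.used_resources[resource] += required
```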
But when SchedulerState.valid_workers looks for which workers can run a task, it only checks self.resources[resource][address] and never looks at WorkerState.used_resources (distributed/scheduler.py, lines 2644 to 2652 at e0ea5df).
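The resource filter in that range boils down to roughly the following (again paraphrased; `workers_per_resource` and `valid` are just illustrative names):

```python
# Paraphrased sketch of the resource filter in SchedulerState.valid_workers:
# a worker qualifies if its *total* declared supply covers each requirement;
# WorkerState.used_resources never enters the comparison.
workers_per_resource = []
for resource, required in ts.resource_restrictions.items():
    supply = self.resources[resource]  # {worker address: total amount declared}
    workers_per_resource.append(
        {address for address, supplied in supply.items() if supplied >= required}
    )
valid = set.intersection(*workers_per_resource)
```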
So tasks will not enter the no-worker state just because all resources in the cluster are currently used up.
Instead, as usual, more tasks will get queued onto workers than they can run at once, and each worker itself takes care of only running the correct number of tasks at a time.
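A quick way to observe this on a local cluster (the "FOO" resource name and the sleepy task are purely illustrative):

```python
import time
from dask.distributed import Client, LocalCluster

# Two in-process workers, each declaring one unit of an abstract "FOO" resource.
cluster = LocalCluster(n_workers=2, threads_per_worker=2, processes=False,
                       resources={"FOO": 1})
client = Client(cluster)

def hold(i):
    time.sleep(5)
    return i

# Every task claims the whole "FOO" unit, so each worker can only run one at a time.
futures = [client.submit(hold, i, resources={"FOO": 1}) for i in range(10)]

time.sleep(1)
# All ten tasks are already assigned across the two workers rather than sitting
# in "no-worker"; each worker throttles execution locally.
print(client.processing())
```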
Is this intentional?
Why do we track resource counts in both self.resources and WorkerState.resources?
Why do we bother tracking WorkerState.used_resources if it's never actually used for scheduling decisions?
Note that changing this behavior would likely provide a viable temporary solution for #6360, a very common pain point for many users.

cc @mrocklin @fjetter
FWIW, I'm running into a problem where XGBoost does some weird thread management per node in my cluster: if the sum of the 'njobs' values of all xgboost training tasks assigned to a node is greater than its number of vCPUs, it uses only 1 vCPU in total for all tasks.
Hence I was looking to worker resource management to solve this problem (by making each task require the whole resource of the worker).
The problem stemming from this ticket is that the first task completes by itself using the full compute capacity (all 32 cores), but then all the remaining tasks are run at the same time (without assessing the resources available and queuing them one at a time accordingly), so CPU utilisation drops to 1 of 32 cores.
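A sketch of that kind of setup (placeholder names, toy data, and a placeholder scheduler address; the 32-core workers are assumed to have been started with something like `--resources "XGB=1"`):

```python
# Sketch: each 32-core worker declares one "XGB" slot, and every training task
# claims that whole slot, so a worker should run at most one xgboost task at a time.
from dask.distributed import Client

def train_one(seed):
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(seed)
    X = rng.normal(size=(1000, 10))
    y = rng.integers(0, 2, size=1000)
    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(
        {"nthread": 32, "objective": "binary:logistic"}, dtrain, num_boost_round=10
    )

client = Client("tcp://scheduler:8786")  # placeholder address
futures = [client.submit(train_one, seed, resources={"XGB": 1}) for seed in range(8)]
results = client.gather(futures)
```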