Differentiate between compute and network based occupancy #7004
Labels
diagnostics
discussion
Discussing a topic with no specific actions yet
enhancement
Improve existing functionality or make things work better
performance
scheduler
scheduling
stealing
Occupancy is an estimate of the work the scheduler has assigned to each worker. We compute this value in `Scheduler._set_duration_estimate`, which is invoked in a couple of places, among them `_reevaluate_occupancy_worker` (periodically, if scheduler CPU load allows it).

Occupancy is measured in seconds and is calculated by summing the expected processing time of all tasks assigned to a worker. At all times, the invariant `sum(ws.processing.values()) ~ ws.occupancy` should hold (modulo floating point arithmetic errors).

This processing time is defined as `TaskPrefix.duration_average + get_comm_cost(TaskState, WorkerState)`, i.e. the average compute duration of the `TaskPrefix` (see `Scheduler.get_task_duration`) plus the estimated time to transfer all dependencies that are not yet on that worker (see `Scheduler.get_comm_cost`).
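The definition above can be illustrated with a small, self-contained sketch. The classes `ToyTask` and `ToyWorker`, the helpers `comm_cost` and `assign`, and the fixed `bandwidth` value are hypothetical stand-ins rather than the scheduler's actual data structures. Only the formula (average duration plus transfer cost of missing dependencies) and the invariant are taken from the description.

```python
from dataclasses import dataclass, field
import math


@dataclass
class ToyTask:
    key: str
    duration_average: float  # average compute time in seconds
    dependencies: dict = field(default_factory=dict)  # dependency key -> nbytes


@dataclass
class ToyWorker:
    has_what: set = field(default_factory=set)  # keys already on this worker
    processing: dict = field(default_factory=dict)  # task key -> estimated seconds
    occupancy: float = 0.0


def comm_cost(task, worker, bandwidth):
    """Estimated seconds to fetch dependencies that are not yet on the worker."""
    missing_bytes = sum(
        nbytes for dep, nbytes in task.dependencies.items()
        if dep not in worker.has_what
    )
    return missing_bytes / bandwidth


def assign(task, worker, bandwidth=100e6):
    """Record the task's estimate and keep occupancy in sync with it."""
    estimate = task.duration_average + comm_cost(task, worker, bandwidth)
    worker.processing[task.key] = estimate
    worker.occupancy += estimate
    return estimate


worker = ToyWorker(has_what={"x"})
assign(ToyTask("a", 1.0, {"x": 50_000_000}), worker)  # dependency is local
assign(ToyTask("b", 2.0, {"y": 50_000_000}), worker)  # 0.5 s of transfer added

# The invariant from the text: sum(ws.processing.values()) ~ ws.occupancy
assert math.isclose(sum(worker.processing.values()), worker.occupancy)
```

Note that in this toy model the compute and network parts are already known separately at assignment time, but only their sum is stored.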
Occupancy is used for four purposes:
- `Scheduler.total_occupancy` (sum over all workers) is used to define an adaptive target
- `Scheduler.total_occupancy` (sum over all workers) is used to estimate worker saturation
- `WorkerState.processing` to calculate the steal_time ratio in work stealing
- `WorkerState.occupancy` for making a scheduling decision in `Scheduler.worker_objective`
With the exception of the work stealing case, all of these uses relate occupancy specifically to the number of worker threads. Worker threads, however, do not affect network / gather-data performance.
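To make the thread-count point concrete, here is a minimal numeric sketch. Both functions and their names are hypothetical, and the split variant rests on a simplifying assumption that is not stated in this issue: transfers are modelled as not speeding up with additional worker threads at all.

```python
def start_time_combined(occupancy, nthreads):
    """Divide the whole occupancy (compute + network) by the thread count."""
    return occupancy / nthreads


def start_time_split(compute_occupancy, network_occupancy, nthreads):
    """Divide only the compute share by the thread count (assumption)."""
    return compute_occupancy / nthreads + network_occupancy


compute, network, nthreads = 8.0, 4.0, 4  # seconds, seconds, threads

combined = start_time_combined(compute + network, nthreads)
split = start_time_split(compute, network, nthreads)

print(combined)  # 3.0
print(split)     # 6.0
```

Under these assumptions the combined estimate is noticeably more optimistic, because it treats the 4 seconds of expected transfer as if four threads could share it.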
Taking `Scheduler.worker_objective`, which tries to calculate `start_time`, as an example, the actual start time should rather be

This would likely increase the quality of our scheduling decisions and would very clearly avoid double counting problems like #7003.
On top of that, this would add a significant observability component, since we could directly visualize how much network vs. compute work is expected from a worker. I could also see the ratio of the two values being an interesting metric to track (similar to what work stealing is trying to do with the steal ratio).
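As a rough illustration of the observability idea, the sketch below tracks the two components separately and derives a ratio from them. `SplitOccupancy`, its fields, and `network_ratio` are hypothetical names chosen for this example and are not part of the scheduler's API.

```python
from dataclasses import dataclass


@dataclass
class SplitOccupancy:
    compute: float = 0.0  # expected compute seconds
    network: float = 0.0  # expected transfer seconds

    def add(self, duration, comm_cost):
        """Account for one assigned task, keeping both parts separate."""
        self.compute += duration
        self.network += comm_cost

    @property
    def total(self):
        """Equivalent of today's single occupancy value."""
        return self.compute + self.network

    @property
    def network_ratio(self):
        """Share of expected work that is data transfer (0.0 if idle)."""
        return self.network / self.total if self.total else 0.0


occ = SplitOccupancy()
occ.add(duration=2.0, comm_cost=0.5)
occ.add(duration=1.0, comm_cost=0.5)

print(occ.total)          # 4.0
print(occ.network_ratio)  # 0.25
```

Keeping `total` as a derived property means existing consumers of a single occupancy value would still have a number to read, while dashboards could plot `compute`, `network`, and the ratio individually.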