Queuing (#6614) is meant to prevent root task overproduction (#5555), and it's been shown to be very effective at doing so: #7128.
However, due to the heuristic for what counts as a "root-ish" task, it'll only stop root task overproduction if you have more than `total_nthreads * 2` root tasks.
Overproduction can occur any time there are more than `total_nthreads` root tasks. So in this middle case, queuing won't kick in and the `worker-saturation` value won't be respected.
This is confusing behavior for users. If you make your problem size smaller, or make your cluster bigger (two things you'd expect to reduce per-worker memory usage), you may cross an opaque magic threshold at which your workload suddenly uses up to 2x more memory.
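To make the gap concrete, here's a toy calculation (the cluster size and task counts below are hypothetical, not from any benchmark):

```python
# Hypothetical cluster: 10 workers x 4 threads each.
total_nthreads = 40

def queuing_applies(n_root_tasks: int) -> bool:
    # Current heuristic: a task group only counts as "root-ish"
    # (and so gets queued) above total_nthreads * 2 tasks.
    return n_root_tasks > total_nthreads * 2

for n in (30, 60, 100):
    print(f"{n} root tasks -> queued: {queuing_applies(n)}")
# 30 root tasks -> queued: False  (fine: fewer root tasks than threads,
#                                  so overproduction can't happen anyway)
# 60 root tasks -> queued: False  (the middle case: overproduction is
#                                  possible, but queuing stays off)
# 100 root tasks -> queued: True
```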
EDIT:
To be clear, I propose a two-character change to fix this. Just drop the `* 2` part:
```diff
diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index b99e3f19..df20e807 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -3033,7 +3033,7 @@ class SchedulerState:
         tg = ts.group
         # TODO short-circuit to True if `not ts.dependencies`?
         return (
-            len(tg) > self.total_nthreads * 2
+            len(tg) > self.total_nthreads
             and len(tg.dependencies) < 5
             and sum(map(len, tg.dependencies)) < 5
         )
```
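For reference, here's a self-contained sketch of what the check looks like with the change applied. The `TaskGroup` class below is a made-up minimal stand-in for the real one, just enough to exercise the decision rule outside the scheduler:

```python
from dataclasses import dataclass, field

@dataclass
class TaskGroup:
    # Hypothetical stand-in for distributed's TaskGroup: a size plus a
    # list of dependency groups.
    n_tasks: int
    dependencies: list["TaskGroup"] = field(default_factory=list)

    def __len__(self) -> int:
        return self.n_tasks

def is_rootish(tg: TaskGroup, total_nthreads: int) -> bool:
    # Proposed rule: a group is root-ish if it's larger than the
    # cluster's total thread count and has (almost) no dependencies.
    return (
        len(tg) > total_nthreads
        and len(tg.dependencies) < 5
        and sum(map(len, tg.dependencies)) < 5
    )

# With 40 threads, a dependency-free group of 60 tasks is now root-ish,
# so the middle case from above gets queued:
print(is_rootish(TaskGroup(60), total_nthreads=40))  # True
```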
The `* 2` is a number @mrocklin and I just made up back in #4967. There wasn't any benchmarking or empirical reason for it. Just saying `> nthreads` is more logical and easier to justify.