Queuing does not prevent root task overproduction unless you have enough tasks #7273

Open
gjoseph92 (Collaborator) opened this issue Nov 8, 2022 · 1 comment

Queuing (#6614) is meant to prevent root task overproduction (#5555), and it's been shown to be very effective at doing so (#7128).

However, because of the heuristic for what counts as a "root-ish" task, it only prevents root task overproduction if you have > total_nthreads * 2 root tasks.

Overproduction can occur any time there are > total_nthreads root tasks. So in that middle range (more than total_nthreads but at most total_nthreads * 2 root tasks), queuing won't kick in and the worker-saturation value won't be respected.
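To put numbers on it, here's a minimal sketch; the cluster size and task-group size are made up for illustration, and only the two threshold comparisons come from the heuristic described here:

# Illustration only: a hypothetical 4-worker x 2-thread cluster and a 12-task root group.
total_nthreads = 4 * 2          # 8 threads cluster-wide
root_group_size = 12            # > 8, so overproduction is possible

queued_today = root_group_size > total_nthreads * 2   # 12 > 16 -> False: not queued
queued_proposed = root_group_size > total_nthreads    # 12 > 8  -> True: queued

print(queued_today, queued_proposed)  # False True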

This is confusing behavior for users. If you make your problem size smaller, or make your cluster bigger (two things you'd expect to reduce per-worker memory usage), you may cross an opaque magic threshold at which your workload suddenly uses up to 2x more memory.

EDIT:

To be clear, I propose a two-character change to fix this. Just drop the * 2 part:

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index b99e3f19..df20e807 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -3033,7 +3033,7 @@ class SchedulerState:
         tg = ts.group
         # TODO short-circuit to True if `not ts.dependencies`?
         return (
-            len(tg) > self.total_nthreads * 2
+            len(tg) > self.total_nthreads
             and len(tg.dependencies) < 5
             and sum(map(len, tg.dependencies)) < 5
         )

The * 2 is a number @mrocklin and I just made up back in #4967. There wasn't any benchmarking or empirical reason for it. Just saying > nthreads is more logical and easier to justify.
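To read that condition outside of diff context, here is the same check as a small standalone function; this is my paraphrase of the scheduler snippet above with the proposed threshold, and the argument names are mine, not the real method signature:

def is_rootish(group_size, n_group_dependencies, total_dependency_tasks, total_nthreads):
    # A task group is treated as "root-ish" (and therefore queued) when it is
    # larger than the cluster's total thread count and has few, small dependencies.
    return (
        group_size > total_nthreads            # proposed; currently total_nthreads * 2
        and n_group_dependencies < 5
        and total_dependency_tasks < 5
    )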

gjoseph92 (Collaborator, Author) commented:
An easy reproducer: test_graph_execution_width fails if you run it on graph sizes in this middle range.

diff --git a/distributed/tests/test_scheduler.py b/distributed/tests/test_scheduler.py
index 48ad99db..0294b524 100644
--- a/distributed/tests/test_scheduler.py
+++ b/distributed/tests/test_scheduler.py
@@ -326,13 +326,14 @@ async def test_decide_worker_rootish_while_last_worker_is_retiring(c, s, a):
         await wait(xs + ys)
 
 
+@pytest.mark.parametrize("n_roots", [6, 9, 16, 32])
 @pytest.mark.slow
 @gen_cluster(
     nthreads=[("", 2)] * 4,
     client=True,
     config={"distributed.scheduler.worker-saturation": 1.0},
 )
-async def test_graph_execution_width(c, s, *workers):
+async def test_graph_execution_width(c, s, *workers, n_roots):
     """
     Test that we don't execute the graph more breadth-first than necessary.
 
@@ -357,7 +358,7 @@ async def test_graph_execution_width(c, s, *workers):
                 self.log.append(self.count)
                 type(self).count -= 1
 
-    roots = [delayed(Refcount)() for _ in range(32)]
+    roots = [delayed(Refcount)() for _ in range(n_roots)]
     passthrough1 = [delayed(slowidentity)(r, delay=0) for r in roots]
     passthrough2 = [delayed(slowidentity)(r, delay=0) for r in passthrough1]
     done = [delayed(lambda r: None)(r) for r in passthrough2]

FAILED distributed/tests/test_scheduler.py::test_graph_execution_width[9] - AssertionError: assert 9 <= 8
FAILED distributed/tests/test_scheduler.py::test_graph_execution_width[16] - AssertionError: assert 16 <= 8
======================================================================== 2 failed, 2 passed in 3.07s ========================================================================
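For context on the 8 in those assertions: with worker-saturation = 1.0 on the 4-worker x 2-thread test cluster, the intended ceiling is 8 concurrently running roots cluster-wide, roughly as sketched below (a paraphrase of the limit queuing is supposed to enforce, not the scheduler's actual code):

import math

def per_worker_cap(nthreads, worker_saturation):
    # Approximate per-worker limit on concurrently processing root tasks
    # when queuing is active: round up, never below 1.
    return max(1, math.ceil(nthreads * worker_saturation))

# Test cluster: 4 workers x 2 threads each, worker-saturation = 1.0
print(4 * per_worker_cap(2, 1.0))  # 8 -> the bound the failing cases exceed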
