Withhold root tasks [no co assignment] #6614

gjoseph92 · 2022-06-22T23:09:19Z

This PR withholds root tasks on the scheduler in a global priority queue. Non-root tasks are unaffected.

By default, workers are only sent as many root tasks as they have threads. This factor can be configured via distributed.scheduler.worker-saturation (1.5 would send workers 1.5x as many tasks as they have threads, for example). Setting this config value to inf completely disables scheduler-side queuing and retains the current scheduling behavior ~~(minus co-assignment)~~.
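
As a rough illustration of how the saturation factor translates into a per-worker limit, the sketch below computes how many root tasks a worker would be sent. The function name `root_task_limit` is hypothetical, and rounding up with a floor of one task is an assumption; the description above only gives the 1.0 and 1.5 examples.

```python
import math


def root_task_limit(nthreads: int, saturation: float) -> float:
    """Hypothetical helper: max root tasks sent to one worker at a time.

    Assumes the product is rounded up and that every worker gets at
    least one task; neither detail is stated in the PR description.
    """
    if math.isinf(saturation):
        # inf disables scheduler-side queuing, so there is no limit
        return math.inf
    return max(math.ceil(nthreads * saturation), 1)


print(root_task_limit(4, 1.0))       # as many tasks as threads
print(root_task_limit(4, 1.5))       # 1.5x as many tasks as threads
print(root_task_limit(4, math.inf))  # queuing disabled
```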

This disregards root task co-assignment. Benchmarking will determine whether fixing root task overproduction is enough of a gain to be worth giving up (flawed) co-assignment. Root task assignment here is typically worst-possible-case: neighboring tasks will usually all be assigned to different workers.

~~I also could/will easily add back co-assignment when distributed.scheduler.worker-saturation is inf~~ EDIT: done. With that, this PR would be entirely feature-flaggable—we could merge it with the default set to inf and see zero change in scheduling out of the box.

Closes #6560, closes #6631, closes #6597 (with withholding mode turned off at least)

Supersedes #6584, which did the same, but for all tasks (even non-root). It also co-mingled unrunnable tasks (due to restrictions) and queued root tasks, which seemed unwise.

Tests added / passed
Passes pre-commit run --all-files

The idea was that if a `SortedSet` of unrunnable tasks is too expensive, then insertion order is probably _approximately_ priority order, since higher-priority (root) tasks will be scheduled first. This would give us O(1) for all necessary operations, instead of O(log n) for adding and removing. Interestingly, the `SortedSet` implementation could be hacked to support O(1) `pop` and `popleft`, and O(1) insertion of a new min/max value. In the most common case (root tasks), we're always inserting a value that's greater than the max. Something like this might be the best tradeoff, since it gives us O(1) in the common case but still maintains the sorted guarantee, which is easier to reason about.
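
A minimal sketch of that tradeoff, assuming plain comparable priorities: appending a new maximum (the common root-task case) is O(1), and the container stays sorted. The class `MostlySortedQueue` is hypothetical and is not the data structure used in this PR; its slow path rebuilds the deque in O(n) rather than doing the O(log n) insert a real `SortedSet` would.

```python
import bisect
from collections import deque


class MostlySortedQueue:
    """Hypothetical sorted queue with an O(1) fast path at both ends."""

    def __init__(self):
        self._items = deque()

    def add(self, item):
        if not self._items or item >= self._items[-1]:
            # Common case: new item is the largest so far, O(1) append
            self._items.append(item)
        elif item <= self._items[0]:
            # New minimum, also O(1)
            self._items.appendleft(item)
        else:
            # Rare case: insert in the middle, O(n) in this sketch
            items = list(self._items)
            bisect.insort(items, item)
            self._items = deque(items)

    def popleft(self):
        """Remove and return the smallest item in O(1)."""
        return self._items.popleft()

    def pop(self):
        """Remove and return the largest item in O(1)."""
        return self._items.pop()

    def __len__(self):
        return len(self._items)


q = MostlySortedQueue()
for priority in (1, 2, 5, 3, 0):
    q.add(priority)
print([q.popleft() for _ in range(len(q))])
```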

When there were multiple root task groups, we were just re-using the last worker for every batch because it had nothing processing on it. Unintentionally this also fixes dask#6597 in some cases (because the first task goes to processing, but we measure queued, so we pick the same worker for both task groups)

1. The family metric itself is flawed. Added linear chain traversal, but it's still not good. The maxsize is problematic and probably the wrong way to think about it?
   a) There's quite likely no maxsize parameter that will ever be right, because you could always have multiple independent crazy substructures that are each maxsize+1.
   b) Even when every task would be in the same family because they're all interconnected, there's still benefit to scheduling subsequent things together, even if you do partition. Minimizing priority partitions is always what you want. Maybe there's something where maxsize is not a hard cutoff, but a cutoff for where to split up interconnected structures?
2. Families probably need to be data structures? When a task completes, you'd like to know if it belongs to a family that actually has more tasks to run on that worker, vs the task just happens to look like it belongs to a family but was never scheduled as a rootish task.

Overall I like the family structure for scheduling up/down scaling, but figuring out how to identify them is tricky. Partitioning priority order is great because it totally avoids this problem, of course at the expense of scaling. Can we combine priority and graph structure to identify isolated families when reasonable, partition on priority when not?
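
Partitioning priority order, mentioned above as the alternative to families, can be sketched as splitting the priority-sorted task list into contiguous chunks, one per worker. The function name `partition_by_priority` and the even-split policy are assumptions for illustration only, not the scheduler's code.

```python
def partition_by_priority(tasks, n_workers):
    """Hypothetical helper: split priority-ordered tasks into contiguous chunks.

    Neighboring tasks land on the same worker, which is the point of
    co-assignment. The chunk sizes depend on n_workers, which is why this
    approach does not adapt well when the cluster scales up or down.
    """
    size, extra = divmod(len(tasks), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` workers each take one additional task
        stop = start + size + (1 if i < extra else 0)
        chunks.append(tasks[start:stop])
        start = stop
    return chunks


print(partition_by_priority(list(range(7)), 3))
```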

github-actions · 2022-06-23T02:12:43Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      15 files ±    0       15 suites ±0 7h 1m 15s ⏱️ + 18m 18s
  3 071 tests +  18   2 984 ✔️ +  15   85 💤 +1 2 ❌ +2
22 729 runs +144 21 740 ✔️ +137 986 💤 +4 3 ❌ +3

For more details on these failures, see this check.

Results for commit 093d7dc. ± Comparison against base commit 817ead3.

♻️ This comment has been updated with latest results.

gjoseph92 · 2022-08-31T06:00:21Z

@fjetter I believe all comments have been addressed.

For tests, I went with just test_graph_execution_width. I removed the process memory test. I liked the simplicity of your test suggestion, but test_graph_execution_width is slightly more thorough towards one edge case.

fjetter

I pushed another commit to address a merge conflict. If CI is green(ish) I'll go ahead and merge

fjetter · 2022-08-31T10:54:44Z

> I agree we should aim to get this in main quickly and then further iterate. I can see two avenues:
>
> 1. get it in as-is (post minor tweaks), and then run performance benchmarks vs a branch where is_rootish simply returns true. This potentially means getting in main a lot of code to then remove it shortly afterwards.
> 2. run perf benchmarks now, before merging, to prove that the is_rootish heuristic is indeed needed, albeit it may be tweaked in the future. To clarify I don't propose to benchmark wildly different tweaks to is_rootish; I would just like a battery of tests with different use cases showing
>    - main
>    - this PR with worker-saturation: inf (no regression vs main - just for safety)
>    - this PR with worker-saturation: 1.2
>    - this PR with worker-saturation: 1.2, but with return True at the top of is_rootish

We discussed this early on, before we even started implementation. We agreed to merge this behind the feature flag since this will not change the behavior compared to main.
The goal is to set a default parameter for this value asap by running benchmarks. If we are not happy with the performance or cannot find a value that is a sane default, we may even rip this entire thing out again.

dcherian · 2022-09-15T15:55:19Z

@gjoseph92 (and everyone else involved here), thank you! How do I test it out ;)?

with dask.config.set({"distributed.scheduler.worker-saturation": 1.5}):
    result.compute()

Is this right? What range of values should I provide: inf as an upper-bound is not very useful.

gjoseph92 · 2022-09-15T17:06:20Z

There are many different ways to set dask config, and depending on how your clusters are deployed (local vs dask-gateway/pangeo vs dask-cloudprovider vs coiled, etc.), the way to set that config will vary.

Often though, the easiest can be to set an environment variable on the cluster. (Pangeo / dask-gateway docs, coiled docs, saturn docs, dask-cloudprovider seems to support env_vars=.) Note that the variable needs to be set before the scheduler starts—once the scheduler has started, setting it will have no effect.

$ DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION=1.0 dask-scheduler
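
The variable name in the command above follows Dask's environment-variable convention: a `DASK_` prefix, upper case, and a double underscore for each level of nesting. The helper below is a hypothetical illustration of that mapping and is not part of dask.

```python
def config_key_to_env_var(key: str) -> str:
    """Hypothetical helper: dask config key -> environment variable name."""
    # "." separates nesting levels and becomes "__"; "-" becomes "_"
    return "DASK_" + key.upper().replace(".", "__").replace("-", "_")


print(config_key_to_env_var("distributed.scheduler.worker-saturation"))
```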

For a local cluster, when creating your cluster, you can just do:

with dask.config.set({"distributed.scheduler.worker-saturation": 1.0}):
    client = distributed.Client(n_workers=..., threads_per_worker=...)

If you can't get the config to work, it's possible to change the setting on a live cluster. You could also use this to try different settings without re-creating the cluster. Only run this while the scheduler is idle (no tasks). Otherwise, you'll probably break your cluster.

# enable queuing (new behavior)
client.run_on_scheduler(lambda dask_scheduler: setattr(dask_scheduler, "WORKER_SATURATION", 1.0))

# disable queuing (old behavior)
client.run_on_scheduler(lambda dask_scheduler: setattr(dask_scheduler, "WORKER_SATURATION", float("inf")))
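
To confirm that the change took effect, the same mechanism can read the attribute back. This is a sketch under the same assumptions as the lambdas above: `client` is a connected `distributed.Client`, and `run_on_scheduler` passes the scheduler in as the `dask_scheduler` keyword. The function name `set_worker_saturation` is hypothetical.

```python
def set_worker_saturation(client, value: float) -> float:
    """Set WORKER_SATURATION on a live scheduler and return the value read back.

    Only call this while the scheduler is idle (no tasks).
    """

    def _set(dask_scheduler):
        dask_scheduler.WORKER_SATURATION = float(value)
        # Read the attribute back so the caller can verify the change
        return dask_scheduler.WORKER_SATURATION

    return client.run_on_scheduler(_set)
```

Calling `set_worker_saturation(client, 1.0)` enables queuing, and `set_worker_saturation(client, float("inf"))` restores the old behavior.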

> What range of values should I provide: inf as an upper-bound is not very useful

I would try the 1.0 - 2.0 range. I would expect 1.0 to usually be what you want. We are doing some benchmarking, and hopefully will figure out what a good value is across the board, and remove the need/ability to set this value in the future.

@dcherian and anyone who tries this, please report back with your findings, regardless of what they are! We would really like to hear how this works on real-world uses.

jrbourbeau · 2022-09-23T19:32:05Z

Just checking in here, @dcherian any luck trying things out? Happy to help out

TomNicholas · 2022-10-17T20:10:42Z

Great to see this merged (and exciting to see Deepak's results too)!

Now that we have 2022.9.2 on the LEAP hub (2i2c-org/infrastructure#1769), I'm trying this out there.

Unfortunately I can't seem to set the worker saturation option successfully. 😕

Setting via the gateway cluster options manager isn't working - if I do this

options.environment = {"MALLOC_TRIM_THRESHOLD_": "0"}
gc = g.new_cluster(cluster_options=options)

the cluster starts as expected, but if I do this

options.environment = {"MALLOC_TRIM_THRESHOLD_": "0", "DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION": 1.1}
gc = g.new_cluster(cluster_options=options)

then it hangs indefinitely on cluster creation.

I'm not quite sure where or when I'm supposed to run `$ DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION=1.0 dask-scheduler`. Does it create a new scheduler? Is that different to the cluster?

I also tried client.run_on_scheduler(lambda dask_scheduler: setattr(dask_scheduler, "WORKER_SATURATION", 1.1))
but when I did the check suggested on the pangeo cloud docs I just get an empty dict back.

gjoseph92 added 28 commits June 22, 2022 17:01

Queue root tasks scheduler-side

6b6651b

Show queued tasks with crosshatching on dashboard

6225d1a

improve reasonableness of task-state order

1496abb

Now task states on the dashboard are listed in the logical order that tasks transition through.

Allow configurable oversaturation

7457865

Only support floats for worker-oversaturation

67e9bd2

Simpler, though I think basically just an int of 1 may be the most reasonable.

Push memory limits a little more in test

2410a82

Queued tasks on info pages

49d5ddd

driveby: WIP color task graph by worker

b546997

This is just a hack currently, but maybe it would actually be useful?

Revert "driveby: WIP color task graph by worker"

2b44820

This reverts commit df11f719b59aad11f39a27ccae7b2fd4dfd9243a.

Queued tasks on graph

e494e87

Redistribute queues when new worker joins

ad417ed

Fix task_slots_available when queuing disabled

b4c698e

Revert "unused: OrderedSet collection"

b514e84

This reverts commit fdd5fd9.

Fix potential stale worker use in decide_worker

1835a89

Withhold root tasks [no co-assignment]

0f6603c

Factor out _add_to_processing

e10fdca

Factor out _propagage_released

3eb1d68

Update check_idle_saturated

c685b3c

Update docstring and add back logic for queuing disabled case

Fix topk for 0/negative values

e1dda98

Tests for HeapSet.topk

f811246

fix mypy

d347b32

worker-oversaturation -> worker-saturation

1990dd7

Just easier to explain this way

fixup! Factor out _add_to_processing

be1b9ca

fix test_queued_tasks_rebalance

85f9120

Fix occupancy tests

bb08c8d

I think this fix is reasonable? I wonder if occupancy should include queued tasks though?

gjoseph92 mentioned this pull request Jun 22, 2022

[DNM] Don't queue tasks on workers #6584

Closed

2 tasks

gjoseph92 added 4 commits August 30, 2022 18:52

remove test_near_memory_limit_workload

2b3f6ae

feeling pretty good about just `test_graph_execution_width`

handle_worker_status_change in retire_workers

5e4d53d

Using it as an API saves having to manage `running` and `idle` in multiple places

Merge remote-tracking branch 'upstream/main' into withold-root-tasks-…

333bbb2

…no-co-assign

avoid flaky test_graph_execution_width

acc524f

hesitant on this, but I don't want to introduce a flaky test

gjoseph92 and others added 3 commits August 31, 2022 00:34

fix test_decide_worker_coschedule_order_binary_op

ba336b9

fixup! handle_worker_status_change

9d99d74

Fix merge conflict of renaming transfer log

093d7dc

fjetter approved these changes Aug 31, 2022

fjetter merged commit dd81b42 into dask:main Aug 31, 2022

gjoseph92 deleted the withold-root-tasks-no-co-assign branch August 31, 2022 15:00

gjoseph92 restored the withold-root-tasks-no-co-assign branch August 31, 2022 15:07

gjoseph92 deleted the withold-root-tasks-no-co-assign branch August 31, 2022 15:07

gjoseph92 mentioned this pull request Sep 1, 2022

Performance regressions after queuing PR coiled/benchmarks#295

Open

fjetter mentioned this pull request Sep 1, 2022

Release 2022.9.0 dask/community#270

Closed

5 tasks

crusaderky mentioned this pull request Sep 1, 2022

Automation of benchmark comparison coiled/benchmarks#292

Closed

gjoseph92 mentioned this pull request Sep 6, 2022

⚠️ CI failed ⚠️ coiled/benchmarks#299

Closed

jrbourbeau mentioned this pull request Sep 23, 2022

[Fail case] Almost-blockwise weighted arithmetic vorticity calculation pangeo-data/distributed-array-examples#1

Open

hendrikmakait mentioned this pull request Sep 29, 2022

worker-saturation impacts balancing in work-stealing #7085

Closed

gjoseph92 mentioned this pull request Oct 12, 2022

an example that shows the need for memory backpressure #2602

Closed

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022

Withhold root tasks [no co assignment] (dask#6614)

b3b23aa

Co-authored-by: crusaderky <[email protected]> Co-authored-by: fjetter <[email protected]>

This was referenced Nov 7, 2022

Handle edge cases between queued and no-worker #7259

Closed

Queuing does not prevent root task overproduction unless you have enough tasks #7273

Open

fjetter mentioned this pull request Nov 9, 2022

Revert idle classification when worker-saturation is set #7278

Merged

gjoseph92 mentioned this pull request Feb 24, 2023

dask.order over-prioritizes root tasks in some situations dask/dask#9995

Closed
