Withhold root tasks [no co assignment] #6614

Merged: 104 commits, Aug 31, 2022

Commits on Jun 22, 2022

  1. unused: OrderedSet collection

    Idea was that if a `SortedSet` of unrunnable tasks is too expensive, then insertion order is probably _approximately_ priority order, since higher-priority (root) tasks will be scheduled first. This would give us O(1) for all necessary operations, instead of O(log n) for adding and removing.
    
    Interestingly, the `SortedSet` implementation could be hacked to support O(1) `pop` and `popleft`, and inserting a min/max value. In the most common case (root tasks), we're always inserting a value that's greater than the max. Something like this might be the best tradeoff, since it gives us O(1) in the common case but still maintains the sorted guarantee, which is easier to reason about.
    gjoseph92 committed Jun 22, 2022
    afedccd
  2. 6b6651b
  3. 6225d1a
  4. improve reasonableness of task-state order

    Now task states on the dashboard are listed in the logical order that tasks transition through.
    gjoseph92 committed Jun 22, 2022
    1496abb
  5. 7457865
  6. Only support floats for worker-oversaturation

    Simpler, though I think basically just an int of 1 may be the most reasonable.
    gjoseph92 committed Jun 22, 2022
    67e9bd2
  7. 2410a82
  8. Queued tasks on info pages

    gjoseph92 committed Jun 22, 2022
    49d5ddd
  9. driveby: WIP color task graph by worker

    This is just a hack currently, but maybe it would actually be useful?
    gjoseph92 committed Jun 22, 2022
    b546997
  10. Revert "driveby: WIP color task graph by worker"

    This reverts commit df11f719b59aad11f39a27ccae7b2fd4dfd9243a.
    gjoseph92 committed Jun 22, 2022
    2b44820
  11. Queued tasks on graph

    gjoseph92 committed Jun 22, 2022
    e494e87
  12. ad417ed
  13. b4c698e
  14. Fix co-assignment logic to consider queued tasks

    When there were multiple root task groups, we were just re-using the last worker for every batch because it had nothing processing on it.
    
    Unintentionally this also fixes dask#6597 in some cases (because the first task goes to processing, but we measure queued, so we pick the same worker for both task groups)
    gjoseph92 committed Jun 22, 2022
    aa4e531
  15. Revert "unused: OrderedSet collection"

    This reverts commit fdd5fd9.
    gjoseph92 committed Jun 22, 2022
    b514e84
  16. 1835a89
  17. WIP identify root task families

    1. The family metric itself is flawed. Added linear chain traversal, but it's still not good. The maxsize is problematic and probably the wrong way to think about it? a) there's quite likely no maxsize parameter that will ever be right, because you could always have multiple independent crazy substructures that are each maxsize+1. b) even when every task would be in the same family because they're all interconnected, there's still benefit to scheduling subsequent things together, even if you do partition. Minimizing priority partitions is always what you want. Maybe there's something where maxsize is not a hard cutoff, but a cutoff for where to split up interconnected structures?
    2. Families probably need to be data structures? When a task completes, you'd like to know if it belongs to a family that actually has more tasks to run on that worker, vs the task just happens to look like it belongs to a family but was never scheduled as a rootish task.
    
    Overall I like the family structure for scheduling up/down scaling, but figuring out how to identify them is tricky. Partitioning priority order is great because it totally avoids this problem, of course at the expense of scaling. Can we combine priority and graph structure to identify isolated families when reasonable, partition on priority when not?
    gjoseph92 committed Jun 22, 2022
    db42c22
  18. 0f6603c
  19. e10fdca
  20. 3eb1d68
  21. Update check_idle_saturated

    Update docstring and add back logic for queuing disabled case
    gjoseph92 committed Jun 22, 2022
    c685b3c
  22. e1dda98
  23. Tests for HeapSet.topk

    gjoseph92 committed Jun 22, 2022
    f811246
  24. fix mypy

    gjoseph92 committed Jun 22, 2022
    d347b32
  25. worker-oversaturation -> worker-saturation

    Just easier to explain this way
    gjoseph92 committed Jun 22, 2022
    1990dd7
  26. be1b9ca
  27. 85f9120
  28. Fix occupancy tests

    I think this fix is reasonable? I wonder if occupancy should include queued tasks though?
    gjoseph92 committed Jun 22, 2022
    bb08c8d
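
The `OrderedSet` idea from the first Jun 22 commit (later reverted) can be illustrated with a minimal sketch. This is not the code from that commit: the class name `InsertionOrderedSet` and its methods are hypothetical, chosen only to show how an insertion-ordered set gets O(1) add, remove, `pop`, and `popleft` from the standard library's `collections.OrderedDict`.

```python
from collections import OrderedDict
from typing import Generic, Hashable, Iterator, TypeVar

T = TypeVar("T", bound=Hashable)


class InsertionOrderedSet(Generic[T]):
    """Hypothetical set that remembers insertion order (not the PR's code)."""

    def __init__(self) -> None:
        self._items: "OrderedDict[T, None]" = OrderedDict()

    def add(self, item: T) -> None:
        # O(1); re-adding an existing item keeps its original position
        self._items.setdefault(item, None)

    def discard(self, item: T) -> None:
        # O(1); silently ignores items that are not present
        self._items.pop(item, None)

    def popleft(self) -> T:
        # O(1); removes and returns the oldest item
        return self._items.popitem(last=False)[0]

    def pop(self) -> T:
        # O(1); removes and returns the newest item
        return self._items.popitem(last=True)[0]

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)
```

The tradeoff named in the commit message still applies: such a structure is only approximately priority-ordered, whereas a sorted or heap-based set keeps the ordering guarantee at O(log n) per insertion.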

Commits on Jun 23, 2022

  1. Test releasing previously queued paused tasks

    Tasks shouldn't be both `no-worker` and in the queue. If all workers are paused, tasks currently go to `no-worker`, even if they're queued. If we then try to schedule them (because a slot opens up from task completion, tasks released, new worker joining, etc.) we find an invalid state.
    gjoseph92 committed Jun 23, 2022
    966d61f
  2. driveby: fix transition debug log end state

    This was logging the actual end state, instead of the recommended end state
    gjoseph92 committed Jun 23, 2022
    15494f0
  3. Refactor scheduling when no workers are running

    If all workers were paused, we would put tasks in the `no-worker` state. Now that `queued` is a thing, we want queued tasks in this case to just stay on the queue, and not be added to `unrunnable`.
    
    This commit takes the opposite of @crusaderky's view in https://github.com/dask/distributed/pull/5665/files#r787886583, and makes `idle` always a subset of `running`. Even if, pedantically, the name `idle` isn't quite accurate, `idle` is typically _used_ as the set of "prime candidates for new tasks", so we make it that way.
    
    We do this to maintain the invariant that `valid_workers` always returns None if the task doesn't have restrictions. Our root task detection logic relied on this, as did the `not ts.loose_restrictions` check. Otherwise, when some workers are paused, root tasks will no longer be scheduled in the typical way.
    
    There are other approaches here which might be simpler, which I'll explore in following commits.
    gjoseph92 committed Jun 23, 2022
    546aa4a
  4. Don't send queued tasks to no-worker

    A way more minimal fix than 5b9d825afb9ab3a61ab22afef3b047dde238bc5f, but not ideal because if only some workers are paused, we'll get root task overproduction on the others (because having `valid_workers` bypasses the root task detection logic).
    gjoseph92 committed Jun 23, 2022
    ffbb53b
  5. Schedule rootish tasks when some workers are paused

    `valid_workers` will return a set if some workers are paused, even if the task doesn't have restrictions. This is annoying and a bit misleading, but possibly a less intrusive change than 5b9d825afb9ab3a61ab22afef3b047dde238bc5f?
    gjoseph92 committed Jun 23, 2022
    65735f8
  6. Revert less-intrusive all-paused handling

    I think overall excluding paused workers from idle is just more sensible. As is having `valid_workers` not concern itself with `running` workers. `valid_workers` should only deal in _task-specific_ restrictions, not restrictions that would apply to all tasks.
    
    This reverts commits 65735f8, ffbb53b.
    gjoseph92 committed Jun 23, 2022
    6bf710c
  7. Decrease test_root_task_overproduction size

    Workers seem to be running out of memory on CI. Probably different base unmanaged memory sizes than my machine. This is tricky.
    gjoseph92 committed Jun 23, 2022
    25e6f3b
  8. 988b0cf
  9. b86fe0f
  10. Fix co-assignment for binary operations

    Bit of a hack, but closes dask#6597. I'd like to have a better metric for the batch size, but I think this is about as good as we can get. Any reasonably large number will do here.
    gjoseph92 committed Jun 23, 2022
    7ebd1d9
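
The Jun 23 commits settle on making `idle` always a subset of `running`, so that paused workers are never treated as prime candidates for new tasks. A minimal sketch of that invariant follows. The `WorkerInfo` dataclass, the `Status` enum, and `idle_workers` are hypothetical stand-ins rather than the scheduler's real classes, and the definition of idle used here (fewer processing tasks than threads) is an assumption for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    running = "running"
    paused = "paused"


@dataclass(frozen=True)
class WorkerInfo:
    """Hypothetical stand-in for the scheduler's per-worker state."""

    address: str
    status: Status
    nthreads: int
    processing: int  # number of tasks currently assigned to the worker


def idle_workers(workers: list[WorkerInfo]) -> set[str]:
    """Return addresses of workers that are running and have a free thread.

    Assumption: "idle" means fewer processing tasks than threads.
    Paused workers are filtered out first, so the result is always a
    subset of the running workers.
    """
    return {
        w.address
        for w in workers
        if w.status is Status.running and w.processing < w.nthreads
    }
```

With this shape, a cluster where every worker is paused has an empty idle set, and queued tasks can simply stay on the queue until a worker starts running again.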

Commits on Jun 24, 2022

  1. Turn withholding off by default

    Want to see if CI passes. This would be retaining current scheduling behavior. Task withholding would be behind a feature flag.
    gjoseph92 committed Jun 24, 2022
    034f980
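
To make the feature flag concrete, here is a rough sketch of how a float-valued saturation setting could translate into a per-worker task limit. The helper name `task_slots_available` and the formula are assumptions for illustration, not taken from the commits above. The sketch treats an infinite value as "never withhold", which would correspond to keeping the existing scheduling behavior.

```python
import math


def task_slots_available(nthreads: int, processing: int, saturation: float) -> float:
    """Hypothetical helper: how many more tasks a worker may be sent.

    Assumption (not taken from the PR): a worker accepts up to
    ceil(saturation * nthreads) tasks, and always at least one.
    saturation = inf means tasks are never withheld.
    """
    if math.isinf(saturation):
        return math.inf
    return max(math.ceil(saturation * nthreads), 1) - processing
```

Under these assumptions, a 4-thread worker with `saturation=1.0` accepts 4 tasks at a time, while `saturation=1.5` lets it hold 6, trading some extra memory use for less idle time between tasks.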

Commits on Aug 17, 2022

  1. 0af53b4
  2. Remove redundant insert into idle

    Already covered by `if p < nc` in `check_idle_saturated`. But the one removed here didn't check for `status == Status.running`
    gjoseph92 committed Aug 17, 2022
    a63d25b
  3. Update idle docstring

    gjoseph92 committed Aug 17, 2022
    5f7e7f1
  4. 9aeecc9
  5. fix test_saturation_factor

    Not sure why these numbers changed
    gjoseph92 committed Aug 17, 2022
    dcb11e4
  6. 7dfc83e

Commits on Aug 18, 2022

  1. fix progress_stream

    gjoseph92 committed Aug 18, 2022
    c99bbe8
  2. c1544f3
  3. fix config json schema

    gjoseph92 committed Aug 18, 2022
    349712f
  4. fix retire workers

    gjoseph92 committed Aug 18, 2022
    594585e
  5. update validate_task_state

    gjoseph92 committed Aug 18, 2022
    b4f843d
  6. fix test_saturation_factor again

    Apparently they're just unpredictable
    gjoseph92 committed Aug 18, 2022
    2db4db9

Commits on Aug 19, 2022

  1. 8395ef4
  2. hackily consider queue in adaptive target

    TODO this is one of the main things unhandled in this PR: how do we address occupancy? Do queued tasks contribute to total occupancy or not? In either case, how is that implemented?? (I don't really want to make a `queued_occ` dict tracking per-task occupancy, like we have for processing; that feels like overkill.)
    gjoseph92 committed Aug 19, 2022
    36a60a5
  3. da04438
  4. c92236c
  5. e990b92
  6. correct bulk_schedule comment

    I mistakenly thought that in the transitions loop, new recommendations were processed after old ones. I believe it's the opposite (`dict.update` will add the new items at the end, `dict.popitem` will pop those new items off the end).
    
    It wouldn't be too hard to sort all the recommendations here, just some extra allocations and copies.
    gjoseph92 committed Aug 19, 2022
    0d21c78
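
The ordering described in the `bulk_schedule` comment fix can be checked with plain dictionaries. The snippet below is a standalone illustration, not scheduler code: the keys and the `"processing"` values are made up. It shows that `dict.update` appends new keys at the end and `dict.popitem` removes from the end, so newly added recommendations are handled before older ones.

```python
recommendations = {"a": "processing", "b": "processing"}
order = []

# First round: pop one recommendation off the end of the dict.
key, _ = recommendations.popitem()
order.append(key)  # "b", the most recently inserted key

# Handling "b" produces new recommendations; update() appends them at the end.
recommendations.update({"c": "processing", "d": "processing"})

# Drain the rest: the newest items come off first (LIFO).
while recommendations:
    key, _ = recommendations.popitem()
    order.append(key)

print(order)  # ['b', 'd', 'c', 'a']
```

The oldest recommendation, `"a"`, is processed last, which is why sorting the recommendations would be needed to get strict priority order.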

Commits on Aug 23, 2022

  1. b40cec1
  2. 12b94d0
  3. 4c0e768

Commits on Aug 24, 2022

  1. yaml schema fixes

    Co-authored-by: crusaderky
    gjoseph92 and crusaderky authored Aug 24, 2022
    14cc157
  2. topk -> peekn

    The previous naming and docstring were just wrong.
    gjoseph92 committed Aug 24, 2022
    f5d7be4
  3. 704b485
  4. 38e0598
  5. Split up decide_worker, remove recs

    This overhauls `decide_worker` into separate methods for different
    cases.
    
    More importantly, it explicitly turns `transition_waiting_processing`
    into the primary dispatch mechanism for ready tasks.
    
    All ready tasks (deps in memory) now always get recommended to
    processing, regardless of whether there are any workers in the cluster,
    whether they have restrictions, whether they're root-ish, etc.
    
    `transition_waiting_processing` then decides how to handle them
    (depending on whether they're root-ish or not), and calls the
    appropriate `decide_worker` method to search for a worker.
    
    If a worker isn't available, then it recommends them off to `queued` or
    `no-worker` (depending, again, on whether they're root-ish and the
    WORKER_SATURATION setting).
    
    This also updates the `no-worker` state to better match `queued`.
    Before, `bulk_schedule_after_adding_worker` would send `no-worker` tasks
    to `waiting`, which would then send them to `processing`. This was
    weird, because in order to be in `no-worker`, they should already be ready
    to run (just in need of a worker). So going straight to `processing` makes
    more sense than sending a ready task back to waiting.
    
    Finally, this adds a `SchedulerState.is_rootish` helper. Not quite the
    static field on a task @fjetter wants in dask#6922, but a step in that
    direction.
    gjoseph92 committed Aug 24, 2022
    d47e80d
  6. remove no_worker->memory just to see what happens

    The only valid way I can imagine any of these happening is `client.scatter` within a worker. If this is actually needed, I guess I should add an equivalent for queued?
    gjoseph92 committed Aug 24, 2022
    2cc8631
  7. 100118a
  8. 842ee71
  9. 494fe48
  10. dd88b0d
  11. Revert "remove no_worker->memory just to see what happens"

    This reverts commit 2cc8631.
    gjoseph92 committed Aug 24, 2022
    e17c624
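
The dispatch flow described in "Split up decide_worker, remove recs" can be sketched roughly as follows. This is a simplified illustration, not the scheduler's code: `next_state`, its parameters, and the `find_worker` callback are hypothetical, and the rule that queuing only applies when the saturation setting is finite is an assumption.

```python
import math
from typing import Callable, Optional


def next_state(
    is_rootish: bool,
    worker_saturation: float,
    find_worker: Callable[[], Optional[str]],
) -> tuple[str, Optional[str]]:
    """Hypothetical dispatch for a ready task: returns (state, worker).

    Every ready task is first tried for `processing`. Only if no worker
    is found does it fall back to `queued` or `no-worker`.
    Assumption: queuing applies only to root-ish tasks, and only when
    worker_saturation is finite.
    """
    worker = find_worker()
    if worker is not None:
        return "processing", worker
    if is_rootish and math.isfinite(worker_saturation):
        return "queued", None
    return "no-worker", None
```

For example, `next_state(True, 1.0, lambda: None)` yields `("queued", None)`, while the same call with `math.inf` yields `("no-worker", None)`. Keeping the fallback decision in one place mirrors the commit's point that `transition_waiting_processing` becomes the single dispatch mechanism for ready tasks.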

Commits on Aug 25, 2022

  1. 06d60fe
  2. 96d59eb
  3. test_root_task_overproduction adaptive data size

    Still maybe not a test that should run in CI, I just like how real-world it is. Let's see if picking the task size based on available memory helps on windows.
    gjoseph92 committed Aug 25, 2022
    3240a43
  4. 9344dd9

Commits on Aug 26, 2022

  1. improve test_queued_paused

    gjoseph92 committed Aug 26, 2022
    aa8e1db
  2. test_queued_paused_unpaused

    gjoseph92 committed Aug 26, 2022
    ee1a754
  3. f36a6ac
  4. 78353e1
  5. 18b7bb5
  6. f3a66df
  7. c5f2746
  8. don't need that fail_func

    gjoseph92 committed Aug 26, 2022
    4b2a209
  9. remove test_oversaturation_multiple_task_groups

    will add it back when we actually implement co-assignment
    gjoseph92 committed Aug 26, 2022
    3cebe54
  10. 51dca31

Commits on Aug 27, 2022

  1. Documentation suggestions

    Co-authored-by: crusaderky
    gjoseph92 and crusaderky authored Aug 27, 2022
    14dc850
  2. test_graph_execution_width

    gjoseph92 committed Aug 27, 2022
    b36064e
  3. skip test_root_task_overproduction on windows

    I don't understand why it's flaking on Windows, but I imagine it's just because memory measurement and process memory overhead behave differently. It could really just run on Linux, but I'm leaving it un-skipped on macOS for now so macOS developers can run it locally.
    gjoseph92 committed Aug 27, 2022
    5b2bc02

Commits on Aug 29, 2022

  1. 1819a51
  2. 2952f6b
  3. decide_worker_rootish_queuing_enabled no task

    don't even need to pass it in right now; it's not used
    gjoseph92 committed Aug 29, 2022
    d00ea54
  4. 8ba4ced

Commits on Aug 30, 2022

  1. 63d863d
  2. b7704e3
  3. 00b54e7
  4. 12207e6

Commits on Aug 31, 2022

  1. 02c98b3
  2. remove test_near_memory_limit_workload

    feeling pretty good about just `test_graph_execution_width`
    gjoseph92 committed Aug 31, 2022
    2b3f6ae
  3. handle_worker_status_change in retire_workers

    Using it as an API saves having to manage `running` and `idle` in multiple places
    gjoseph92 committed Aug 31, 2022
    5e4d53d
  4. 333bbb2
  5. avoid flaky test_graph_execution_width

    hesitant on this, but I don't want to introduce a flaky test
    gjoseph92 committed Aug 31, 2022
    acc524f
  6. ba336b9
  7. 9d99d74
  8. 093d7dc