Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix transfer limiting in _select_keys_for_gather #7071

Merged
merged 12 commits into from
Sep 27, 2022

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Sep 26, 2022

Closes

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait hendrikmakait marked this pull request as ready for review September 26, 2022 19:26
@github-actions
Copy link
Contributor

github-actions bot commented Sep 26, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  +       1         15 suites  +1   6h 7m 18s ⏱️ + 30m 14s
  3 122 tests +       6    3 036 ✔️ +     17    85 💤  - 11  1 ±0 
23 106 runs  +1 281  22 198 ✔️ +1 341  907 💤  - 60  1 ±0 

For more details on these failures, see this check.

Results for commit ff4c4ed. ± Comparison against base commit 8c4133c.

♻️ This comment has been updated with latest results.

@hendrikmakait
Copy link
Member Author

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Can confirm that this change fixes the performance regression we observe.

Mostly minor comments to do with naming.

distributed/worker_state_machine.py Outdated Show resolved Hide resolved
distributed/worker_state_machine.py Outdated Show resolved Hide resolved
distributed/worker_state_machine.py Outdated Show resolved Hide resolved
distributed/worker_state_machine.py Outdated Show resolved Hide resolved
distributed/worker_state_machine.py Outdated Show resolved Hide resolved
distributed/worker_state_machine.py Outdated Show resolved Hide resolved
distributed/worker_state_machine.py Outdated Show resolved Hide resolved
self.transfer_incoming_bytes
or to_gather
) and total_nbytes + ts.get_nbytes() > bytes_left_to_fetch:
if self._task_exceeds_transfer_limits(ts, total_nbytes):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but it seems that this gather logic is somewhat eager to terminate. Suppose we have many tasks available, a few already in flight, and the top priority task is large (will exceed transfer limits). We'll never go further through the available list to check if we could actually eagerly transfer some lower-priority (but smaller) tasks.

This may be deliberate, but not sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's a rather aggressive stop criterion. The main reason we have this is that I haven't seen any evidence that would support a more elaborate one. The alternative would be to implement some form of looking at what's behind the task that exceeds the limits (or a completely different algorithm for packing). Depending on how that's done, this might lead to a lot of unnecessary work. Suppose we have many tasks available and they are all roughly equally-sized. Chances are pretty high that if the top-priority task doesn't fit into the message anymore, none of the following ones would. I'm happy to iterate on this assuming it's becoming an actual problem.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor nits, otherwise looks great, thanks for the quick fix!

distributed/worker_state_machine.py Outdated Show resolved Hide resolved
distributed/tests/test_worker_state_machine.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hendrikmakait and @wence-

@gjoseph92 gjoseph92 merged commit a007345 into dask:main Sep 27, 2022
gjoseph92 pushed a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants