Fix test_stress_scatter_death #6404
Conversation
With the fixes in this PR, this should not block the release.
Unit Test Results: 15 files ±0, 15 suites ±0, 6h 28m 18s ⏱️ +1m 56s. For more details on these failures, see this check. Results for commit 853a59f. ± Comparison against base commit 7e49d88.
if ts.resource_restrictions is not None:
    if ts.state == "executing":
        for resource, quantity in ts.resource_restrictions.items():
            self.available_resources[resource] += quantity
I removed this transition-specific logic from the method since it is already in place for all transitions. Everything else in the diff for this method is just unindentation.
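As an aside, a minimal, self-contained sketch of that centralised bookkeeping (class and method names below are invented for illustration, not the actual Worker code): resources reserved by an executing task are handed back in exactly one place, the transition out of "executing", so no caller needs transition-specific logic.

from collections import defaultdict

class _TaskState:
    def __init__(self, state, resource_restrictions=None):
        self.state = state
        self.resource_restrictions = resource_restrictions or {}

class _WorkerSketch:
    def __init__(self):
        self.available_resources = defaultdict(int, {"GPU": 1})

    def transition_executing_released(self, ts):
        # Hand back whatever the task reserved; this is the single place
        # where resource accounting happens on release.
        for resource, quantity in ts.resource_restrictions.items():
            self.available_resources[resource] += quantity
        ts.state = "released"

w = _WorkerSketch()
w.transition_executing_released(_TaskState("executing", {"GPU": 1}))
assert w.available_resources["GPU"] == 2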
async def test_resumed_cancelled_handle_compute(
    c, s, a, b, raise_error, wait_for_processing
):
This covers the changes in handle_compute, where I moved task states around in the switch statement and added an error handler that was missing before.
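For readers unfamiliar with the test layout, a rough sketch of how such a parametrized gen_cluster test is typically structured (the body below is a placeholder, not this PR's actual test logic):

import pytest
from distributed.utils_test import gen_cluster

@pytest.mark.parametrize("wait_for_processing", [True, False])
@pytest.mark.parametrize("raise_error", [True, False])
@gen_cluster(client=True)
async def test_resumed_cancelled_handle_compute_sketch(
    c, s, a, b, raise_error, wait_for_processing
):
    # Placeholder: the real test cancels and resumes a task; here we only
    # show how raise_error toggles between the error and success paths.
    # wait_for_processing (which controls timing in the real test) is unused.
    def work(x):
        if raise_error:
            raise RuntimeError("boom")
        return x + 1

    fut = c.submit(work, 1, key="x")
    if raise_error:
        with pytest.raises(RuntimeError):
            await fut
    else:
        assert await fut == 2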
rerun tests
Note: rerunning gpuCI tests since those errors should be fixed by #6434
"""Test that it is OK for a dependency to be in state missing if a dependent is asked to be computed""" | ||
|
||
f1 = c.submit(inc, 1, key="f1", workers=[w1.address]) | ||
f2 = c.submit(inc, 2, key="f2", workers=[w1.address]) |
f2 is unnecessary for the purpose of this test
I added this to have multiple keys fetched from w1. I don't think it is absolutely required, but I decided to keep it in the test.
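A self-contained sketch of that pattern, assuming the usual two-worker gen_cluster fixture (the dependent f3 below is invented for illustration and is not the test's actual graph): pinning two keys to w1 forces a dependent computed on w2 to gather several keys from the same peer.

from distributed.utils_test import gen_cluster, inc

@gen_cluster(client=True)
async def test_fetch_multiple_keys_sketch(c, s, w1, w2):
    # Both keys live on w1 ...
    f1 = c.submit(inc, 1, key="f1", workers=[w1.address])
    f2 = c.submit(inc, 2, key="f2", workers=[w1.address])
    # ... so computing f3 on w2 has to fetch both f1 and f2 from w1.
    f3 = c.submit(sum, [f1, f2], key="f3", workers=[w2.address])
    assert await f3 == 5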
distributed/worker.py
Outdated
if key in self.data:
    del self.data[key]
if key in self.actors:
    del self.actors[key]
Suggested change: replace

    if key in self.data:
        del self.data[key]
    if key in self.actors:
        del self.actors[key]

with

    self.data.pop(key, None)
    self.actors.pop(key, None)
Note: self.data.pop may raise OSError if the key is spilled to disk and there is a disk malfunction. This should be treated as an exceptional, unrecoverable situation and is dealt with by the @fail_hard decorator around handle_stimulus.
Yes, I decided to get rid of the FileNotFoundError here. If this actually causes a problem, I prefer to fix this in our data/spilling layer.
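To make the fail-hard idea mentioned above concrete, here is a simplified sketch of that kind of wrapper (this is not the actual distributed.worker.fail_hard implementation; the close() call is an assumed shutdown hook): any exception escaping the wrapped handler is logged and treated as unrecoverable rather than swallowed.

import functools
import logging

logger = logging.getLogger(__name__)

def fail_hard_sketch(method):
    """Treat any escaping exception, e.g. an OSError from reading a spilled
    key during data.pop(), as unrecoverable and shut the worker down."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except Exception:
            logger.exception("Unrecoverable error in %s; closing", method.__name__)
            self.close()  # assumed shutdown hook on the wrapped object
            raise

    return wrapper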
f1.release()
f2.release()
f3.release()
f4.release()
Is the scheduler going to send the release_key commands to the worker starting from the dependents and then descending into the dependencies, or in random order?
It always walks back the dependency tree. There is some ambiguity on worker A since, once we start computing f3, the only reason why f1 is not released yet is that the client holds on to an explicit ref. Therefore, the order in which f1 and f2 are released on A depends on the order in which the client releases the keys.

However, this test focuses exclusively on worker B, so I think it is fine to accept this ambiguity in favor of test runtime and test complexity. If you think it matters, I can parametrize the test to switch this ordering.
About release vs. del:
I use release mostly because I like holding on to the future object so I don't depend on hard-coded key names. That's a style choice. Using del f1, f2, f3, f4 should have exactly the same effect, only that I am relying on Python ref counting to release them instead of releasing them explicitly. I like being explicit as well, but that's again a style choice.
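A small illustration of the two styles being compared (a sketch under the usual gen_cluster fixture, not part of this PR's test):

from distributed.utils_test import gen_cluster, inc

@gen_cluster(client=True)
async def test_release_vs_del_sketch(c, s, a, b):
    f1 = c.submit(inc, 1, key="f1")
    await f1
    f1.release()  # explicit: drop the client's reference to the key now

    f2 = c.submit(inc, 2, key="f2")
    await f2
    del f2  # implicit: CPython ref counting triggers the same release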
self._notify_plugins(
    "release_key", key, state_before, cause, stimulus_id, report
)
This is now gone
-    except InvalidTransition:
+    # ValueError may be raised by merge_recs_instructions
+    # TODO: should merge_recs raise InvalidTransition?
+    except (ValueError, InvalidTransition):
I really don't like this. ValueError is too broad and generic to be handled here. I don't want to hold the PR up; I'll open a new one shortly to narrow it down.
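To make the intended narrowing concrete, here is a rough, hypothetical sketch (follow-up material, not code in this PR): raise a dedicated exception where the conflict is detected, so the transition loop no longer needs to catch a bare ValueError. The merge function below is a simplified stand-in for merge_recs_instructions, not its real implementation.

class InvalidTransition(Exception):
    """Raised when transition recommendations conflict."""

def merge_recommendations_sketch(*mappings):
    # Merge several {key: next_state} dicts, rejecting conflicting recommendations.
    merged = {}
    for recs in mappings:
        for key, finish in recs.items():
            if key in merged and merged[key] != finish:
                # Previously this surfaced as a generic ValueError; a dedicated
                # exception lets callers catch exactly this case.
                raise InvalidTransition(
                    f"conflicting recommendations for {key!r}: "
                    f"{merged[key]!r} vs {finish!r}"
                )
            merged[key] = finish
    return merged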
Running test_scatter_death locally, I ran into three issues. One of them: the transition_released_waiting transition can cause an assert ts.who_has failure.

Closes #6305
Closes #6191
I'm pretty sure 1. and 2. have been around for a while. No idea why this is more likely to fail lately. @crusaderky mentioned a potential connection to #6210, indicating that this always asserted/raised but we simply never noticed.

I am also fairly certain that the system could recover from 2. and would not deadlock if we did not raise the assert.
cc @crusaderky @jrbourbeau