Harden preamble of Worker.execute against race conditions #6878
Conversation
Force-pushed from 090902b to 38bf365 (Compare)
# This is just for internal coherence of the WorkerState; the reschedule
# message should not ever reach the Scheduler.
# It is still OK if it does though.
return RescheduleEvent(key=key, stimulus_id=f"worker-closing-{time()}")
The two new tests remain green if you revert the first line of this block:
if self.status in {Status.closing, Status.closed, Status.closing_gracefully}:
return RescheduleEvent(key=key, stimulus_id=f"worker-closing-{time()}")
In this case, the reschedule event actually reaches the scheduler.
I'm concerned that this causes the scheduler to reschedule a task twice. First, because the worker left, and second, because the reschedule event arrives.
Why is the previous version of returning None not sufficient?
Edit: see distributed/scheduler.py, lines 6740 to 6741 in 75ef3c9:

if worker and ts.processing_on.address != worker:
    return
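To make the double-reschedule concern concrete, here is a minimal, self-contained sketch of that scheduler-side guard. TinyScheduler, handle_reschedule, and the addresses are hypothetical stand-ins; only the two-line check mirrors the scheduler.py excerpt above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskState:
    key: str
    processing_on: Optional[str]  # address of the worker running the task

class TinyScheduler:
    """Hypothetical stand-in for the scheduler's reschedule handling."""

    def __init__(self) -> None:
        self.tasks: dict[str, TaskState] = {}
        self.reschedules: list[str] = []

    def handle_reschedule(self, key: str, worker: str) -> None:
        ts = self.tasks.get(key)
        if ts is None or ts.processing_on is None:
            return
        # Mirrors the scheduler.py excerpt above: drop reschedule requests
        # coming from a worker that no longer runs the task.
        if worker and ts.processing_on != worker:
            return
        ts.processing_on = None
        self.reschedules.append(key)

s = TinyScheduler()
# The worker at tcp://a:1234 left; the task was already reassigned elsewhere.
s.tasks["x"] = TaskState("x", processing_on="tcp://b:5678")

# A stale RescheduleEvent from the departed worker now arrives: it is a
# no-op, so the task is not rescheduled a second time.
s.handle_reschedule("x", worker="tcp://a:1234")
assert s.reschedules == []
```

Under these assumptions, the second reschedule is harmlessly discarded, which is why the stray message reaching the scheduler would still be OK.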
"Trying to execute task %s which is not in executing state anymore", | ||
ts, | ||
) | ||
return AlreadyCancelledEvent(key=ts.key, stimulus_id=stimulus_id) |
The two new tests remain green if, instead of removing this block completely, you just replace the exit event with
return RescheduleEvent(key=key, stimulus_id=f"already-cancelled-{time()}")
They also remain green if you copy-paste the same block after the call to _maybe_deserialize_task.
In both cases, the reschedule event reaches the scheduler and behaves as expected.
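As a rough illustration of why the guard's placement matters, here is a self-contained toy in which all names are hypothetical except _maybe_deserialize_task, which echoes the real method name. A cancellation that lands during the deserialization await is only caught by a guard placed after that await:

```python
import asyncio

class ToyWorker:
    def __init__(self) -> None:
        self.executing = {"x"}  # keys currently in "executing" state

    async def _maybe_deserialize_task(self, key: str):
        # Any await is a point where another coroutine may cancel the task.
        await asyncio.sleep(0)
        return lambda: 42

    async def execute(self, key: str):
        # Guard before deserialization (the block quoted above):
        if key not in self.executing:
            return "already-cancelled (pre-deserialize guard)"
        fn = await self._maybe_deserialize_task(key)
        # The suggestion above: repeat the guard after the await, too.
        if key not in self.executing:
            return "already-cancelled (post-deserialize guard)"
        return fn()

async def main() -> None:
    w = ToyWorker()
    fut = asyncio.ensure_future(w.execute("x"))
    await asyncio.sleep(0)    # let execute() run up to its first await
    w.executing.discard("x")  # cancellation races in mid-execute
    print(await fut)          # -> already-cancelled (post-deserialize guard)

asyncio.run(main())
```

In this toy, only the second guard catches the cancellation, which matches the observation that duplicating the block after _maybe_deserialize_task keeps the tests green.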
    ts,
)
return AlreadyCancelledEvent(key=ts.key, stimulus_id=stimulus_id)
if self.status not in WORKER_ANY_RUNNING:
Align execute and gather_dep.
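For context, a rough sketch of the shared preamble check that such an alignment implies. The Status values and the WORKER_ANY_RUNNING membership below are stand-ins modelled on distributed, not verbatim definitions:

```python
from enum import Enum, auto

class Status(Enum):
    running = auto()
    paused = auto()
    closing_gracefully = auto()
    closing = auto()
    closed = auto()

# Stand-in for distributed's WORKER_ANY_RUNNING set:
WORKER_ANY_RUNNING = {Status.running, Status.paused, Status.closing_gracefully}

def should_bail(status: Status) -> bool:
    # The same-shaped preamble check in both execute() and gather_dep():
    # short-circuit unless the worker is in a running-ish status.
    return status not in WORKER_ANY_RUNNING

assert should_bail(Status.closing)    # shutting down -> abort / reschedule
assert not should_bail(Status.paused) # paused still counts as running
```

Sharing one membership test keeps the two code paths from drifting apart as new worker statuses are added.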
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0, 15 suites ±0, 6h 24m 56s ⏱️ -8m 24s
For more details on these failures, see this check.
Results for commit 38bf365. ± Comparison against base commit 99a2db1.
Looks great! Thanks @crusaderky!
- test_blockwise_concatenate flaky (dask#9330)
- WorkerState.execute (#6869)
- closing_gracefully status, discussed in Deadlock when emerging from closing_gracefully (#6867)