Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assert that a fetch->cancelled->resumed->fetch cycle is impossible #6460

Merged
merged 1 commit into from
Jun 1, 2022

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented May 26, 2022

transition_resumed_fetch calls _transition_from_resumed, which in turn contains this:

        if ts._previous == finish:
            # We're back where we started. We should forget about the entire
            # cancellation attempt
            ts.state = finish

This would cause a deadlock in transition_resumed_fetch, as nothing would kick off _ensure_communicating and the task would potentially stay hanging in fetch state forever.

However, after further inspection of the code, I understand that the only possible values for ts._previous are None, executing, long-running, and flight.

@crusaderky crusaderky force-pushed the fetch_cancelled_fetch branch from d589739 to b5beb4c Compare May 26, 2022 15:11
@crusaderky crusaderky self-assigned this May 26, 2022
@crusaderky crusaderky requested a review from fjetter May 26, 2022 15:12
@pentschev
Copy link
Member

rerun tests

Note: rerunning gpuCI tests since those errors should be fixed by #6434

@github-actions
Copy link
Contributor

Unit Test Results

       15 files  ±0         15 suites  ±0   6h 11m 1s ⏱️ - 9m 44s
  2 813 tests ±0    2 731 ✔️  -   1    80 💤 ±  0  2 +1 
20 854 runs   - 6  19 918 ✔️ +51  934 💤  - 58  2 +1 

For more details on these failures, see this check.

Results for commit b5beb4c. ± Comparison against base commit e0ea5df.

@crusaderky crusaderky force-pushed the fetch_cancelled_fetch branch from 7a084a7 to 0f1b675 Compare June 1, 2022 14:32
@fjetter
Copy link
Member

fjetter commented Jun 1, 2022

However, after further inspection of the code, I understand that the only possible values for ts._previous are None, executing, long-running, and flight.

Yes, this should probably be asserted/validated as part of the if clause

@fjetter
Copy link
Member

fjetter commented Jun 1, 2022

How about adding

diff --git a/distributed/worker.py b/distributed/worker.py
index c306c5af5..73d519d57 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -4250,13 +4250,13 @@ class Worker(ServerNode):

     def validate_task_cancelled(self, ts):
         assert ts.key not in self.data
-        assert ts._previous
+        assert ts._previous in {"long-running", "executing", "flight"}
         assert ts._next is None  # We'll always transition to released after it is done

     def validate_task_resumed(self, ts):
         assert ts.key not in self.data
         assert ts._next
-        assert ts._previous
+        assert ts._previous in {"long-running", "executing", "flight"}

     def validate_task_released(self, ts):
         assert ts.key not in self.data

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with this assert. Not sure how valuable it will be but I agree if it fails, it should fail hard.

I opened #6488 for the other suggestion such that we can merge this faster

@crusaderky crusaderky merged commit ebef7a4 into dask:main Jun 1, 2022
@crusaderky crusaderky deleted the fetch_cancelled_fetch branch June 1, 2022 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants