Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_transition_from_resumed contains legacy code and documentation #6693

Closed
hendrikmakait opened this issue Jul 8, 2022 · 1 comment · Fixed by #6699
Closed

_transition_from_resumed contains legacy code and documentation #6693

hendrikmakait opened this issue Jul 8, 2022 · 1 comment · Fixed by #6699
Assignees

Comments

@hendrikmakait
Copy link
Member

_transition_from_resumed should not be reachable from a code path where ts._previous == "executing". Yet, it states this in the docstring and performs checks for this behavior. We should revisit the code and explanation and clean it up to improve readability.

@crusaderky
Copy link
Collaborator

crusaderky commented Jul 9, 2022

These three are really the same issue:

if ts._previous == finish:
# We're back where we started. We should forget about the entire
# cancellation attempt
ts.state = finish
ts._next = None
ts._previous = None
elif not ts.done:
# If we're not done, yet, just remember where we want to be next
ts._next = finish
else:
# Flight/executing finished unsuccessfully, i.e. not in memory
assert finish != "memory"
next_state = ts._next
assert next_state in {"waiting", "fetch"}, next_state
assert ts._previous in {"executing", "long-running", "flight"}, ts._previous
if next_state != finish:
recs, instructions = self._transition_generic_released(
ts, stimulus_id=stimulus_id
)
recs[ts] = next_state

Lines 1949-1953 are unreachable, because if ts._previous == finish: does not do what the author thought:

  • ts._previous is either executing, long-running, or flight
  • finish is either waiting, or fetch, or missing

As a side note, I'm unsure how a transition resumed->missing can ever happen, but I'm fairly sure there's no unit test for it.

As a consequence, the use case

  • flight->cancelled->resumed(waiting)->fetch
  • executing->cancelled->resumed(fetch)->executing
  • long-running->cancelled->resumed(fetch)->long-running

is unwittingly covered by lines 1958+.

Those lines do not cover the use case # Flight/executing finished unsuccessfully, i.e. not in memory.

  • If executing fails, the transition resumed->error is performed by _transition_generic_error, which causes all sorts of problems (Deadlock when a flaky task is stolen #6689).
  • if flight fails with GatherDepFailureEvent, you also have the same _transition_generic_error which... might? ... work as intended (missing a test).
  • if flight fails with GatherDepNetworkFailureEvent or GatherDepSuccessEvent (but no data), then you do have a transition to executing served by lines 1958+, and that's tested by test_workerstate_flight_failure_to_executing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants