Another deadlock in the preamble of WorkerState.execute #6869

Closed
crusaderky opened this issue Aug 10, 2022 · 4 comments · Fixed by #6878
@crusaderky
Collaborator

This is tightly related to #6867 and dask/dask#9330.

There is a deadlock which is triggered by this code path:

if ts.state == "cancelled":
    logger.debug(
        "Trying to execute task %s which is not in executing state anymore",
        ts,
    )
    return AlreadyCancelledEvent(key=ts.key, stimulus_id=stimulus_id)

which in turn triggers:

@_handle_event.register
def _handle_already_cancelled(self, ev: AlreadyCancelledEvent) -> RecsInstrs:
    """Task is already cancelled by the time execute() runs"""
    # key *must* be still in tasks. Releasing it directly is forbidden
    # without going through cancelled
    ts = self.tasks.get(ev.key)
    assert ts, self.story(ev.key)
    ts.done = True
    return {ts: "released"}, []

The deadlock should be reproducible as follows (a minimal asyncio sketch of the timing follows these steps):

  1. handle_stimulus(ComputeTaskEvent(key="x"))
    ts.state=executing; create asyncio task for Worker.execute
  2. handle_stimulus(FreeKeysEvent(keys=["x"]))
    ts.state=cancelled
  3. await asyncio.sleep(0)
    Worker.execute runs and returns AlreadyCancelledEvent.
    This causes the _handle_stimulus_from_task callback to be appended to the end of the event loop.
    However, the test suite's next call comes before that callback in the event loop:
  4. handle_stimulus(ComputeTaskEvent(key="x"))
    ts.state=resumed
  5. await ... (anything that releases the event loop)
    This runs _handle_stimulus_from_task,
    which runs _handle_already_cancelled,
    which returns {ts: "released"},
    which triggers the (resumed, released) transition,
    which sends the task to cancelled state, while the scheduler thinks it's running.
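
The crux of steps 3-5 is that a callback attached with add_done_callback runs one event-loop iteration after the task finishes. A minimal, self-contained asyncio sketch of that ordering (execute, handle_result and the event strings are placeholders, not distributed's API):

import asyncio

events = []

async def execute():
    # stands in for Worker.execute returning AlreadyCancelledEvent
    events.append("execute returned")

def handle_result(task):
    # stands in for the _handle_stimulus_from_task done callback
    events.append("result handled")

async def main():
    task = asyncio.create_task(execute())
    task.add_done_callback(handle_result)

    await asyncio.sleep(0)          # step 3: execute() runs and finishes...
    events.append("new stimulus")   # step 4: ...but another stimulus sneaks in first
    await asyncio.sleep(0)          # step 5: only now does the done callback run

    assert events == ["execute returned", "new stimulus", "result handled"]

asyncio.run(main())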

@fjetter @gjoseph92 my head is spinning.

@crusaderky crusaderky added the deadlock (The cluster appears to not make any progress) label Aug 10, 2022
@crusaderky crusaderky self-assigned this Aug 10, 2022
@fjetter
Member

fjetter commented Aug 11, 2022

  1. The issue you are describing is caused by the way we constructed the state machine <-> event loop handoff in Worker.handle_stimulus, which creates a task; _handle_stimulus_from_task is a callback that listens to the return value of the task, which is again a stimulus. The issue occurs because done callbacks are scheduled on the event loop and executed on the next tick rather than immediately. That allows an event to sneak in between execute finishing and us handling the result.
    Regardless of what the task state flow looks like, this is a design flaw, since a major part of the refactor was about not allowing this kind of race condition. I think asyncio doesn't give us the right tools to implement this properly. The only way I can think of is to change our handover interface such that the result is handled synchronously, e.g.
class Worker:
    async def execute(self, ..., result_handler):
        ...
        return result_handler(AlreadyCancelledEvent(...))

result_handler in this case would be a closure/partial that binds the state machine. Ugly but possible.
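
For illustration, a tiny generic example of what "a closure/partial that binds the state machine" means here (StateMachine and the string stimulus are made up for the example, not distributed's classes):

from functools import partial

class StateMachine:
    def handle_stimulus(self, stim):
        print(f"handling {stim}")

state = StateMachine()

# a partial that binds the state machine instance...
result_handler = partial(StateMachine.handle_stimulus, state)
# ...or, equivalently, a bound-method closure
result_handler = state.handle_stimulus

result_handler("AlreadyCancelledEvent(...)")  # execute() would call this synchronously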

cc @graingert

  2. If I read the code correctly, we'd end up in a released/forgotten state, not a cancelled one. It doesn't really matter, though, since the scheduler would still think the task is being computed and we'd have a problem. I think there are a couple of ways around this, e.g. by making _handle_already_cancelled smarter. I'm not excited about this option, since the refactor was intended to avoid this exact situation, and I think we should fix 1.) by handling the result immediately.

Overall, I believe we should just drop AlreadyCancelledEvent entirely. I don't think it generates a lot of value. The time window between scheduling the coroutine and scheduling the task on the threadpool is incredibly small, and if we miss this time window it's useless. We did the same thing for gather_dep, where we removed a filter step at the beginning of the coro to reduce complexity. I think we should do the same here as well.

crusaderky added a commit to crusaderky/distributed that referenced this issue Aug 11, 2022
@crusaderky
Collaborator Author

Yes, I agree AlreadyCancelledEvent is superfluous

@crusaderky
Collaborator Author

class Worker:
    async def execute(self, ..., result_handler):
        ...
        return result_handler(AlreadyCancelledEvent(...))

result_handler in this case would be a closure/partial that binds the state machine. Ugly but possible.

I'm strongly opposed to this. It would solve only 1 out of the 4 async points of Worker.execute (coroutine start, _maybe_deserialize_task, actual execution, and coroutine exit).

@graingert
Member

graingert commented Aug 11, 2022

That allows an event to sneak in between execute finishing and us handling the result.

Where would that event come from? Is this some code executing loop.call_soon during the last .send() of async def execute()?

The time window between scheduling the coroutine and scheduling the task on the threadpool is incredibly small

If the threadpool has an idle worker I'd expect the threadpool to win a race with call_soon.

I think doing something like:

async def await_then_call(async_fn, fn):
    fn(await async_fn())

create_task(await_then_call(execute, callback))

would be cleaner than changing execute, if you do plan on doing that.
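
A self-contained sketch of why such a wrapper closes the window (execute, handle_result and the event strings are placeholders, not the real distributed code): fn runs inside the same task step in which async_fn returns, so nothing scheduled with call_soon can slip in between the result and its handling.

import asyncio

events = []

async def execute():
    await asyncio.sleep(0)          # suspend at least once, like the real coroutine
    events.append("execute returned")
    return "AlreadyCancelledEvent"  # stand-in for the returned stimulus

def handle_result(result):
    # stand-in for feeding the stimulus back into the state machine
    events.append(f"handled {result}")

async def await_then_call(async_fn, fn):
    # fn(...) runs in the same task step in which async_fn() returns
    fn(await async_fn())

async def main():
    task = asyncio.create_task(await_then_call(execute, handle_result))
    while not task.done():
        events.append("loop tick")  # other work can interleave *while* execute runs...
        await asyncio.sleep(0)
    # ...but never between execute returning and its result being handled
    assert events[-2:] == ["execute returned", "handled AlreadyCancelledEvent"]

asyncio.run(main())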
