
Fix a deadlock connected to task stealing and task deserialization #5128

Merged
merged 4 commits on Jul 30, 2021

Conversation

fjetter
Member

@fjetter fjetter commented Jul 27, 2021

If a task is stolen while its runspec is being deserialized, an edge case can occur in which executing_count is never decremented again, so the ready queue is never worked off.

The second commit refactors the Worker.execute exception handling a bit and covers everything in a single try/except. I removed the BaseException catch for the process pool since I do not think it is necessary. If users actually want to sys.exit, who are we to judge? In real-world examples, if a process gets killed, this is typically done by the OS using a signal like SIGINT or SIGTERM, and in those situations the behaviour of such a process pool is quite different: the pool ends up completely broken, and I don't see anything we can do about that right now. See also #5075

Closes #5119

This bug is not necessarily the direct cause of that issue, but this PR removes the only code path where the mentioned transition could even occur.
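The idea, as a minimal sketch only (the names `executing_count` and `ready` and the pickled runspec are toy stand-ins, not the real Worker internals):

```python
import pickle

# Toy stand-ins for Worker state; not the actual distributed.Worker API.
executing_count = 0
ready = []  # simplified "ready queue"


def execute(runspec_bytes):
    """Single try/except/finally covering deserialization *and* execution."""
    global executing_count
    executing_count += 1
    try:
        # Deserializing the runspec happens inside the same try block as
        # running the task, so a failure here (e.g. the task being stolen
        # while we deserialize) can no longer skip the bookkeeping below.
        func, args = pickle.loads(runspec_bytes)
        return ("finished", func(*args))
    except Exception as exc:
        # Ordinary exceptions become task error results; BaseException
        # (e.g. SystemExit) is deliberately left uncaught.
        return ("error", exc)
    finally:
        # The invariant the deadlock violated: executing_count must always
        # be decremented, otherwise the ready queue is never worked off.
        executing_count -= 1


# Example: execute(pickle.dumps((sum, ([1, 2, 3],)))) -> ("finished", 6)
```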

@fjetter
Member Author

fjetter commented Jul 27, 2021

FYI, exception handling is not dealt with properly yet, but that's easy to do. Will deal with this tomorrow. Done

@fjetter fjetter marked this pull request as ready for review on July 28, 2021, 13:41
@fjetter fjetter changed the title from "Fix a deadlock connected to task stealing and deserialization" to "Fix a deadlock connected to task stealing and task deserialization" on Jul 28, 2021
@mrocklin
Member

I removed the BaseException catch for the process pool since I do not think it is necessary. If users actually want to sys.exit, who are we to judge? In real-world examples, if a process gets killed, this is typically done by the OS using a signal like SIGINT or SIGTERM, and in those situations the behaviour of such a process pool is quite different: the pool ends up completely broken, and I don't see anything we can do about that right now. See also #5075

So what is the behavior if the underlying task does something like segfault? And what was it before?

@fjetter
Member Author

fjetter commented Jul 30, 2021

So what is the behavior if the underlying task does something like segfault? And what was it before?

Segfaults will tear down the interpreter immediately at the C level. The process will be killed by the OS and there is nothing we can do about that. This will not result in a BaseException but in a BrokenProcessPool exception. In that case the pool will be unusable and the only sane thing we can do is to shut down the worker. However, this is something I didn't implement since I don't know right now how we want to deal with this special case. It is tested, though, since it is almost the same situation as SIGINT/SIGTERM.
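As a standalone sketch of that behaviour with the stdlib ProcessPoolExecutor (the crash is simulated with os._exit rather than a real segfault):

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def crash():
    # Simulate an abrupt, segfault-like death of the worker process.
    os._exit(1)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(crash)
        try:
            future.result()
        except BrokenProcessPool as exc:
            # BrokenProcessPool is an ordinary Exception subclass, but the
            # pool is unusable from now on: every further submit() fails too.
            print("pool is broken:", exc)
```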

sys.exit, on the other hand, raises a BaseException (SystemExit) which is usually not caught by ordinary exception handling. For a thread pool executor this will simply bubble up the stack and terminate the Python interpreter, still allowing all sorts of finally clauses to be executed.
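A minimal demonstration of that distinction, independent of distributed:

```python
import sys

try:
    sys.exit(1)
except Exception:
    print("not reached: SystemExit does not derive from Exception")
except BaseException as exc:
    print("caught", type(exc).__name__)  # caught SystemExit
```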

For the process pool, the behaviour is:

Previous (on main): the worker catches this BaseException and logs it as an ordinary user exception.
Now: it terminates the worker, same as for the thread pool executor.

Labels: deadlock (The cluster appears to not make any progress), stealing

Successfully merging this pull request may close these issues.

2021.7.1 hangs when executing a from_pandas task with distributed.worker - ERROR - ('memory', 'executing')