-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a deadlock connected to task stealing and task deserialization #5128
Conversation
If a task is stolen while the task runspec is being deserialized this allows for an edge case where the executing_count is never decreased again such that the ready queue is never worked off
|
So what is the behavior if the underlying task does something like segfaults? And what was it before? |
segfaults will tear down the interpreter immediately at C level. The process will be killed by the OS and there is nothing we can do about this. this will not result in a BaseException but in a BrokenProcessPool exception. In this case the pool will be unusable and the only sane thing we can do is to shutdown the worker. However, this is something I didn't implement since I don't know right now how we want to deal with this special case. This is tested, though, since it is almost the same thing as for SIGINT/SIGTERM
for processpools behaviour: Previous (on main): The worker catches this BaseException and logs it as an ordinary user exception |
Follow up to dask#5128
If a task is stolen while the task runspec is being deserialized this allows
for an edge case where the executing_count is never decreased again
such that the ready queue is never worked off
The second commit refactors the Worker.execute exception handling a bit and covers everything in a single try/except. I removed the BaseException catch for the process pool since I do not think this is necessary. If users actually want to sys.exit who are we to judge? In real world examples if a process gets killed this is typically done by the OS using a signal like SIGINT or SIGTERM and for these situations the behaviour of such a processpool is actually quite different. the processpool ends up completely broken but I don't see anything we can do about this right now. See also #5075
Closes #5119
This issue is not a direct cause but this PR removes the only code path where the mentioned transition could even occur.