
Fix a deadlock connected to task stealing and task deserialization #5128

Merged
merged 4 commits on Jul 30, 2021

Conversation

fjetter
Member

@fjetter fjetter commented Jul 27, 2021

If a task is stolen while its runspec is being deserialized, an edge case can occur in which executing_count is never decremented again, so the ready queue is never worked off.

The second commit refactors the Worker.execute exception handling a bit and covers everything in a single try/except. I removed the BaseException catch for the process pool since I do not think it is necessary. If users actually want to sys.exit, who are we to judge? In real-world examples, if a process gets killed, this is typically done by the OS using a signal like SIGINT or SIGTERM, and in those situations the behaviour of such a process pool is quite different: the pool ends up completely broken, and I don't see anything we can do about that right now. See also #5075

Closes #5119

This bug is not necessarily the direct cause of that issue, but this PR removes the only code path where the mentioned transition could even occur.
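The idea, as a minimal sketch only (the names `executing_count` and `ready` and the pickled runspec are toy stand-ins, not the real Worker internals):

```python
import pickle

# Toy stand-ins for Worker state; not the actual distributed.Worker API.
executing_count = 0
ready = []  # simplified "ready queue"


def execute(runspec_bytes):
    """Single try/except/finally covering deserialization *and* execution."""
    global executing_count
    executing_count += 1
    try:
        # Deserializing the runspec happens inside the same try block as
        # running the task, so a failure here (e.g. the task being stolen
        # while we deserialize) can no longer skip the bookkeeping below.
        func, args = pickle.loads(runspec_bytes)
        return ("finished", func(*args))
    except Exception as exc:
        # Ordinary exceptions become task error results; BaseException
        # (e.g. SystemExit) is deliberately left uncaught.
        return ("error", exc)
    finally:
        # The invariant the deadlock violated: executing_count must always
        # be decremented, otherwise the ready queue is never worked off.
        executing_count -= 1


# Example: execute(pickle.dumps((sum, ([1, 2, 3],)))) -> ("finished", 6)
```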

@fjetter
Member Author

fjetter commented Jul 27, 2021

FYI, exception handling is not dealt with properly yet, but that's easy to do. Will deal with this tomorrow. Done

@fjetter fjetter marked this pull request as ready for review on July 28, 2021, 13:41
@fjetter fjetter changed the title from "Fix a deadlock connected to task stealing and deserialization" to "Fix a deadlock connected to task stealing and task deserialization" on Jul 28, 2021
@mrocklin
Member

I removed the BaseException catch for the process pool since I do not think it is necessary. If users actually want to sys.exit, who are we to judge? In real-world examples, if a process gets killed, this is typically done by the OS using a signal like SIGINT or SIGTERM, and in those situations the behaviour of such a process pool is quite different: the pool ends up completely broken, and I don't see anything we can do about that right now. See also #5075

So what is the behavior if the underlying task does something like segfault? And what was it before?

@fjetter
Member Author

fjetter commented Jul 30, 2021

So what is the behavior if the underlying task does something like segfault? And what was it before?

Segfaults will tear down the interpreter immediately at the C level. The process will be killed by the OS and there is nothing we can do about that. This will not result in a BaseException but in a BrokenProcessPool exception. In that case the pool will be unusable and the only sane thing we can do is to shut down the worker. However, this is something I didn't implement since I don't know right now how we want to deal with this special case. It is tested, though, since it is almost the same situation as SIGINT/SIGTERM.
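As a standalone sketch of that behaviour with the stdlib ProcessPoolExecutor (the crash is simulated with os._exit rather than a real segfault):

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def crash():
    # Simulate an abrupt, segfault-like death of the worker process.
    os._exit(1)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(crash)
        try:
            future.result()
        except BrokenProcessPool as exc:
            # BrokenProcessPool is an ordinary Exception subclass, but the
            # pool is unusable from now on: every further submit() fails too.
            print("pool is broken:", exc)
```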

sys.exit, on the other hand, raises a BaseException (SystemExit) which is usually not caught by ordinary exception handling. For a thread pool executor this will simply bubble up the stack and terminate the Python interpreter, still allowing all sorts of finally clauses to be executed.
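A minimal demonstration of that distinction, independent of distributed:

```python
import sys

try:
    sys.exit(1)
except Exception:
    print("not reached: SystemExit does not derive from Exception")
except BaseException as exc:
    print("caught", type(exc).__name__)  # caught SystemExit
```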

For the process pool, the behaviour is:

Previous (on main): the worker catches this BaseException and logs it as an ordinary user exception.
Now: it terminates the worker, same as for the thread pool executor.

Labels: deadlock (The cluster appears to not make any progress), stealing

Successfully merging this pull request may close these issues.

2021.7.1 hangs when executing a from_pandas task with distributed.worker - ERROR - ('memory', 'executing')