Task stuck processing on non-existent worker #3256
That log message is under the condition:

    if ws is not ts.processing_on:  # someone else has this task
        logger.info(
            "Unexpected worker completed task, likely due to"
            " work stealing. Expected: %s, Got: %s, Key: %s",
            ts.processing_on,
            ws,
            key,
        )
        return {}

We're doing an identity, rather than equality, check there. At a glance, I don't see any other places where we do that. cc @mrocklin if you have any guesses. |
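For illustration, here is a minimal, self-contained sketch of how an identity check can treat a logically equal worker as a stranger, for example if a reconnect produced a fresh WorkerState object for the same address. The class below is a stand-in, not the real distributed WorkerState, and equality-by-address is purely an assumption for the example:

    # Stand-in class for illustration only; the real distributed WorkerState does
    # not necessarily define equality like this.
    class WorkerState:
        def __init__(self, address):
            self.address = address

        def __eq__(self, other):
            # Hypothetical "equality" by address, to contrast with identity.
            return isinstance(other, WorkerState) and self.address == other.address

        def __hash__(self):
            return hash(self.address)


    expected = WorkerState("tcp://10.0.0.1:40000")     # what ts.processing_on records
    reconnected = WorkerState("tcp://10.0.0.1:40000")  # fresh object after a reconnect

    print(expected == reconnected)  # True:  an equality check would accept this worker
    print(expected is reconnected)  # False: the identity check treats it as someone else

In the real scheduler the two would normally be the same instance, so this only illustrates how `is` differs from `==`, not that a second instance is what happens here.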
I don't think this is connected to #3246. The issue there is about connection failures and occurs after the task has actually finished. |
You're probably right, but I would add that in this case the task also finishes and there's a connection failure (caused by the worker dying, but same difference right?). So at the very least it seems like that change might have an effect here. |
@bnaul how easy is it for you to reproduce this failure? If it's somewhat easy then you might want to try setting validate=True on the scheduler. That might help us to pinpoint the cause. (adding this as a configuration option here: #3258) If there is a task that is processing on a worker that doesn't exist then that's obviously a bug. We haven't seen bugs like this for a while, but I guess they still have to happen from time to time. Another action to diagnose this would be to turn off work-stealing and see if the problem goes away. That might help isolate things to work-stealing. |
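A rough sketch of those two diagnostic switches, assuming the scheduler is started from Python; the work-stealing config key exists in distributed today, while validate=True is the Scheduler keyword that #3258 proposes exposing through configuration as well:

    import dask
    from distributed import Scheduler

    # Rule work stealing in or out as the culprit by disabling it before startup.
    dask.config.set({"distributed.scheduler.work-stealing": False})

    # Turn on the scheduler's internal consistency checks (expensive; diagnostics only).
    # The scheduler still has to be started/awaited as usual.
    scheduler = Scheduler(validate=True)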
Not easy at all unfortunately; it seems to be a pretty rare event (say once every 1000 hrs of worker compute time for our workload). I also suspect that work stealing is probably the culprit, but even after turning it off it'll probably be a while before I can say with much confidence that it's "resolved". Also happy to try some runs with validate=True. |
Also @mrocklin we kept the deadlocked scheduler pod around to keep testing things; if there's any other information about the internal state that would be helpful, let me know and we'll add it here. |
Well, if you wanted to you could look at |
Validate is expensive, but honestly I don't have a sense for how expensive when run in a real world setting. We mostly use this for testing, and once we've isolated things down to a small case. |
This did end up being fixed by #3321 for the record |
Woot
…On Sat, Jun 13, 2020 at 8:38 AM Brett Naul wrote: Closed #3256.
|
At the end of long-running jobs, I'm often seeing one or two tasks that never finish. Looking a bit more closely, it seems that the task is processing on a non-existent worker. Even after closing every worker, the status of this task doesn't change.
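A hedged sketch of how one might confirm the mismatch from the client side with Client.run_on_scheduler; the scheduler address and the stuck task's key below are placeholders:

    from distributed import Client

    client = Client("tcp://scheduler-address:8786")  # placeholder address

    def inspect(dask_scheduler, key):
        ts = dask_scheduler.tasks[key]  # TaskState of the stuck task
        ws = ts.processing_on           # WorkerState it is supposedly running on
        return {
            "state": ts.state,
            "processing_on": None if ws is None else ws.address,
            "worker_still_known": ws is not None and ws.address in dask_scheduler.workers,
        }

    print(client.run_on_scheduler(inspect, key="stuck-task-key"))  # placeholder key
    # A stuck task here shows state == "processing" with worker_still_known == False.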
Searching through the scheduler logs for this task I found:
Seems pretty fishy that a worker could steal a task from itself? 🤔
Some other context:
- dask.as_completed(futures); del futures (see the sketch below)
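For reference, a minimal sketch of that consumption pattern, where the futures list is dropped so finished tasks can be released; the scheduler address and the work function are placeholders:

    from dask.distributed import Client, as_completed

    client = Client("tcp://scheduler-address:8786")  # placeholder address

    def work(i):  # placeholder for the real task
        return i * 2

    futures = [client.submit(work, i) for i in range(1000)]
    ac = as_completed(futures)
    del futures  # drop our references so completed tasks can be released

    for future in ac:
        result = future.result()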
Full scheduler logs related to this renegade worker, which has quite a 🎢 minute at 18:58:
This seems like it could be related to #3246 but I'm not sure. @fjetter, is any of this reminiscent of the behavior that prompted that PR?