Tasks lost by cluster rescale during stealing #3892
Comments
cc @leej3 who finished up #3069. One thing people could do here is to verify that it is impossible for the victim of work stealing (also called […]). @bnaul, on your end it might be useful to learn more about what happened to […].
Thanks @mrocklin, I'll take a look at that the next time this happens. The context here is that this job frequently causes workers to OOM and get evicted, so many get killed in one sweep by the GKE orchestrator. One other thing I hadn't noticed before is that the worker logs show a lot of entries like […], which seems like a more straightforward error, but not one that I would expect to cause tasks to be lost.
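A minimal sketch of pulling those worker logs through the client rather than the GKE console (the scheduler address and the `n`/`nanny` arguments below are illustrative placeholders, not values from this thread):

```python
from dask.distributed import Client

# Placeholder scheduler address; substitute the real one for the cluster.
client = Client("tcp://scheduler-address:8786")

# Pull recent log lines from each worker (and its nanny process), which is
# where eviction/OOM-related messages tend to appear; n limits how many
# lines are returned per worker.
logs = client.get_worker_logs(n=100, nanny=True)
for address, entries in logs.items():
    print(address)
    for entry in entries:
        print("   ", entry)
```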
Is there any workaround for this issue, or a way to debug what is happening? This happens quite consistently for me.
@amcpherson, for what it's worth, I have not seen this error in quite some time. While we do still see occasional scheduler-worker communication lapses like this, this specific […]
If the issue still occurs, a new traceback and the corresponding version number would help. We also added functionality to extract a dump of the entire cluster state for debugging; see http://distributed.dask.org/en/stable/api.html#distributed.Client.dump_cluster_state. If it's not the same error, I would prefer to open a new ticket (and close this one).
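For reference, a minimal sketch of the cluster-dump call linked above (the filename and format are arbitrary example choices, not from this thread):

```python
from dask.distributed import Client

# Placeholder scheduler address; substitute the real one for the cluster.
client = Client("tcp://scheduler-address:8786")

# Write a snapshot of scheduler and worker state to disk so that stuck
# tasks can be inspected offline after the fact.
client.dump_cluster_state(filename="cluster-dump", format="msgpack")
```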
I have had some successful runs after increasing many of the timeouts. I will update if I see the error again, thanks!
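The comment does not say which timeouts were raised; as an assumed example (not the commenter's actual settings), the comm timeouts can be increased through the Dask config like so:

```python
import dask

# Example values only; the comment above does not specify which timeouts
# were increased or by how much.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",
    "distributed.comm.timeouts.tcp": "60s",
})
```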
Seems similar to #3256, which was eventually fixed by #3321; we're now seeing the following scheduler logs: […]
Looks possibly related to #3069 based on the lines that were changed there (our tasks do not use resources though). I believe the error occurs when a worker goes down while stealing is underway, but it's not easy to reproduce without a very large job.
Weirdly I'm actually seeing two different symptoms, which might mean there are actually two bugs here:

- Tasks show as `processing`, but the worker call stacks are empty and nothing ever happens
- Tasks stay `waiting` indefinitely and never reach the `processing` phase at all

cc @seibert from that PR and also @fjetter from #3619 just in case either of y'all have any theories 🙂
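As a rough illustration of the "worker goes down while stealing is underway" scenario described above, the sketch below (with made-up task counts and sleep durations, and a graceful `retire_workers` call standing in for an abrupt GKE eviction) shows one way it could be poked at locally:

```python
import time
from dask.distributed import Client, LocalCluster

def slow(x):
    time.sleep(1)
    return x

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)

    # Enough tasks that the scheduler has an incentive to rebalance/steal work.
    futures = client.map(slow, range(500))
    time.sleep(5)  # give stealing a chance to start

    # Take one worker away mid-run. A real GKE eviction is abrupt, whereas
    # retire_workers is graceful, so this only approximates the failure mode.
    victim = next(iter(client.scheduler_info()["workers"]))
    client.retire_workers([victim])

    # If the bug triggers, some futures may stay in processing/waiting forever
    # and this gather will hang.
    results = client.gather(futures)
    print(len(results))
```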