
Tasks lost by cluster rescale during stealing #3892

Open
bnaul opened this issue Jun 13, 2020 · 6 comments

Comments

@bnaul
Contributor

bnaul commented Jun 13, 2020

Seems similar to #3256, which was eventually fixed by #3321: we're now seeing the following scheduler logs:

tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealin
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 391, in balance
    level, ts, sat, thief, duration, cost_multiplier
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 291, in maybe_move_task
    self.move_task_request(ts, sat, idl)
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 167, in move_task_request
    self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://10.24.81.35:39991'

Looks possibly related to #3069 based on the lines that were changed there (our tasks do not use resources, though). I believe the error occurs when a worker goes down while stealing is underway, but it's not easy to reproduce without a very large job.

Weirdly, I'm seeing two different symptoms, which might mean there are actually two separate bugs here:

  • sometimes the tasks show up on the worker info page as processing, but the worker call stacks are empty and nothing ever happens
  • sometimes the tasks simply show as waiting indefinitely and never reach the processing state at all

cc @seibert from that PR and also @fjetter from #3619 just in case either of y'all have any theories 🙂

@mrocklin
Member

cc @leej3 who finished up #3069

One thing people could do here is verify that it is impossible for the victim of work stealing (also called sat, for saturated, in that code) to be missing from scheduler.stream_comms when a steal is requested. I started going through this and things seemed OK to me, but I was brief and more eyes here would be good.
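As a rough illustration only (this is not the actual distributed code), the kind of defensive check being discussed would look something like the sketch below, with a plain dict standing in for scheduler.stream_comms:

# Illustrative guard, assuming stream_comms maps worker address -> comm object.
stream_comms = {}  # stands in for scheduler.stream_comms

def request_steal(victim_address, msg):
    comm = stream_comms.get(victim_address)
    if comm is None:
        # The victim disappeared between balance() and the steal request
        # (e.g. the worker was killed mid-steal); skip it rather than
        # letting the lookup raise KeyError.
        return False
    comm.send(msg)
    return True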

@bnaul, on your end it might be useful to learn more about what happened to tcp://10.24.81.35:39991. You may be interested in checking out dask_scheduler.events["tcp://10.24.81.35:39991"].
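For reference, a minimal sketch of pulling those events out from the client side via Client.run_on_scheduler; the scheduler address is a placeholder, and the worker address is just the one from the traceback above:

from distributed import Client

client = Client("tcp://<scheduler-address>:8786")  # placeholder address

def worker_events(dask_scheduler, address="tcp://10.24.81.35:39991"):
    # Scheduler.events maps worker addresses (among other topics) to a deque
    # of (timestamp, message) records kept by the scheduler.
    return list(dask_scheduler.events.get(address, []))

print(client.run_on_scheduler(worker_events))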

@bnaul
Contributor Author

bnaul commented Jun 14, 2020

Thanks @mrocklin, I'll take a look at that the next time this happens. The context here is that this job frequently causes workers to OOM and get evicted, so many workers are killed in one sweep by the GKE orchestrator.

One other thing that I hadn't noticed is that the worker logs show a lot of things like

2020-06-13T05:26:10.248029950Z: brett-tnc-reroute-daskworkers-68ccff6949-nbmsp distributed.worker - ERROR - Worker stream died during communication: tcp://10.24.2.14:45539
2020-06-13T05:26:10.248079428Z: brett-tnc-reroute-daskworkers-68ccff6949-nbmsp Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 234, in connect
    _raise(error)
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 215, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.24.2.14:45539' after 300 s: in <distributed.comm.tcp.TCPConnector object at 0x7ff21441f7d0>: ConnectionRefusedError: [Errno 111] Connection refused
2020-06-13T05:26:10.248161480Z: brett-tnc-reroute-daskworkers-68ccff6949-nbmsp Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 1980, in gather_dep
    self.rpc, deps, worker, who=self.address
  File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 3251, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
    operation=operation,
  File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 3228, in _get_data
    comm = await rpc.connect(worker)
  File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 958, in connect
    **self.connection_args,
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 245, in connect
    _raise(error)
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 215, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.24.2.14:45539' after 300 s: Timed out trying to connect to 'tcp://10.24.2.14:45539' after 300 s: in <distributed.comm.tcp.TCPConnector object at 0x7ff21441f7d0>: ConnectionRefusedError: [Errno 111] Connection refused

which seems like a more straightforward error, but not one that I would expect to cause tasks to be lost.

@amcpherson

Is there any workaround for this issue or a way to debug what is happening? This happens quite consistently for me.

@bnaul
Contributor Author

bnaul commented Jan 14, 2022

@amcpherson for what it's worth, I have not seen this error in quite some time. While we do still see occasional scheduler-worker communication lapses like this, this specific KeyError no longer appears. If you're seeing this on a recent version of distributed, then any other details you can provide would be more relevant at this point than my now-pretty-old traceback.

@fjetter
Member

fjetter commented Jan 18, 2022

If the issue still occurs, a new traceback and the corresponding distributed version number would help. We also added some functionality to extract a dump of the entire cluster state for debugging; see http://distributed.dask.org/en/stable/api.html#distributed.Client.dump_cluster_state
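For example, a minimal sketch of capturing such a dump from a client; the scheduler address and output name are placeholders, and argument names may vary slightly between distributed versions:

from distributed import Client

client = Client("tcp://<scheduler-address>:8786")  # placeholder address

# Writes a compressed archive of scheduler and worker state that can be
# attached to an issue for debugging.
client.dump_cluster_state("cluster-dump")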

If it's not the same error, I would prefer that we open a new ticket (and close this one).

@amcpherson

I have had some successful runs after increasing many of the timeouts. I will update if I see the error again, thanks!
