
Tasks lost by cluster rescale during stealing #3892

Open
bnaul opened this issue Jun 13, 2020 · 6 comments

Comments

@bnaul
Contributor

bnaul commented Jun 13, 2020

Seems similar to #3256, which was eventually fixed by #3321: we're now seeing the following scheduler logs:

tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealin
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 391, in balance
    level, ts, sat, thief, duration, cost_multiplier
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 291, in maybe_move_task
    self.move_task_request(ts, sat, idl)
  File "/usr/local/lib/python3.7/site-packages/distributed/stealing.py", line 167, in move_task_request
    self.scheduler.stream_comms[victim.address].send(
KeyError: 'tcp://10.24.81.35:39991'

Looks possibly related to #3069 based on the lines that were changed there (our tasks do not use resources, though). I believe the error occurs when a worker goes down while stealing is underway, but it's not easy to reproduce without a very large job.

Weirdly, I'm seeing two different symptoms, which might mean there are actually two separate bugs here:

  • sometimes the tasks show up on the worker info page as processing, but the worker call stacks are empty and nothing ever happens
  • sometimes the tasks simply show as waiting indefinitely and never reach the processing state at all

cc @seibert from that PR and also @fjetter from #3619 just in case either of y'all have any theories 🙂

@mrocklin
Member

cc @leej3 who finished up #3069

One thing people could do here is verify that it is impossible for the victim of work stealing (also called sat, for saturated, in that code) to be missing from scheduler.stream_comms when a steal is requested. I started going through this and things seemed OK to me, but I was brief and more eyes here would be good.
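As a rough illustration only (this is not the actual distributed code), the kind of defensive check being discussed would look something like the sketch below, with a plain dict standing in for scheduler.stream_comms:

# Illustrative guard, assuming stream_comms maps worker address -> comm object.
stream_comms = {}  # stands in for scheduler.stream_comms

def request_steal(victim_address, msg):
    comm = stream_comms.get(victim_address)
    if comm is None:
        # The victim disappeared between balance() and the steal request
        # (e.g. the worker was killed mid-steal); skip it rather than
        # letting the lookup raise KeyError.
        return False
    comm.send(msg)
    return True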

@bnaul, on your end it might be useful to learn more about what happened to tcp://10.24.81.35:39991. You may be interested in checking out dask_scheduler.events["tcp://10.24.81.35:39991"].
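For reference, a minimal sketch of pulling those events out from the client side via Client.run_on_scheduler; the scheduler address is a placeholder, and the worker address is just the one from the traceback above:

from distributed import Client

client = Client("tcp://<scheduler-address>:8786")  # placeholder address

def worker_events(dask_scheduler, address="tcp://10.24.81.35:39991"):
    # Scheduler.events maps worker addresses (among other topics) to a deque
    # of (timestamp, message) records kept by the scheduler.
    return list(dask_scheduler.events.get(address, []))

print(client.run_on_scheduler(worker_events))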

@bnaul
Contributor Author

bnaul commented Jun 14, 2020

Thanks @mrocklin, I'll take a look at that the next time this happens. The context here is that this job frequently causes workers to OOM and get evicted, so many workers are killed in one sweep by the GKE orchestrator.

One other thing that I hadn't noticed is that the worker logs show a lot of things like

2020-06-13T05:26:10.248029950Z: brett-tnc-reroute-daskworkers-68ccff6949-nbmsp distributed.worker - ERROR - Worker stream died during communication: tcp://10.24.2.14:45539
2020-06-13T05:26:10.248079428Z: brett-tnc-reroute-daskworkers-68ccff6949-nbmsp Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 234, in connect
    _raise(error)
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 215, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.24.2.14:45539' after 300 s: in <distributed.comm.tcp.TCPConnector object at 0x7ff21441f7d0>: ConnectionRefusedError: [Errno 111] Connection refused
2020-06-13T05:26:10.248161480Z: brett-tnc-reroute-daskworkers-68ccff6949-nbmsp Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 1980, in gather_dep
    self.rpc, deps, worker, who=self.address
  File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 3251, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 390, in retry_operation
    operation=operation,
  File "/usr/local/lib/python3.7/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 3228, in _get_data
    comm = await rpc.connect(worker)
  File "/usr/local/lib/python3.7/site-packages/distributed/core.py", line 958, in connect
    **self.connection_args,
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 245, in connect
    _raise(error)
  File "/usr/local/lib/python3.7/site-packages/distributed/comm/core.py", line 215, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.24.2.14:45539' after 300 s: Timed out trying to connect to 'tcp://10.24.2.14:45539' after 300 s: in <distributed.comm.tcp.TCPConnector object at 0x7ff21441f7d0>: ConnectionRefusedError: [Errno 111] Connection refused

which seems like a more straightforward error, but not one that I would expect to cause tasks to be lost.

@amcpherson

Is there any workaround for this issue or a way to debug what is happening? This happens quite consistently for me.

@bnaul
Contributor Author

bnaul commented Jan 14, 2022

@amcpherson for what it's worth, I have not seen this error in quite some time. While we do still see occasional scheduler-worker communication lapses like this, this specific KeyError no longer appears. If you're seeing this on a recent version of distributed, then any other details you can provide would be more relevant at this point than my now-pretty-old traceback.

@fjetter
Member

fjetter commented Jan 18, 2022

If the issue still occurs, a new traceback and the corresponding distributed version number would help. We also added some functionality to extract a dump of the entire cluster state for debugging; see http://distributed.dask.org/en/stable/api.html#distributed.Client.dump_cluster_state
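For example, a minimal sketch of capturing such a dump from a client; the scheduler address and output name are placeholders, and argument names may vary slightly between distributed versions:

from distributed import Client

client = Client("tcp://<scheduler-address>:8786")  # placeholder address

# Writes a compressed archive of scheduler and worker state that can be
# attached to an issue for debugging.
client.dump_cluster_state("cluster-dump")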

If it's not the same error, I would prefer that we open a new ticket (and close this one).

@amcpherson

I have had some successful runs after increasing many of the timeouts. I will update if I see the error again, thanks!
