
Task stuck processing on non-existent worker #3256

Closed
bnaul opened this issue Nov 21, 2019 · 10 comments

Comments

@bnaul
Contributor

bnaul commented Nov 21, 2019

At the end of long-running jobs, I'm often seeing one or two tasks that never finish. Looking a bit more closely, it seems that the task is processing on a non-existent worker:
[screenshot: the task shown as processing on a non-existent worker]

Even after closing every worker, the status of this task doesn't change.

Searching through the scheduler logs for this task I found:

2019-11-20T18:58:16.118979408Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.187.2:42561, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 9507)

Seems pretty fishy that a worker could steal a task from itself? 🤔

Some other context:

  • k8s cluster w/ ~1000 workers (before scaling to investigate the deadlock), around 100k tasks total (50k of these pandas tasks, and a processing task for each partition)
  • Our workers run OOM and get evicted somewhat frequently (every few hrs on average), and we use a lot of retries to compensate
  • We GC futures as we go using dask.as_completed(futures); del futures (a rough sketch of this pattern follows below)
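
For context, that release pattern looks roughly like the sketch below; this is only an illustration, and the scheduler address, process_partition, parts, and handle are placeholders rather than anything from the actual job:

    from dask.distributed import Client, as_completed

    client = Client("tcp://scheduler:8786")          # placeholder scheduler address
    futures = client.map(process_partition, parts)   # placeholder task function and inputs

    # Consume results as they finish, then drop every reference so the
    # scheduler can release the corresponding keys.
    for future in as_completed(futures):
        handle(future.result())                      # placeholder result handler
        future.release()                             # let go of this future right away
    del futures                                      # drop the remaining references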

Full scheduler logs related to this renegade worker, which has quite a 🎢 minute at 18:58:

2019-11-20T18:46:57.497189783Z distributed.scheduler - INFO - Register tcp://10.47.187.2:42561
2019-11-20T18:46:57.497605833Z distributed.scheduler - INFO - Starting worker compute stream, tcp://10.47.187.2:42561
2019-11-20T18:57:57.392080977Z distributed.scheduler - INFO - Remove worker tcp://10.47.187.2:42561
2019-11-20T18:57:57.392125386Z distributed.core - INFO - Removing comms to tcp://10.47.187.2:42561
2019-11-20T18:58:01.058431284Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.193.16:36713, Got: tcp://10.47.187.2:42561, Key: ('to_records-ff4aa1a02b819ae4fa0769294d47599b', 21371)
2019-11-20T18:58:01.059278416Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.7:38975, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 18520)
2019-11-20T18:58:01.059309058Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.197.2:41001, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 4917)
2019-11-20T18:58:01.05939125Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.191.7:39843, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 38984)
2019-11-20T18:58:01.059907822Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.197.16:43383, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 10328)
2019-11-20T18:58:01.060021135Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.191.4:45047, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 5609)
2019-11-20T18:58:01.060103572Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.189.7:33727, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 20321)
2019-11-20T18:58:01.060238991Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.196.17:46667, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 35456)
2019-11-20T18:58:01.060305807Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.190.3:44043, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 5880)
2019-11-20T18:58:01.060395695Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.197.9:34817, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 11345)
2019-11-20T18:58:01.060545848Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.190.4:40125, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 3060)
2019-11-20T18:58:01.060574243Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.11:45013, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 7430)
2019-11-20T18:58:01.060654418Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.12:43271, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 1786)
2019-11-20T18:58:01.06072858Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.196.16:36097, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 16601)
2019-11-20T18:58:01.060834564Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.199.2:34239, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 20394)
2019-11-20T18:58:01.06088221Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.197.17:34623, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 30949)
2019-11-20T18:58:01.06097191Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.6:45873, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 40418)
2019-11-20T18:58:01.06109409Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.198.13:38995, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 38647)
2019-11-20T18:58:01.061636897Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.191.5:37251, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 22807)
2019-11-20T18:58:01.064454821Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.199.9:44443, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 29398)
2019-11-20T18:58:01.064629746Z distributed.scheduler - INFO - Register tcp://10.47.187.2:42561
2019-11-20T18:58:01.064964628Z distributed.scheduler - INFO - Starting worker compute stream, tcp://10.47.187.2:42561
2019-11-20T18:58:16.118979408Z distributed.scheduler - INFO - Unexpected worker completed task, likely due to work stealing.  Expected: tcp://10.47.187.2:42561, Got: tcp://10.47.187.2:42561, Key: ('from_pandas-f8d9cff3cb70c117a611b7cf40be25d3', 9507)
2019-11-21T02:44:03.134375618Z distributed.scheduler - INFO - Remove worker tcp://10.47.187.2:42561
2019-11-21T02:44:03.134422097Z distributed.core - INFO - Removing comms to tcp://10.47.187.2:42561
2019-11-21T18:06:51.328850288Z distributed.utils - ERROR - Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: connect() didn't finish in time
2019-11-21T18:06:51.328916264Z OSError: Timed out trying to connect to 'tcp://10.47.187.2:42561' after 300 s: connect() didn't finish in time

This seems like it could be related to #3246 but I'm not sure, @fjetter any of this reminiscent of the behavior that prompted that PR?

@TomAugspurger
Member

That log message is emitted under this condition:

            if ws is not ts.processing_on:  # someone else has this task
                logger.info(
                    "Unexpected worker completed task, likely due to"
                    " work stealing.  Expected: %s, Got: %s, Key: %s",
                    ts.processing_on,
                    ws,
                    key,
                )
                return {}

We're doing an identity, rather than equality, check there. At a glance, I don't see any other places where we do that.
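
As a minimal sketch of why an identity (is) check can fail even when both addresses print identically (the class below is a simplified stand-in, not the real WorkerState): if the worker was removed and then re-registered, the scheduler would presumably hold a brand-new state object for the same address.

    # Simplified stand-in, only to illustrate identity vs. equality.
    class WorkerState:
        def __init__(self, address):
            self.address = address

        def __repr__(self):
            return self.address

    old = WorkerState("tcp://10.47.187.2:42561")   # state before "Remove worker"
    new = WorkerState("tcp://10.47.187.2:42561")   # state after re-registration

    print(old, new)      # both print tcp://10.47.187.2:42561, as in the log line above
    print(old is new)    # False: different objects, even though the addresses match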

cc @mrocklin if you have any guesses.

@fjetter
Member

fjetter commented Nov 22, 2019

I don't think this is connected to #3246. The issue there relates to connection failures and occurs after the task has actually finished.

@bnaul
Contributor Author

bnaul commented Nov 22, 2019

You're probably right, but I would add that in this case the task also finishes and there's a connection failure (caused by the worker dying, but same difference right?). So at the very least it seems like that change might have an effect here.

@mrocklin
Member

@bnaul how easy is it for you to reproduce this failure? If it's somewhat easy then you might want to try setting validate=True on the scheduler. That might help us to pinpoint the cause.

(adding this as a configuration option here: #3258)

If there is a task that is processing on a worker that doesn't exist then that's obviously a bug. We haven't seen bugs like this for a while, but I guess they still have to happen from time to time.

Another action to diagnose this would be to turn off work-stealing and see if the problem goes away. That might help isolate things to work-stealing.
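
For reference, both knobs can be set through the dask config. This is only a sketch: the key names below assume a distributed release that includes the validate option from #3258, and the exact names may differ across versions, so check your version's config schema.

    import dask

    # Assumed config keys; verify against your distributed.yaml schema.
    dask.config.set({
        "distributed.scheduler.validate": True,        # extra internal consistency checks
        "distributed.scheduler.work-stealing": False,  # rule work stealing in or out
    })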

@bnaul
Contributor Author

bnaul commented Nov 22, 2019

Not easy at all unfortunately; it seems to be a pretty rare event (say once every 1000 hrs of worker compute time for our workload). I also suspect that work stealing is probably the culprit, but even after turning it off it'll probably be a while before I can say w/ much confidence that it's "resolved".

Also happy to try some runs with validate=True if you think the performance/stability impact isn't too big; I'll still need to execute 10,000s of tasks to get into the problematic situation in the first place.

@bnaul
Contributor Author

bnaul commented Nov 22, 2019

Also @mrocklin, we kept the deadlocked scheduler pod around to keep testing things; if there's any other information about the internal state that would be helpful, let me know and we'll add it here.

@mrocklin
Member

Well, if you wanted to, you could look at scheduler.story(key) and also scheduler.events[worker_address] and see if you can reconstruct the story from there (although you may have already done this from the logs above).
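
If a client can still reach the deadlocked scheduler, one way to pull that out is Client.run_on_scheduler. A sketch under those assumptions follows; the scheduler address is a placeholder, while the key and worker address are copied from the logs above:

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")  # placeholder address of the stuck scheduler

    stuck_key = ("from_pandas-f8d9cff3cb70c117a611b7cf40be25d3", 9507)
    worker_addr = "tcp://10.47.187.2:42561"

    # Run a function inside the scheduler process and ship the result back.
    story = client.run_on_scheduler(
        lambda dask_scheduler: dask_scheduler.story(stuck_key)
    )
    events = client.run_on_scheduler(
        lambda dask_scheduler: list(dask_scheduler.events[worker_addr])
    )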

@mrocklin
Member

Also happy to try some runs with validate=True if you think the performance/stability impact isn't too big

Validate is expensive, but honestly I don't have a sense for how expensive it is when run in a real-world setting. We mostly use it for testing, once we've isolated things down to a small case.

@bnaul
Contributor Author

bnaul commented Jun 13, 2020

For the record, this did end up being fixed by #3321.

@bnaul bnaul closed this as completed Jun 13, 2020
@mrocklin
Member

mrocklin commented Jun 13, 2020 via email
