Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support receiving stimulus_id's in Scheduler.reschedule #6307

Closed

Conversation

sjperkins
Copy link
Member

@sjperkins sjperkins commented May 9, 2022

In the log output e.g. https://github.com/dask/distributed/runs/6348164203?check_suite_focus=true for d7549b0

the following errors are occuring but are silently ignored:

2022-05-09T08:38:04.1870734Z 2022-05-09 08:36:42,873 - distributed.core - ERROR - reschedule() got an unexpected keyword argument 'stimulus_id'
2022-05-09T08:38:04.1870845Z Traceback (most recent call last):
2022-05-09T08:38:04.1871030Z   File "/home/runner/work/distributed/distributed/distributed/core.py", line 651, in handle_stream
2022-05-09T08:38:04.1871136Z     handler(**merge(extra, msg))
2022-05-09T08:38:04.1871388Z TypeError: reschedule() got an unexpected keyword argument 'stimulus_id'
2022-05-09T08:38:04.1871819Z 2022-05-09 08:36:42,873 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:33919', name: 0, status: running, memory: 0, processing: 12>
2022-05-09T08:38:04.1872097Z 2022-05-09 08:36:42,873 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:33919
2022-05-09T08:38:04.1872525Z 2022-05-09 08:36:42,878 - distributed.core - ERROR - Exception while handling op register-worker
2022-05-09T08:38:04.1872639Z Traceback (most recent call last):
2022-05-09T08:38:04.1872833Z   File "/home/runner/work/distributed/distributed/distributed/core.py", line 585, in handle_comm
2022-05-09T08:38:04.1872910Z     result = await result
2022-05-09T08:38:04.1873097Z   File "/home/runner/work/distributed/distributed/distributed/utils.py", line 759, in wrapper
2022-05-09T08:38:04.1873201Z     return await func(*args, **kwargs)
2022-05-09T08:38:04.1873400Z   File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 3829, in add_worker
2022-05-09T08:38:04.1873572Z     await self.handle_worker(comm=comm, worker=address, stimulus_id=stimulus_id)
2022-05-09T08:38:04.1873775Z   File "/home/runner/work/distributed/distributed/distributed/scheduler.py", line 4924, in handle_worker
2022-05-09T08:38:04.1873926Z     await self.handle_stream(comm=comm, extra={"worker": worker})
2022-05-09T08:38:04.1874122Z   File "/home/runner/work/distributed/distributed/distributed/core.py", line 651, in handle_stream
2022-05-09T08:38:04.1874210Z     handler(**merge(extra, msg))
2022-05-09T08:38:04.1874465Z TypeError: reschedule() got an unexpected keyword argument 'stimulus_id'

worker_state_machine.RescheduleMsg always sends stimulus_ids so this should be handled.

  • Tests added / passed
  • Passes pre-commit run --all-files

@sjperkins sjperkins requested a review from crusaderky May 9, 2022 10:56
@github-actions
Copy link
Contributor

github-actions bot commented May 9, 2022

Unit Test Results

       16 files  ±    0         16 suites  ±0   6h 59m 45s ⏱️ - 45m 51s
  2 769 tests +    2    2 688 ✔️ +    2    78 💤  -   2  3 +2 
21 300 runs   - 798  20 310 ✔️  - 766  987 💤  - 34  3 +2 

For more details on these failures, see this check.

Results for commit f2f3a39. ± Comparison against base commit 8411c2d.

♻️ This comment has been updated with latest results.

@@ -6697,7 +6697,7 @@ async def get_story(self, keys=()):

transition_story = story

def reschedule(self, key=None, worker=None):
def reschedule(self, key=None, worker=None, stimulus_id=None):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not counting tests, I can see that this method is invoked only

  • from the worker, through RPC
  • directly, from stealing.py

In both cases a stimulus_id is available. So you could change this to mandatory.

@crusaderky
Copy link
Collaborator

crusaderky commented May 9, 2022

So the Scheduler.reschedule method is never successful when invoked from the worker, and yet no tests break.
In other words, at some point the status of the task stopped transitioning from processing to released on the scheduler when it transitions from executing/long-running to rescheduled on the worker, and nobody noticed.

What's the impact of having it work again? Could it be that the scheduler-side transition was unneeded to begin with? Or (at the other extreme) would this fix a potential deadlock?

I think understanding this is way more important than the fix itself (which is straightforward).

CC @fjetter @graingert @gjoseph92

@crusaderky
Copy link
Collaborator

With your latest changes, the two calls from stealing.py won't work anymore - and yet all tests still pass.

@crusaderky
Copy link
Collaborator

Related: #6332

@crusaderky
Copy link
Collaborator

Superseded by #6339

@crusaderky crusaderky closed this May 13, 2022
@crusaderky
Copy link
Collaborator

Follow up on #6340

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants