Fix _update_scheduler_info hanging failed tests #7225

Merged
merged 1 commit into dask:main on Nov 1, 2022

Conversation

gjoseph92 (Collaborator)

I've noticed on a few occasions that when an assertion fails in a gen_cluster test, the test sometimes doesn't fail right away, but hangs for a long time, throwing errors out of the _update_scheduler_info PeriodicCallback.

I don't understand why this is possible, or why a worker client is connecting to the scheduler after the worker has shut down.

But this change makes the tests fail right away instead of hanging, which is a much nicer development experience.
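For context, the tracebacks below come from the client's `_update_scheduler_info` periodic callback calling `self.scheduler.identity()` after the scheduler comm handle has already been set to `None` during teardown. A minimal sketch of the defensive pattern that avoids the repeated `AttributeError` (this is an illustration with a toy `MockClient` class, not the actual diff in this PR, which may take a different approach):

```python
import asyncio


class MockClient:
    """Toy stand-in for distributed.Client, illustrating the guard."""

    def __init__(self):
        # rpc handle to the scheduler; set to None once the comm is closed
        self.scheduler = None
        self._scheduler_identity = {}

    async def _update_scheduler_info(self):
        # Guard: the periodic callback can fire after the scheduler comm
        # has been torn down, so return early instead of raising
        # AttributeError on None every interval.
        if self.scheduler is None:
            return
        self._scheduler_identity = await self.scheduler.identity()


client = MockClient()
asyncio.run(client._update_scheduler_info())  # returns quietly, no AttributeError
```

Without the `None` check, each firing of the callback reproduces the `'NoneType' object has no attribute 'identity'` error shown in the logs below.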

2022-10-28 15:29:15,285 - distributed.scheduler - INFO - State start
2022-10-28 15:29:15,288 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:50287
2022-10-28 15:29:15,288 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:50286
2022-10-28 15:29:15,293 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:50288
2022-10-28 15:29:15,293 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:50288
2022-10-28 15:29:15,293 - distributed.worker - INFO -           Worker name:                          0
2022-10-28 15:29:15,293 - distributed.worker - INFO -          dashboard at:            127.0.0.1:50289
2022-10-28 15:29:15,293 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:50287
2022-10-28 15:29:15,293 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 15:29:15,293 - distributed.worker - INFO -               Threads:                          1
2022-10-28 15:29:15,293 - distributed.worker - INFO -                Memory:                  32.00 GiB
2022-10-28 15:29:15,293 - distributed.worker - INFO -       Local Directory: /var/folders/rs/wdnmv5lj02x7sh19rg3nyfyr0000gn/T/dask-worker-space/worker-iwvi60rr
2022-10-28 15:29:15,293 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 15:29:15,603 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:50288', name: 0, status: init, memory: 0, processing: 0>
2022-10-28 15:29:15,603 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:50288
2022-10-28 15:29:15,603 - distributed.core - INFO - Starting established connection
2022-10-28 15:29:15,604 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:50287
2022-10-28 15:29:15,604 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 15:29:15,604 - distributed.core - INFO - Starting established connection
2022-10-28 15:29:15,621 - distributed.scheduler - INFO - Receive client connection: Client-92b641aa-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,621 - distributed.core - INFO - Starting established connection
Dumped cluster state to test_cluster_dump/test_annotate_persist.yaml
2022-10-28 15:29:15,790 - distributed.scheduler - INFO - Remove client Client-92b641aa-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,790 - distributed.scheduler - INFO - Remove client Client-92b641aa-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,791 - distributed.scheduler - INFO - Close client connection: Client-92b641aa-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,793 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:50288. Reason: worker-close
2022-10-28 15:29:15,797 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:50288', name: 0, status: closing, memory: 0, processing: 1>
2022-10-28 15:29:15,797 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:50288
2022-10-28 15:29:15,797 - distributed.scheduler - INFO - Lost all workers
2022-10-28 15:29:15,797 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-301543c9-60d4-4938-8cec-8c7fb956dfe0 Address tcp://127.0.0.1:50288 Status: Status.closing
2022-10-28 15:29:15,798 - distributed.scheduler - INFO - Receive client connection: Client-worker-92d14b1c-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,799 - distributed.core - INFO - Starting established connection
2022-10-28 15:29:17,800 - tornado.application - ERROR - Exception in callback <bound method Client._update_scheduler_info of <Client: No scheduler connected>>
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/compatibility.py", line 163, in _run
    await val
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1367, in _update_scheduler_info
    self._scheduler_identity = SchedulerInfo(await self.scheduler.identity())
AttributeError: 'NoneType' object has no attribute 'identity'
2022-10-28 15:29:19,802 - tornado.application - ERROR - Exception in callback <bound method Client._update_scheduler_info of <Client: No scheduler connected>>
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/compatibility.py", line 163, in _run
    await val
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1367, in _update_scheduler_info
    self._scheduler_identity = SchedulerInfo(await self.scheduler.identity())
AttributeError: 'NoneType' object has no attribute 'identity'
2022-10-28 15:29:21,801 - tornado.application - ERROR - Exception in callback <bound method Client._update_scheduler_info of <Client: No scheduler connected>>
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/compatibility.py", line 163, in _run
    await val
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1367, in _update_scheduler_info
    self._scheduler_identity = SchedulerInfo(await self.scheduler.identity())
AttributeError: 'NoneType' object has no attribute 'identity'

...

and so on, either for 30s or until I ctrl-C

cc @jacobtomlinson @jrbourbeau

  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions (Contributor)

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

  • 15 files (±0), 15 suites (±0), runtime 6h 13m 37s (-27m 42s)
  • 3 165 tests (±0): 3 080 passed ✔️ (+2), 83 skipped 💤 (±0), 2 failed ❌ (-1)
  • 23 416 runs (-1): 22 514 passed ✔️ (+6), 900 skipped 💤 (-2), 2 failed ❌ (-4)

For more details on these failures, see this check.

Results for commit bdfefeb. ± Comparison against base commit f31fbde.

@jacobtomlinson (Member) left a comment:

Strange, but definitely is an improvement.

@gjoseph92 gjoseph92 merged commit 5a14053 into dask:main Nov 1, 2022
@gjoseph92 gjoseph92 deleted the update-scheduler-info-no-rpc branch November 1, 2022 16:02