Fix _update_scheduler_info hanging failed tests #7225

Merged
merged 1 commit into dask:main on Nov 1, 2022

Conversation

gjoseph92 (Collaborator)

I've noticed on a few occasions that when an assertion fails in a gen_cluster test, the test sometimes doesn't fail right away, but hangs for a long time, throwing errors out of the _update_scheduler_info PeriodicCallback.

I don't understand why this is possible, or why a worker client is connecting to the scheduler after the worker has shut down.

But this change makes the tests fail right away instead of hanging, which is a much nicer development experience.
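For context, the tracebacks below come from the client's `_update_scheduler_info` periodic callback calling `self.scheduler.identity()` after the scheduler comm handle has already been set to `None` during teardown. A minimal sketch of the defensive pattern that avoids the repeated `AttributeError` (this is an illustration with a toy `MockClient` class, not the actual diff in this PR, which may take a different approach):

```python
import asyncio


class MockClient:
    """Toy stand-in for distributed.Client, illustrating the guard."""

    def __init__(self):
        # rpc handle to the scheduler; set to None once the comm is closed
        self.scheduler = None
        self._scheduler_identity = {}

    async def _update_scheduler_info(self):
        # Guard: the periodic callback can fire after the scheduler comm
        # has been torn down, so return early instead of raising
        # AttributeError on None every interval.
        if self.scheduler is None:
            return
        self._scheduler_identity = await self.scheduler.identity()


client = MockClient()
asyncio.run(client._update_scheduler_info())  # returns quietly, no AttributeError
```

Without the `None` check, each firing of the callback reproduces the `'NoneType' object has no attribute 'identity'` error shown in the logs below.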

2022-10-28 15:29:15,285 - distributed.scheduler - INFO - State start
2022-10-28 15:29:15,288 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:50287
2022-10-28 15:29:15,288 - distributed.scheduler - INFO -   dashboard at:           127.0.0.1:50286
2022-10-28 15:29:15,293 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:50288
2022-10-28 15:29:15,293 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:50288
2022-10-28 15:29:15,293 - distributed.worker - INFO -           Worker name:                          0
2022-10-28 15:29:15,293 - distributed.worker - INFO -          dashboard at:            127.0.0.1:50289
2022-10-28 15:29:15,293 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:50287
2022-10-28 15:29:15,293 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 15:29:15,293 - distributed.worker - INFO -               Threads:                          1
2022-10-28 15:29:15,293 - distributed.worker - INFO -                Memory:                  32.00 GiB
2022-10-28 15:29:15,293 - distributed.worker - INFO -       Local Directory: /var/folders/rs/wdnmv5lj02x7sh19rg3nyfyr0000gn/T/dask-worker-space/worker-iwvi60rr
2022-10-28 15:29:15,293 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 15:29:15,603 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://127.0.0.1:50288', name: 0, status: init, memory: 0, processing: 0>
2022-10-28 15:29:15,603 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:50288
2022-10-28 15:29:15,603 - distributed.core - INFO - Starting established connection
2022-10-28 15:29:15,604 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:50287
2022-10-28 15:29:15,604 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 15:29:15,604 - distributed.core - INFO - Starting established connection
2022-10-28 15:29:15,621 - distributed.scheduler - INFO - Receive client connection: Client-92b641aa-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,621 - distributed.core - INFO - Starting established connection
Dumped cluster state to test_cluster_dump/test_annotate_persist.yaml
2022-10-28 15:29:15,790 - distributed.scheduler - INFO - Remove client Client-92b641aa-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,790 - distributed.scheduler - INFO - Remove client Client-92b641aa-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,791 - distributed.scheduler - INFO - Close client connection: Client-92b641aa-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,793 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:50288. Reason: worker-close
2022-10-28 15:29:15,797 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://127.0.0.1:50288', name: 0, status: closing, memory: 0, processing: 1>
2022-10-28 15:29:15,797 - distributed.core - INFO - Removing comms to tcp://127.0.0.1:50288
2022-10-28 15:29:15,797 - distributed.scheduler - INFO - Lost all workers
2022-10-28 15:29:15,797 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-301543c9-60d4-4938-8cec-8c7fb956dfe0 Address tcp://127.0.0.1:50288 Status: Status.closing
2022-10-28 15:29:15,798 - distributed.scheduler - INFO - Receive client connection: Client-worker-92d14b1c-5707-11ed-9c20-acde48001122
2022-10-28 15:29:15,799 - distributed.core - INFO - Starting established connection
2022-10-28 15:29:17,800 - tornado.application - ERROR - Exception in callback <bound method Client._update_scheduler_info of <Client: No scheduler connected>>
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/compatibility.py", line 163, in _run
    await val
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1367, in _update_scheduler_info
    self._scheduler_identity = SchedulerInfo(await self.scheduler.identity())
AttributeError: 'NoneType' object has no attribute 'identity'
2022-10-28 15:29:19,802 - tornado.application - ERROR - Exception in callback <bound method Client._update_scheduler_info of <Client: No scheduler connected>>
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/compatibility.py", line 163, in _run
    await val
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1367, in _update_scheduler_info
    self._scheduler_identity = SchedulerInfo(await self.scheduler.identity())
AttributeError: 'NoneType' object has no attribute 'identity'
2022-10-28 15:29:21,801 - tornado.application - ERROR - Exception in callback <bound method Client._update_scheduler_info of <Client: No scheduler connected>>
Traceback (most recent call last):
  File "/Users/gabe/dev/distributed/distributed/compatibility.py", line 163, in _run
    await val
  File "/Users/gabe/dev/distributed/distributed/client.py", line 1367, in _update_scheduler_info
    self._scheduler_identity = SchedulerInfo(await self.scheduler.identity())
AttributeError: 'NoneType' object has no attribute 'identity'

...

and so on, either for 30s or until I ctrl-C

cc @jacobtomlinson @jrbourbeau

  • Tests added / passed
  • Passes pre-commit run --all-files

@github-actions (Contributor)

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

  • 15 files (±0), 15 suites (±0), runtime 6h 13m 37s (-27m 42s)
  • 3 165 tests (±0): 3 080 passed ✔️ (+2), 83 skipped 💤 (±0), 2 failed ❌ (-1)
  • 23 416 runs (-1): 22 514 passed ✔️ (+6), 900 skipped 💤 (-2), 2 failed ❌ (-4)

For more details on these failures, see this check.

Results for commit bdfefeb. ± Comparison against base commit f31fbde.

@jacobtomlinson (Member) left a comment:

Strange, but definitely is an improvement.

@gjoseph92 gjoseph92 merged commit 5a14053 into dask:main Nov 1, 2022
@gjoseph92 gjoseph92 deleted the update-scheduler-info-no-rpc branch November 1, 2022 16:02