
Prevent infinite transition loops; more aggressive validate_state() #6318

Merged (4 commits into dask:main), May 12, 2022

Conversation

@crusaderky (Collaborator) commented May 10, 2022

Partially closes #6305

#6305 introduces an infinite loop in the worker transitions.
Such a loop never releases the GIL,
which means that the @gen_cluster timeout doesn't work,
which means that after 5 minutes pytest_timeout kicks in,
which means that you get no logs and no cluster dump whatsoever.

This PR:

  • Runs validate_state() at the end of every test for all workers, in addition to the scheduler. This excludes workers wrapped by nannies and workers not started by gen_cluster itself.
  • Runs validate_state() on the Scheduler and Workers in case of TimeoutError, hoping to get an error about invalid state instead of the much more opaque timeout message.
  • Implements a limit on the number of transitions a scheduler or a worker can go through before it breaks gracefully, after at most 30s (the gen_cluster timeout). When that happens, pytest records all logs and a full cluster dump is generated (see the sketch after this list).
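For illustration, a minimal sketch of the guard described in the last bullet, using a simplified stand-in class rather than the real Scheduler/Worker; the attribute names (transition_counter, transition_counter_max, validate) follow the snippet discussed further down in this review, everything else is made up.

# Hypothetical, simplified stand-in for the transition-counting guard
class StateMachine:
    def __init__(self, transition_counter_max=False, validate=True):
        self.transition_counter = 0
        # False (the config default) disables the limit; tests set a finite number
        self.transition_counter_max = transition_counter_max
        self.validate = validate

    def transition(self, *args):
        # Count every transition and break early instead of spinning forever
        self.transition_counter += 1
        if self.validate and self.transition_counter_max:
            assert self.transition_counter < self.transition_counter_max
        # ... actual transition logic would follow here ...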

@github-actions bot (Contributor) commented May 10, 2022

Unit Test Results

15 files (-1), 15 suites (-1), 7h 2m 0s duration (-27m 11s)
2 774 tests (+3): 2 695 passed (+4), 78 skipped (-1), 1 failed (±0)
20 580 runs (-1 550): 19 671 passed (-1 439), 908 skipped (-111), 1 failed (±0)

For more details on these failures, see this check.

Results for commit 35a5568. ± Comparison against base commit 9f02e7a.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky changed the title Prevent infinite transition loops Prevent infinite transition loops; more aggressive validate_state() May 11, 2022
@crusaderky crusaderky self-assigned this May 11, 2022
@crusaderky crusaderky marked this pull request as ready for review May 11, 2022 12:16
@crusaderky (Collaborator, Author):

https://github.com/dask/distributed/runs/6386832455?check_suite_focus=true now shows the issue in #6305
The other two test failures are unrelated.

Note: this PR is likely to increase flakiness, and I think that's a good thing, as it will surface a lot of previously undetected issues which are likely to cause deadlocks somewhere else.

crusaderky added a commit to crusaderky/distributed that referenced this pull request May 11, 2022
crusaderky added a commit to crusaderky/distributed that referenced this pull request May 11, 2022
@gjoseph92 (Collaborator) left a comment:


Though a little odd to expose in top-level config, this seems like a good thing to have and great to have in all tests.

To be clear, this doesn't actually fix #6248 (comment), just causes it to be caught more obviously in tests?

distributed/scheduler.py (outdated):
# catch potential infinite recursions
self.transition_counter += 1
if self.validate and self.transition_counter_max:
    assert self.transition_counter < self.transition_counter_max
Collaborator:

Might be nice if this raised something like a TransitionCounterMaxExceeded error, to be consistent with workers

Collaborator (Author):

yeah, but the scheduler doesn't have anything like the worker's InvalidTransitionError
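For context only, a rough sketch of what the suggestion above could look like; TransitionCounterMaxExceeded is just the name floated in the review and check_transition_counter is a made-up helper, not actual scheduler code.

# Hypothetical exception-based variant of the assert shown above
class TransitionCounterMaxExceeded(RuntimeError):
    """Raised when a state machine exceeds its configured transition limit."""


def check_transition_counter(counter, counter_max):
    # counter_max is False (disabled) or a positive integer, as in the config default
    if counter_max and counter >= counter_max:
        raise TransitionCounterMaxExceeded(
            f"{counter} transitions reached the configured limit of {counter_max}"
        )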

@crusaderky (Collaborator, Author):

To be clear, this doesn't actually fix #6248 (comment), just causes it to be caught more obviously in tests?

Correct

@crusaderky crusaderky merged commit d0fbba6 into dask:main May 12, 2022
@crusaderky crusaderky deleted the transition_counter_max branch May 12, 2022 10:53
@gjoseph92 (Collaborator):

@crusaderky can we make the 1-line fix for #6305 now, or do you want to see this take effect in CI first to confirm it's working?

@crusaderky (Collaborator, Author):

@crusaderky can we make the 1-line fix for #6305 now, or do you want to see this take effect in CI first to confirm it's working?

See #6327

@fjetter (Member) left a comment:


I would like to have a conversation about this. I think this limit should be removed again.

I don't see huge value in it and would prefer having fewer configuration options, even for tests alone. Whenever we write stress tests, etc., we'd need to adjust this limit (we can already see a few cases where the "default" limit appears not to be sufficient).

Comment on lines +280 to +283
# Cause scheduler and workers to break if they reach this many transitions.
# Used to debug infinite transition loops.
# Note: setting this will cause healthy long-running services to eventually break.
transition-counter-max: False
Member:

I think we should try not to clutter this file with stuff we only use for internal tests. Apart from tests, I don't see the usefulness of having a global limit on the number of transitions.

Collaborator (Author):

Indeed, this is for tests only.
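As a usage illustration only: a test could opt into the limit through dask's configuration system roughly as below. The "distributed.admin" prefix is an assumption (the hunk quoted above only shows the leaf key), and 500_000 is an arbitrary example value.

import dask

# Hypothetical test-only override; key path assumed, value chosen arbitrarily
with dask.config.set({"distributed.admin.transition-counter-max": 500_000}):
    ...  # start the cluster under test; state machines now fail fast on runaway loops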

Successfully merging this pull request may close these issues: test_stress_scatter_death