Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate and debug state machine on handle_compute_task #6327

Merged
merged 8 commits into from
May 13, 2022

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented May 11, 2022

@crusaderky crusaderky changed the title Infinite transition loop Infinite released->missing transition loop May 11, 2022
@crusaderky crusaderky self-assigned this May 11, 2022
@github-actions
Copy link
Contributor

github-actions bot commented May 12, 2022

Unit Test Results

       15 files  +       3         15 suites  +3   6h 58m 2s ⏱️ + 1h 9m 0s
  2 774 tests ±       0    2 694 ✔️ +     13    79 💤  -   12  1  - 1 
20 580 runs  +3 962  19 672 ✔️ +3 850  907 💤 +113  1  - 1 

For more details on these failures, see this check.

Results for commit 79402f2. ± Comparison against base commit 4b81f06.

♻️ This comment has been updated with latest results.

@@ -652,7 +652,7 @@ def __init__(
("ready", "released"): self.transition_generic_released,
("released", "error"): self.transition_generic_error,
("released", "fetch"): self.transition_released_fetch,
("released", "missing"): self.transition_released_fetch,
("released", "missing"): self.transition_generic_missing,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for dep_key, value in nbytes.items():
self.tasks[dep_key].nbytes = value

self.update_who_has(who_has)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This move prevents deps to be created in resumed state by ensure_task_exists and then remain there because there's nothing actually needing them.


if ts.state in READY | {"executing", "waiting", "resumed"}:
if ts.state in READY | {"executing", "long-running", "waiting", "resumed"}:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I omitted a unit test for this - something to write after the state machine refactor for sure

@@ -3417,7 +3434,6 @@ async def find_missing(self) -> None:
self.scheduler.who_has,
keys=[ts.key for ts in self._missing_dep_flight],
)
who_has = {k: v for k, v in who_has.items() if v}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant - update_who_has already throws away empty lists of workers

@@ -121,58 +120,43 @@ async def create_and_destroy_worker(delay):
assert await c.compute(z) == 8000884.93


@gen_cluster(nthreads=[("127.0.0.1", 1)] * 10, client=True, timeout=60)
@gen_cluster(nthreads=[("", 1)] * 10, client=True)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is functionally identical to before - all changes are just cosmetic.

Self-review


Self-review


self-review
@crusaderky
Copy link
Collaborator Author

crusaderky commented May 12, 2022

The CI failure is on these new lines at the end of handle_compute_task:

            for dep_ts in ts.dependencies:
                assert dep_ts.state != "released", self.story(dep_ts)

I can reproduce it in 0.4% (4 out of 1000) of runs on my desktop. I'll investigate over the next few days.
In the meantime, I think this PR can be reviewed and merged as is.

CC @fjetter @gjoseph92 @graingert

@crusaderky
Copy link
Collaborator Author

The one-liner fix for the infinite transition has been merged in #6331.
Explanation is here: https://github.com/dask/distributed/pull/6248/files#r870780929

What is left in this PR is a wealth of hardening, which removes some cases for corrupted state and makes other crop up sooner.

# ensure_tasks_exists() have been transitioned to fetch or flight
assert all(
ts2.state != "released" for ts2 in (ts, *ts.dependencies)
), self.story(ts, *ts.dependencies)
Copy link
Collaborator Author

@crusaderky crusaderky May 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment of writing, this assertion fails in test_stress_scatter_death 0.4% of the times on a fast desktop.
Explanation in #6305. Resolution out of scope for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this assert is not 100% correct. There is a case for valid tasks left in released in the case of cancelled/resumed tasks. I'll open a follow up PR with a case reproducing this condition

Copy link
Member

@fjetter fjetter May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flowchart TD
  A1[A1 - forgotten / not known] --> B1[B1 - flight]
  A2[A1 - forgotten / not known] --> B1[B1 - flight]
  B1 --> C1[C1 - waiting]
Loading

free-keys / cancel B1

flowchart TD
  A1[A1 - forgotten / not known] --> B1[B1 - cancelled]
  A2[A1 - forgotten / not known] --> B1[B1 - cancelled]
  B1 --> C1[C1 - forgotten]
Loading

compute-task B1

flowchart TD
  A1[A1 - released] --> B1[B1 - resumed]
  A2[A1 - released] --> B1[B1 - resumed]
  B1 --> C1[C1 - forgotten]
Loading

gather-dep finishes w/ Error

flowchart TD
  A1[A1 - fetch] --> B1[B1 - waiting]
  A2[A1 - fetch] --> B1[B1 - waiting]
  B1 --> C1[C1 - forgotten]
Loading

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand from the above diagram how this can end up with a released state by the end of compute-task?

@crusaderky crusaderky changed the title Infinite released->missing transition loop Validate and debug state machine on handle_compute_task May 13, 2022
@mrocklin
Copy link
Member

Nothing here concerns me. The biggest change is the transition change, and that was from Guido anyway. The failing test is concerning, but it seems like it's just shining a light on something that was broken before. Merging.

@mrocklin mrocklin merged commit 79d5a77 into dask:main May 13, 2022
Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was merged midway through my reviewing, but it also looks good to me. The failing test is just the 0.4% of cases where test_stress_scatter_death is known to still fail, which is a regression that still needs to be addressed #6305.

@mrocklin
Copy link
Member

Ah, my apologies for jumping ahead.

@crusaderky crusaderky deleted the test_scatter_death branch May 13, 2022 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_stress_scatter_death
4 participants