
Fix test_stress_scatter_death #6404

Merged
merged 3 commits, Jun 6, 2022

Conversation

fjetter
Member

@fjetter fjetter commented May 20, 2022

Running test_scatter_death locally, I ran into three issues:

  1. an assert on a released task. This assert statement is currently wrong. It's a subtle race condition with cancelled states which can leave a released task around with information about who_has/has_what without transitioning it to fetch.
  2. if a dependency of a task is in state missing, we should not try to transition it to fetch when performing the transition_released_waiting transition. This can cause an assert ts.who_has failure.
  3. I'm still facing a very rare KeyError in validate_state. This seems to be connected to 1.: if a task is merely in released state with who_has/has_what information and it is afterwards forgotten, we do not call the release_key method which purges task-related state (the name is confusing at the moment). The real-world implication is that the worker may keep "zombie who_has information", which should not be harmful.

Closes #6305
Closes #6191

I'm pretty sure 1. and 2. have been around for a while. No idea why this is more likely to fail lately. @crusaderky mentioned a potential connection to #6210, indicating that this always asserted/raised but we simply never noticed it.

I'm also fairly certain that the system could recover from 2. and would not deadlock if we did not raise the assert.
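
For orientation, here is a rough, hypothetical sketch of the kind of workload that exercises these races: data is scattered onto workers, dependent tasks are submitted, and a worker is torn down while dependency transfers may still be in flight. The function names, sizes, and cluster setup below are illustrative assumptions and not the actual test_stress_scatter_death implementation.

import asyncio
import random

from distributed import Client, LocalCluster


def combine(*args):
    return sum(args)


async def main():
    async with LocalCluster(
        n_workers=4, processes=False, asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            # Scatter raw data so later tasks depend on keys pinned to workers
            data = await client.scatter(list(range(20)))
            # Submit tasks whose dependencies must be fetched between workers
            tasks = [
                client.submit(combine, *random.sample(data, 3), pure=False)
                for _ in range(20)
            ]
            # Close a worker while transfers may still be in flight, so some
            # dependencies end up released/cancelled/missing on their peers
            victim = random.choice(list(cluster.workers.values()))
            await victim.close()
            try:
                await asyncio.wait_for(client.gather(tasks, errors="skip"), 30)
            except Exception:
                # Losing scattered data can make some results unrecoverable;
                # for this sketch only the worker state transitions matter
                pass


if __name__ == "__main__":
    asyncio.run(main())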

cc @crusaderky @jrbourbeau

@fjetter
Member Author

fjetter commented May 20, 2022

With the fixes on this PR, test_scatter_death appears to be rock solid on my local machine. I will clean up and write tests for this next week before we merge it.

This should not block the release.

@github-actions
Contributor

github-actions bot commented May 20, 2022

Unit Test Results

15 files ±0   15 suites ±0   6h 28m 18s ⏱️ +1m 56s
2 837 tests +5    2 755 ✔️ +9    81 💤 ±0    1 ❌ -3
21 021 runs +35   20 073 ✔️ +40   947 💤 +1   1 ❌ -5

For more details on these failures, see this check.

Results for commit 853a59f. ± Comparison against base commit 7e49d88.

♻️ This comment has been updated with latest results.

@fjetter fjetter marked this pull request as ready for review May 24, 2022 13:47
@fjetter fjetter changed the title from "WIP Do not transition to fetch is dep is missing during handle_compute" to "Do not transition to fetch is dep is missing during handle_compute" May 24, 2022
@fjetter fjetter requested review from crusaderky and gjoseph92 and removed request for crusaderky May 24, 2022 13:51
Comment on lines -3511 to -3570
if ts.resource_restrictions is not None:
    if ts.state == "executing":
        for resource, quantity in ts.resource_restrictions.items():
            self.available_resources[resource] += quantity
fjetter (Member Author)

I removed this transition-specific logic from the method since it is already in place for all transitions.

Everything else in this method only changed because of un-indentation.

Comment on lines +462 to +464
async def test_resumed_cancelled_handle_compute(
    c, s, a, b, raise_error, wait_for_processing
):
fjetter (Member Author), May 24, 2022

This covers the changes in handle_compute where I moved task states around in the switch statement and added an error handler that was missing before.

@fjetter fjetter requested a review from crusaderky May 25, 2022 17:10
@pentschev
Member

rerun tests

Note: rerunning gpuCI tests since those errors should be fixed by #6434

"""Test that it is OK for a dependency to be in state missing if a dependent is asked to be computed"""

f1 = c.submit(inc, 1, key="f1", workers=[w1.address])
f2 = c.submit(inc, 2, key="f2", workers=[w1.address])
Collaborator

f2 is unnecessary for the purpose of this test

fjetter (Member Author)

I added this to have multiple keys fetched from w1. I don't think this is absolutely required but I decided to keep this in the test

distributed/worker.py (outdated review comment, resolved)
Comment on lines 3487 to 3490
if key in self.data:
    del self.data[key]
if key in self.actors:
    del self.actors[key]
Collaborator

Suggested change
-    if key in self.data:
-        del self.data[key]
-    if key in self.actors:
-        del self.actors[key]
+    self.data.pop(key, None)
+    self.actors.pop(key, None)

Collaborator

Note: self.data.pop may raise OSError if the key is spilled to disk and there is a disk malfunction. This should be treated as an exceptional, unrecoverable situation and is dealt with by the @fail_hard decorator around handle_stimulus.

fjetter (Member Author)

Yes, I decided to get rid of the FileNotFoundError here. If this actually causes a problem, I prefer to fix this in our data/spilling layer.
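
As background to the OSError point above, here is a minimal, hypothetical spill-buffer sketch (TinySpillBuffer is an illustrative stand-in, not distributed's actual spilling layer) showing why popping a key can touch the filesystem and therefore surface OSError: if the value was spilled, evicting it requires disk I/O that can fail.

import os
import pickle


class TinySpillBuffer:
    """Keep small values in memory and spill large ones to pickle files."""

    def __init__(self, directory, threshold=1_000_000):
        self.directory = directory
        self.threshold = threshold
        self.fast = {}  # key -> in-memory value
        self.slow = {}  # key -> path of the spilled value

    def __setitem__(self, key, value):
        blob = pickle.dumps(value)
        if len(blob) > self.threshold:
            path = os.path.join(self.directory, f"{key}.pkl")
            with open(path, "wb") as f:  # disk write may raise OSError
                f.write(blob)
            self.slow[key] = path
        else:
            self.fast[key] = value

    def pop(self, key, default=None):
        if key in self.fast:
            return self.fast.pop(key)
        if key in self.slow:
            path = self.slow.pop(key)
            with open(path, "rb") as f:  # disk read may raise OSError
                value = pickle.load(f)
            os.remove(path)
            return value
        return default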

distributed/worker.py (outdated review comment, resolved)
distributed/worker.py (outdated review comment, resolved)
distributed/tests/test_cancelled_state.py (outdated review comment, resolved)
f1.release()
f2.release()
f3.release()
f4.release()
Collaborator

Is the scheduler going to send the release_key commands to the worker starting from the dependents and then descending into the dependencies, or in random order?

fjetter (Member Author), Jun 1, 2022

It always walks back the dependency tree. There is some ambiguity on worker A since, once we start computing f3, the only reason f1 is not yet released is that the client holds an explicit reference to it. Therefore, the order in which f1 and f2 are released on A depends on the order in which the client releases the keys.

However, this test focuses exclusively on worker B so I think it is fine to have this ambiguity in favor of test runtime and test complexity. If you think it matters, I can parametrize to switch this ordering.

About release vs. del

I use release mostly because I like holding on to the future object so that I don't depend on hard-coded key names. That's a style choice. Using del f1, f2, f3, f4 should have the exact same effect, only that I am relying on Python's reference counting to release them instead of releasing them explicitly. I like being explicit as well, but that's again a style choice.
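
A small sketch of the two styles discussed above, assuming a reachable cluster; Client(), inc, and the key names are illustrative stand-ins rather than the actual test code. It contrasts explicit Future.release() calls with dropping the last references and letting Python's reference counting release the keys.

from distributed import Client


def inc(x):
    return x + 1


c = Client()  # starts a local cluster when no address is given

f1 = c.submit(inc, 1, key="f1")
f2 = c.submit(inc, 2, key="f2")

# Explicit style: tell the scheduler this client no longer needs these keys
f1.release()
f2.release()

f3 = c.submit(inc, 3, key="f3")
f4 = c.submit(inc, 4, key="f4")

# Implicit style: dropping the last references has the same effect,
# relying on Python's reference counting to release the futures
del f3, f4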

distributed/tests/test_cancelled_state.py (outdated review comment, resolved)
distributed/tests/test_cancelled_state.py (outdated review comment, resolved)
@fjetter

This comment was marked as outdated.

@crusaderky crusaderky mentioned this pull request Jun 1, 2022
@fjetter fjetter force-pushed the more_fixes_for_test_scatter_death branch from 0144bb2 to b404dd4 Compare June 2, 2022 09:02
Comment on lines -3585 to -3587
self._notify_plugins(
    "release_key", key, state_before, cause, stimulus_id, report
)
fjetter (Member Author)

This is now gone

@fjetter

This comment was marked as outdated.

distributed/worker.py (outdated review comment, resolved)
@fjetter fjetter changed the title from "Do not transition to fetch is dep is missing during handle_compute" to "Fix test_stress_scatter_death" Jun 3, 2022
distributed/worker.py (review comment, resolved)
-except InvalidTransition:
+# ValueError may be raised by merge_recs_instructions
+# TODO: should merge_recs raise InvalidTransition?
+except (ValueError, InvalidTransition):
Collaborator

I really don't like this. ValueError is too broad and generic to be handled here. I don't want to hold the PR up; I'll open a new one shortly to narrow it down.

Development

Successfully merging this pull request may close these issues.

test_stress_scatter_death Invalid TaskState released
3 participants