
Fix test_stress_scatter_death #6404

Merged
merged 3 commits, Jun 6, 2022

Conversation

fjetter
Member

@fjetter fjetter commented May 20, 2022

Running test_scatter_death locally, I ran into three issues:

  1. an assert on a released task. This assert statement is currently wrong. It's a subtle race condition with cancelled states which can leave a released task around with information about who_has/has_what without transitioning it to fetch.
  2. if a dependency of a task is in state missing, we should not try to transition it to fetch when performing the transition_released_waiting transition. This can cause an assert ts.who_has failure.
  3. I'm still facing a very rare KeyError in validate_state. This seems to be connected to 1.: if a task is merely in released state with who_has/has_what information and it is afterwards forgotten, we do not call the release_key method which purges task-related state (the name is confusing at the moment). The real-world implication is that the worker may keep "zombie who_has information", which should not be harmful.

Closes #6305
Closes #6191

I'm pretty sure 1. and 2. have been around for a while. No idea why this is more likely to fail lately. @crusaderky mentioned a potential connection to #6210, indicating that this always asserted/raised but we simply never noticed it.

I'm also fairly certain that the system could recover from 2. and would not deadlock if we did not raise the assert.
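
For orientation, here is a rough, hypothetical sketch of the kind of workload that exercises these races: data is scattered onto workers, dependent tasks are submitted, and a worker is torn down while dependency transfers may still be in flight. The function names, sizes, and cluster setup below are illustrative assumptions and not the actual test_stress_scatter_death implementation.

import asyncio
import random

from distributed import Client, LocalCluster


def combine(*args):
    return sum(args)


async def main():
    async with LocalCluster(
        n_workers=4, processes=False, asynchronous=True
    ) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            # Scatter raw data so later tasks depend on keys pinned to workers
            data = await client.scatter(list(range(20)))
            # Submit tasks whose dependencies must be fetched between workers
            tasks = [
                client.submit(combine, *random.sample(data, 3), pure=False)
                for _ in range(20)
            ]
            # Close a worker while transfers may still be in flight, so some
            # dependencies end up released/cancelled/missing on their peers
            victim = random.choice(list(cluster.workers.values()))
            await victim.close()
            try:
                await asyncio.wait_for(client.gather(tasks, errors="skip"), 30)
            except Exception:
                # Losing scattered data can make some results unrecoverable;
                # for this sketch only the worker state transitions matter
                pass


if __name__ == "__main__":
    asyncio.run(main())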

cc @crusaderky @jrbourbeau

@fjetter
Member Author

fjetter commented May 20, 2022

With the fixes on this PR, test_scatter_death appears to be rock solid on my local machine. I will clean up and write tests for this next week before we merge it.

This should not block the release.

@github-actions
Contributor

github-actions bot commented May 20, 2022

Unit Test Results

15 files ±0   15 suites ±0   6h 28m 18s ⏱️ +1m 56s
2 837 tests +5    2 755 ✔️ +9    81 💤 ±0    1 ❌ -3
21 021 runs +35   20 073 ✔️ +40   947 💤 +1   1 ❌ -5

For more details on these failures, see this check.

Results for commit 853a59f. ± Comparison against base commit 7e49d88.

♻️ This comment has been updated with latest results.

@fjetter fjetter marked this pull request as ready for review May 24, 2022 13:47
@fjetter fjetter changed the title from "WIP Do not transition to fetch is dep is missing during handle_compute" to "Do not transition to fetch is dep is missing during handle_compute" May 24, 2022
@fjetter fjetter requested review from crusaderky and gjoseph92 and removed request for crusaderky May 24, 2022 13:51
Comment on lines -3511 to -3570
if ts.resource_restrictions is not None:
    if ts.state == "executing":
        for resource, quantity in ts.resource_restrictions.items():
            self.available_resources[resource] += quantity
fjetter (Member Author)

I removed this transition-specific logic from the method since it is already in place for all transitions.

Everything else in this method only changed because of un-indentation.

Comment on lines +462 to +464
async def test_resumed_cancelled_handle_compute(
    c, s, a, b, raise_error, wait_for_processing
):
fjetter (Member Author), May 24, 2022

This covers the changes in handle_compute where I moved task states around in the switch statement and added an error handler that was missing before.

@fjetter fjetter requested a review from crusaderky May 25, 2022 17:10
@pentschev
Member

rerun tests

Note: rerunning gpuCI tests since those errors should be fixed by #6434

"""Test that it is OK for a dependency to be in state missing if a dependent is asked to be computed"""

f1 = c.submit(inc, 1, key="f1", workers=[w1.address])
f2 = c.submit(inc, 2, key="f2", workers=[w1.address])
Collaborator

f2 is unnecessary for the purpose of this test

fjetter (Member Author)

I added this to have multiple keys fetched from w1. I don't think this is absolutely required but I decided to keep this in the test

distributed/worker.py (outdated review comment, resolved)
Comment on lines 3487 to 3490
if key in self.data:
    del self.data[key]
if key in self.actors:
    del self.actors[key]
Collaborator

Suggested change
-    if key in self.data:
-        del self.data[key]
-    if key in self.actors:
-        del self.actors[key]
+    self.data.pop(key, None)
+    self.actors.pop(key, None)

Collaborator

Note: self.data.pop may raise OSError if the key is spilled to disk and there is a disk malfunction. This should be treated as an exceptional, unrecoverable situation and is dealt with by the @fail_hard decorator around handle_stimulus.

fjetter (Member Author)

Yes, I decided to get rid of the FileNotFoundError here. If this actually causes a problem, I prefer to fix this in our data/spilling layer.
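
As background to the OSError point above, here is a minimal, hypothetical spill-buffer sketch (TinySpillBuffer is an illustrative stand-in, not distributed's actual spilling layer) showing why popping a key can touch the filesystem and therefore surface OSError: if the value was spilled, evicting it requires disk I/O that can fail.

import os
import pickle


class TinySpillBuffer:
    """Keep small values in memory and spill large ones to pickle files."""

    def __init__(self, directory, threshold=1_000_000):
        self.directory = directory
        self.threshold = threshold
        self.fast = {}  # key -> in-memory value
        self.slow = {}  # key -> path of the spilled value

    def __setitem__(self, key, value):
        blob = pickle.dumps(value)
        if len(blob) > self.threshold:
            path = os.path.join(self.directory, f"{key}.pkl")
            with open(path, "wb") as f:  # disk write may raise OSError
                f.write(blob)
            self.slow[key] = path
        else:
            self.fast[key] = value

    def pop(self, key, default=None):
        if key in self.fast:
            return self.fast.pop(key)
        if key in self.slow:
            path = self.slow.pop(key)
            with open(path, "rb") as f:  # disk read may raise OSError
                value = pickle.load(f)
            os.remove(path)
            return value
        return default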

distributed/worker.py (outdated review comment, resolved)
distributed/worker.py (outdated review comment, resolved)
distributed/tests/test_cancelled_state.py (outdated review comment, resolved)
f1.release()
f2.release()
f3.release()
f4.release()
Collaborator

Is the scheduler going to send the release_key commands to the worker starting from the dependents and then descending into the dependencies, or in random order?

fjetter (Member Author), Jun 1, 2022

It always walks back the dependency tree. There is some ambiguity on worker A since, once we start computing f3, the only reason f1 is not yet released is that the client holds an explicit reference to it. Therefore, the order in which f1 and f2 are released on A depends on the order in which the client releases the keys.

However, this test focuses exclusively on worker B so I think it is fine to have this ambiguity in favor of test runtime and test complexity. If you think it matters, I can parametrize to switch this ordering.

About release vs. del

I use release mostly because I like holding on to the future object so that I don't depend on hard-coded key names. That's a style choice. Using del f1, f2, f3, f4 should have the exact same effect, only that I am relying on Python's reference counting to release them instead of releasing them explicitly. I like being explicit as well, but that's again a style choice.
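
A small sketch of the two styles discussed above, assuming a reachable cluster; Client(), inc, and the key names are illustrative stand-ins rather than the actual test code. It contrasts explicit Future.release() calls with dropping the last references and letting Python's reference counting release the keys.

from distributed import Client


def inc(x):
    return x + 1


c = Client()  # starts a local cluster when no address is given

f1 = c.submit(inc, 1, key="f1")
f2 = c.submit(inc, 2, key="f2")

# Explicit style: tell the scheduler this client no longer needs these keys
f1.release()
f2.release()

f3 = c.submit(inc, 3, key="f3")
f4 = c.submit(inc, 4, key="f4")

# Implicit style: dropping the last references has the same effect,
# relying on Python's reference counting to release the futures
del f3, f4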

distributed/tests/test_cancelled_state.py (outdated review comment, resolved)
distributed/tests/test_cancelled_state.py (outdated review comment, resolved)
@fjetter

This comment was marked as outdated.

@crusaderky crusaderky mentioned this pull request Jun 1, 2022
@fjetter fjetter force-pushed the more_fixes_for_test_scatter_death branch from 0144bb2 to b404dd4 Compare June 2, 2022 09:02
Comment on lines -3585 to -3587
self._notify_plugins(
    "release_key", key, state_before, cause, stimulus_id, report
)
fjetter (Member Author)

This is now gone

@fjetter

This comment was marked as outdated.

distributed/worker.py (outdated review comment, resolved)
@fjetter fjetter changed the title from "Do not transition to fetch is dep is missing during handle_compute" to "Fix test_stress_scatter_death" Jun 3, 2022
distributed/worker.py (review comment, resolved)
-except InvalidTransition:
+# ValueError may be raised by merge_recs_instructions
+# TODO: should merge_recs raise InvalidTransition?
+except (ValueError, InvalidTransition):
Collaborator

I really don't like this. ValueError is too broad and generic to be handled here. I don't want to hold the PR up; I'll open a new one shortly to narrow it down.

Development

Successfully merging this pull request may close these issues.

test_stress_scatter_death Invalid TaskState released
3 participants