test_stress_scatter_death #6305

crusaderky · 2022-05-09T10:03:28Z

test_stress_scatter_death has suddenly started hanging very frequently on Windows and MacOSX.
According to https://dask.org/distributed/test_report.html, the potential culprit is either #6217 or #6248.
CC @fjetter

Note: in the test report above, the failures on Windows are marked as a white box instead of red due to #6304.

The text was updated successfully, but these errors were encountered:

crusaderky · 2022-05-09T10:20:54Z

Update: it hangs on Ubuntu too: https://github.com/dask/distributed/runs/6349119823?check_suite_focus=true

It looks like an infinite transition loop (but I may be wrong)

ERROR    asyncio:base_events.py:1744 Exception in callback IOLoop.add_future.<locals>.<lambda>(<Task finishe...meout >300.0s>) at /usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py:688
handle: <Handle IOLoop.add_future.<locals>.<lambda>(<Task finishe...meout >300.0s>) at /usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py:688>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py", line 688, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 175, in wrapper
    return await method(self, *args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1278, in handle_scheduler
    await self.handle_stream(comm, every_cycle=[self.ensure_communicating])
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 651, in handle_stream
    handler(**merge(extra, msg))
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2013, in handle_compute_task
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2764, in transitions
    a_recs, a_instructions = self._transition(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2651, in _transition
    func = self._transitions_table.get((start, cast(str, finish)))
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/typing.py", line 1736, in cast
    def cast(typ, val):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pytest_timeout.py", line 241, in handler
    timeout_sigalrm(item, settings.timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/pytest_timeout.py", line 409, in timeout_sigalrm
    pytest.fail("Timeout >%ss" % timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/_pytest/outcomes.py", line 196, in fail
    raise Failed(msg=reason, pytrace=pytrace)
Failed: Timeout >300.0s

jakirkham · 2022-05-10T00:31:32Z

Also seen in this CI run.

crusaderky · 2022-05-10T10:25:30Z

Replicated on high-powered desktop after 51 runs

crusaderky · 2022-05-11T10:59:13Z

Downstream of #6318, I found three separate issues:

Infinite transition loop (failed 8 out of 1000 runs)


2022-05-11 11:37:20,905 - distributed.core - ERROR - TransitionCounterMaxExceeded: slowadd-1-17 :: released->missing
  Story:
    ('slowadd-1-17', 'ensure-task-exists', 'released', 'compute-task-1652265440.70282', 1652265440.7033925)
    ('slowadd-1-17', 'released', 'fetch', 'fetch', {}, 'compute-task-1652265440.70282', 1652265440.703413)
    ('gather-dependencies', 'tcp://127.0.0.1:35075', {'slowadd-1-17'}, 'ensure-communicating-1652265440.7034416', 1652265440.7034585)
    ('slowadd-1-17', 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1652265440.7034416', 1652265440.7034647)
    ('request-dep', 'tcp://127.0.0.1:35075', {'slowadd-1-17'}, 'ensure-communicating-1652265440.7034416', 1652265440.7036455)
    ('receive-dep', 'tcp://127.0.0.1:35075', {'slowadd-1-17'}, 'ensure-communicating-1652265440.7034416', 1652265440.7087567)
    ('slowadd-1-17', 'put-in-memory', 'ensure-communicating-1652265440.7034416', 1652265440.7087805)
    ('slowadd-1-17', 'flight', 'memory', 'memory', {'slowadd-2-16': 'ready'}, 'ensure-communicating-1652265440.7034416', 1652265440.708784)
    ('slowadd-1-17', 'compute-task', 'compute-task-1652265440.7051446', 1652265440.7092702)
    ('remove-replicas', ('slowadd-1-17',), 'ensure-communicating-1652265440.7034416', 1652265440.7145007)
    ('slowadd-1-17', 'remove-replica-confirmed', 'ensure-communicating-1652265440.7034416', 1652265440.714504)
    ('slowadd-1-17', 'release-key', 'ensure-communicating-1652265440.7034416', 1652265440.714511)
    ('slowadd-1-17', 'memory', 'released', 'released', {'slowadd-1-17': 'forgotten'}, 'ensure-communicating-1652265440.7034416', 1652265440.7145405)
    ('slowadd-1-17', 'released', 'forgotten', 'forgotten', {}, 'ensure-communicating-1652265440.7034416', 1652265440.7145462)
    ('slowadd-1-17', 'ensure-task-exists', 'released', 'compute-task-1652265440.7120328', 1652265440.7145758)
    ('slowadd-1-17', 'released', 'fetch', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145944)
    ('slowadd-1-17', 'released', 'missing', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145967)
    ('slowadd-1-17', 'released', 'missing', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145994)
    [repeat last line to infinity]

failed has_what validaton (failed 5 out of 1000 runs)

  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4186, in validate_task
    self.validate_task_released(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4157, in validate_task_released
    assert not any(ts.key in has_what for has_what in self.has_what.values())
AssertionError

Failed data_needed validation (failed 1 out of 1000 runs)


Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4184, in validate_task
    self.validate_task_fetch(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4125, in validate_task_fetch
    assert ts in self.data_needed
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4246, in validate_state
    self.validate_task(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4204, in validate_task
    raise AssertionError(
AssertionError: Invalid TaskState encountered for <TaskState 'ndarray-9e6174fa82b2a0609f4482bf0c7e4d80' fetch>.
Story:
[('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'ensure-task-exists', 'released', 'compute-task-1652266312.2669225', 1652266312.2685668), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'released', 'fetch', 'fetch', {}, 'compute-task-1652266312.2669225', 1652266312.2685966), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'release-key', 'worker-close-1652266312.265506', 1652266312.2686384), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'fetch', 'released', 'released', {'ndarray-9e6174fa82b2a0609f4482bf0c7e4d80': 'forgotten'}, 'worker-close-1652266312.265506', 1652266312.2686439), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'released', 'forgotten', 'forgotten', {}, 'worker-close-1652266312.265506', 1652266312.2686467), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'ensure-task-exists', 'released', 'compute-task-1652266312.2671275', 1652266312.2687173), ('ndarray-9e6174fa82b2a0609f4482bf0c7e4d80', 'released', 'fetch', 'fetch', {}, 'compute-task-1652266312.2671275', 1652266312.268733)]

I have no evidence to suggest that the last two are recent regressions; they may have been there for a long time but we were not invoking Worker.validate_state before.

crusaderky · 2022-05-11T17:04:09Z

@fjetter @jrbourbeau @jsignell I think this should be treated as a blocker for the next release

crusaderky · 2022-05-11T19:12:00Z

Dumping down my understanding so far of the story:

    ('remove-replicas', ('slowadd-1-17',), 'ensure-communicating-1652265440.7034416', 1652265440.7145007)
    ('slowadd-1-17', 'remove-replica-confirmed', 'ensure-communicating-1652265440.7034416', 1652265440.714504)
    ('slowadd-1-17', 'release-key', 'ensure-communicating-1652265440.7034416', 1652265440.714511)
    ('slowadd-1-17', 'memory', 'released', 'released', {'slowadd-1-17': 'forgotten'}, 'ensure-communicating-1652265440.7034416', 1652265440.7145405)
    ('slowadd-1-17', 'released', 'forgotten', 'forgotten', {}, 'ensure-communicating-1652265440.7034416', 1652265440.7145462)
    ('slowadd-1-17', 'ensure-task-exists', 'released', 'compute-task-1652265440.7120328', 1652265440.7145758)
    ('slowadd-1-17', 'released', 'fetch', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145944)
    ('slowadd-1-17', 'released', 'missing', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145967)

Let's break it down:

    ('remove-replicas', ('slowadd-1-17',), 'ensure-communicating-1652265440.7034416', 1652265440.7145007)
    ('slowadd-1-17', 'remove-replica-confirmed', 'ensure-communicating-1652265440.7034416', 1652265440.714504)
    ('slowadd-1-17', 'release-key', 'ensure-communicating-1652265440.7034416', 1652265440.714511)
    ('slowadd-1-17', 'memory', 'released', 'released', {'slowadd-1-17': 'forgotten'}, 'ensure-communicating-1652265440.7034416', 1652265440.7145405)
    ('slowadd-1-17', 'released', 'forgotten', 'forgotten', {}, 'ensure-communicating-1652265440.7034416', 1652265440.7145462)

The key was in memory on the worker.
The scheduler asked the worker to drop it, either from Scheduler.add_keys or Scheduler.add_worker.

    ('slowadd-1-17', 'ensure-task-exists', 'released', 'compute-task-1652265440.7120328', 1652265440.7145758)

this is saying that the scheduler believes that slowadd-1-17 is in memory on another worker, so it just scheduled either slowadd-2-16 or slowadd-2-17 to run on this worker (the two dependents of slowadd-1-17).
The method call should look like this:

Worker.handle_compute_task(
   key="slowadd-2-17",
   who_has={"slowadd-1-17": [w1, w2, ...], "slowadd-1-18": [w1, w2, ...]},
    ...
)

The same method, at the end, calls self.update_who_has(who_has), which sets ts.who_has = {w1, w2, ...}.

The released bit next to ensure-task-exists is confirming that the task is nowhere to be found in Worker.tasks - e.g. the previous release-key command was completed cleanly.

    ('slowadd-1-17', 'released', 'fetch', 'released', {'slowadd-1-17': 'missing'}, 'compute-task-1652265440.7120328', 1652265440.7145944)

This is what baffles me. This should not be possible when stimulus_id="compute-task-...".

Here we just executed Worker.transition_released_fetch(ts); however ts.who_has by now has become empty, so the function returned {ts: "missing"}, [].

However, this is in the same stimulus_id as in ensure-task-exists - meaning there shouldn't have been anything in between. Notably, there should not have been a different handle_compute_task which emptied the who_has.

So either

my understanding of the above - particularly the last paragraph - is wrong, or
Worker.handle_compute_task passed an empty who_has to begin with - which should never happen? At the time the Scheduler issues the compute command, it should be convinced that all dependencies are in memory.

crusaderky · 2022-05-11T20:42:36Z

Updates and corrections:

update_who_has can't remove elements from who_has, only add them
scheduler.py::_task_to_msg() is sending a compute-task command with empty who_has. Unsure if it's called from Scheduler.transition_waiting_processing or from stealing.py. To my understanding this should never happen; however, this should also not be fatal: the dependency should simply transition from fetch to missing.
Worker.ensure_task_exists occasionally finds a task in Worker.tasks with state="released". This seems a fairly rare occurrence and not related to this specific test/issue. However, it makes no sense to me, as released should just be a degenerate state in the middle of Worker.transitions, and ensure_task_exists is never called in the middle of one.

crusaderky · 2022-05-11T21:51:23Z

Found it. The infinite transition loop is a one-liner typo introduced in #6248: https://github.com/dask/distributed/pull/6248/files#r870780929

Downstream of fixing the one-liner, I'm now getting:

7 failures out of 1000:

  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4281, in validate_task
    self.validate_task_released(ts)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 4252, in validate_task_released
    assert not any(ts.key in has_what for has_what in self.has_what.values())

This failure is (deliberately) introduced by #6318; I don't think it's a recent regression.
It's related to the fact I mentioned above that we have released tasks in Worker.tasks, outside of a transitions loop, and I can't think of any valid use case for it.

2 failures out of 1000:

  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2026, in handle_compute_task
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2795, in transitions
    a_recs, a_instructions = self._transition(
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2686, in _transition
    recs, instructions = func(ts, *args, stimulus_id=stimulus_id, **kwargs)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2045, in transition_missing_fetch
    assert ts.who_has
AssertionError

This again is caused by handle_compute_task with an empty who_has. We should either crash loudly in handle_compute_task or handle it gracefully and keep the task in missing state.

crusaderky · 2022-05-12T20:36:23Z

Downstream of #6327, I have only these failures left:

4 out of 1000

  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2044, in handle_compute_task
    assert dep_ts.state != "released", self.story(dep_ts)
AssertionError: [('ndarray-a373b7c1c84bcff2443e589a33117df5', 'ensure-task-exists', 'released', 'compute-task-1652371566.7855935', 1652371566.8192596)]

This is a new assertion, which detects earlier both of the issues listed in my post above.

crusaderky · 2022-05-13T11:25:52Z

I now have a better understanding of the error above:

 ('scatter-0', 'ensure-task-exists', 'released', 'compute-task-1652440434.4929621', 1652440434.501996),
 ('scatter-0', 'released', 'fetch', 'fetch', {}, 'compute-task-1652440434.4929621', 1652440434.5020301),
 ('gather-dependencies', 'tcp://127.0.0.1:33719', {'scatter-0'}, 'compute-task-1652440434.4929621', 1652440434.5020645),
 ('scatter-0', 'fetch', 'flight', 'flight', {}, 'compute-task-1652440434.4929621', 1652440434.502074),
 ('request-dep', 'tcp://127.0.0.1:33719', {'scatter-0'}, 'compute-task-1652440434.4929621', 1652440434.5090077),
 ('receive-dep', 'tcp://127.0.0.1:33719', {'scatter-0'}, 'compute-task-1652440434.4929621', 1652440434.5544157),
 ('scatter-0', 'put-in-memory', 'compute-task-1652440434.4929621', 1652440434.5544426),
 ('scatter-0', 'flight', 'memory', 'memory', {'slowadd-1-2': 'ready'}, 'compute-task-1652440434.4929621', 1652440434.5544457),

 ('slowadd-1-38', 'ensure-task-exists', 'released', 'compute-task-1652440434.8344328', 1652440434.8419213),
 ('slowadd-1-38', 'released', 'fetch', 'fetch', {}, 'compute-task-1652440434.8344328', 1652440434.841943),
 ('gather-dependencies', 'tcp://127.0.0.1:42613', {'slowadd-1-38'}, 'compute-task-1652440434.8344328', 1652440434.8419714),
 ('slowadd-1-38', 'fetch', 'flight', 'flight', {}, 'compute-task-1652440434.8344328', 1652440434.8419788),
 ('slowadd-1-38', 'flight', 'released', 'cancelled', {}, 'worker-close-1652440434.833075', 1652440434.842025),

 ('slowadd-1-38', 'compute-task', 'cancelled', 'compute-task-1652440434.8352494', 1652440434.8420353),
 ('scatter-3', 'ensure-task-exists', 'released', 'compute-task-1652440434.8352494', 1652440434.8420422),
 ('scatter-0', 'ensure-task-exists', 'memory', 'compute-task-1652440434.8352494', 1652440434.8420436),
 ('slowadd-1-38', 'cancelled', 'waiting', 'resumed', {}, 'compute-task-1652440434.8352494', 1652440434.8420599)]

'slowadd-1-38' is a dependent of 'scatter-0' and 'scatter-3', and a dependency of 'slowadd-2-37' and 'slowadd-2-38'.
What's happening above is that the scheduler asks the worker to compute 'slowadd-1-38', which already exists in cancelled state from a previous attempt to fetch it from tcp://127.0.0.1:42613 (which died in the meantime).
One of its two dependencies, 'scatter-3', is not in memory and needs to be fetched from somewhere else.

The transitions system calls transition_cancelled_waiting, which moves the task to resumed.
The task is never moved OUT of resumed, and crucially transition_released_waiting is never called.
As a direct consequence, 'scatter-3' is never transitioned from released to fetch.

There has been a recent change on this specific mechanism: #6217 @fjetter

Partially closes #6305

jakirkham · 2022-05-13T20:00:52Z

Given PR ( #6327 ) says it partially closes this issue, should we reopen or did the PR end up fixing this completely?

gjoseph92 · 2022-05-13T20:01:14Z

As I understand, #6305 (comment) is catching a real (but rarer) problem that still needs to be fixed. Reopening.

fjetter · 2022-05-20T08:06:34Z

Should be fixed by #6370

crusaderky · 2022-05-20T15:14:59Z

Not fixed - it's still failing 0.4% of the time on my desktop host and much more frequently on CI.

Stack trace:

  File "/home/crusaderky/github/distributed/distributed/worker.py", line 1990, in handle_compute_task
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2820, in transitions
    a_recs, a_instructions = self._transition(
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2709, in _transition
    recs, instructions = func(
  File "/home/crusaderky/github/distributed/distributed/worker.py", line 2014, in transition_missing_fetch
    assert ts.who_has

crusaderky added the regression label May 9, 2022

crusaderky assigned fjetter May 9, 2022

crusaderky mentioned this issue May 9, 2022

Refactor ensure_communicating #6165

Merged

crusaderky assigned crusaderky and unassigned fjetter May 9, 2022

crusaderky mentioned this issue May 10, 2022

Prevent infinite transition loops; more aggressive validate_state() #6318

Merged

crusaderky mentioned this issue May 11, 2022

Validate and debug state machine on handle_compute_task #6327

Merged

crusaderky mentioned this issue May 11, 2022

Race conditions from fetch to compute while AMM requests replica #6248

Merged

crusaderky mentioned this issue May 12, 2022

Release 2022.05.1 dask/community#245

Closed

crusaderky mentioned this issue May 13, 2022

Ensure resumed tasks are not accidentally forgotten #6217

Merged

2 tasks

mrocklin closed this as completed in #6327 May 13, 2022

mrocklin pushed a commit that referenced this issue May 13, 2022

Validate and debug state machine on handle_compute_task (#6327)

79d5a77

Partially closes #6305

gjoseph92 reopened this May 13, 2022

gjoseph92 mentioned this issue May 18, 2022

Improve catch_unhandled_exceptions #6358

Merged

2 tasks

fjetter added the flaky test Intermittent failures on CI. label May 18, 2022

This was referenced May 18, 2022

Ensure multiple clients can cancel their key without interference #6016

Merged

Ensure occupancy tracking works as expected for long running tasks #6351

Merged

gjoseph92 mentioned this issue May 18, 2022

Remove worker reconnect #6361

Merged

2 tasks

fjetter mentioned this issue May 19, 2022

Remove report and safe from Worker.close #6363

Merged

1 task

crusaderky assigned fjetter and unassigned crusaderky May 19, 2022

fjetter closed this as completed May 20, 2022

fjetter mentioned this issue May 20, 2022

Do not filter tasks before gathering data #6371

Merged

crusaderky reopened this May 20, 2022

fjetter mentioned this issue May 20, 2022

Fix test_stress_scatter_death #6404

Merged

quasiben mentioned this issue May 23, 2022

Disable HTTP API by default #6420

Merged

graingert mentioned this issue May 26, 2022

use asyncio.run to run gen_cluster, gen_test and cluster #6231

Merged

3 tasks

This was referenced Jun 2, 2022

Add config option to disable profiling and disable it in many tests per default #6490

Merged

Replace loop.call_later and loop.add_callback with background tasks added to Server. #6478

Closed

graingert mentioned this issue Jun 6, 2022

deprecate the io_loop and loop kwarg to Server, Worker, and Nanny #6473

Merged

2 tasks

crusaderky closed this as completed in #6404 Jun 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test_stress_scatter_death #6305

test_stress_scatter_death #6305

crusaderky commented May 9, 2022

crusaderky commented May 9, 2022

jakirkham commented May 10, 2022

crusaderky commented May 10, 2022

crusaderky commented May 11, 2022

crusaderky commented May 11, 2022

crusaderky commented May 11, 2022

crusaderky commented May 11, 2022

crusaderky commented May 11, 2022 •

edited

Loading

crusaderky commented May 12, 2022

crusaderky commented May 13, 2022 •

edited

Loading

jakirkham commented May 13, 2022

gjoseph92 commented May 13, 2022

fjetter commented May 20, 2022

crusaderky commented May 20, 2022

test_stress_scatter_death #6305

test_stress_scatter_death #6305

Comments

crusaderky commented May 9, 2022

crusaderky commented May 9, 2022

jakirkham commented May 10, 2022

crusaderky commented May 10, 2022

crusaderky commented May 11, 2022

crusaderky commented May 11, 2022

crusaderky commented May 11, 2022

crusaderky commented May 11, 2022

crusaderky commented May 11, 2022 • edited Loading

crusaderky commented May 12, 2022

crusaderky commented May 13, 2022 • edited Loading

jakirkham commented May 13, 2022

gjoseph92 commented May 13, 2022

fjetter commented May 20, 2022

crusaderky commented May 20, 2022

crusaderky commented May 11, 2022 •

edited

Loading

crusaderky commented May 13, 2022 •

edited

Loading