
Revert skip of deadlock test [DNM yet] #212

Open · ncclementi wants to merge 1 commit into main
Conversation

ncclementi
Contributor

Reverting skip after #166 (comment)

@ncclementi ncclementi changed the title Revert skip of deadlock test Revert skip of deadlock test [DNM yet] Jul 20, 2022
@ncclementi
Contributor Author

This is still failing, but I'm not sure whether the changes from the PR that fixes this error are available in the nightlies yet. I'll wait until tomorrow and re-trigger CI.

@ncclementi
Contributor Author

So it looks like we are still having intermittent errors, but we moved from a TimeoutError to a CancelledError.

New CI failure: https://github.com/coiled/coiled-runtime/runs/7452563450?check_suite_focus=true

Previous CI failure: https://github.com/coiled/coiled-runtime/runs/7172469972?check_suite_focus=true#step:6:289

@gjoseph92 and @hendrikmakait How would you like to proceed here?

@gjoseph92
Contributor

I'm not sure. Maybe this is a new thing. It would probably be best to just run the test manually and watch what it does. It's hard to tell just from logs.
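
(For reference, a minimal sketch of what "running the test manually" could look like, assuming a configured Coiled account; the cluster size and the workload below are placeholders, not the actual coiled-runtime test.)

```python
# Hypothetical sketch: reproduce the workload by hand and watch the dashboard.
# coiled.Cluster and Client.dashboard_link are real APIs; the array computation
# below is only a stand-in for the deadlock-prone test.
import coiled
import dask.array as da
from distributed import Client

cluster = coiled.Cluster(n_workers=10)  # assumes Coiled credentials are configured
client = Client(cluster)
print(client.dashboard_link)            # open this URL and watch task progress

x = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))
print((x @ x.T).sum().compute())        # placeholder computation
```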

@gjoseph92
Contributor

This might also be the actual deadlock happening again. Briefly looking at the cluster dump, 50% of the stuck tasks are on a worker that had become unresponsive and that the scheduler could no longer talk to, a typical symptom of thrashing from page faults.
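
(For reference, a rough sketch of how such a cluster dump can be captured and loaded for inspection; the scheduler address and filename are placeholders, and the dump layout is not shown beyond the top-level keys.)

```python
# Sketch: capture and load a cluster state dump for post-mortem inspection.
# Client.dump_cluster_state is a real distributed API; the address, the filename,
# and the assumption that the default msgpack output is gzip-compressed are
# illustrative here, not taken from the coiled-runtime test.
import gzip
import msgpack
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address
client.dump_cluster_state("stuck-cluster")       # msgpack format by default

with gzip.open("stuck-cluster.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f)
print(list(state))  # top-level sections of the dump
```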

@ncclementi
Contributor Author

@fjetter How do you want to go about this?

We marked this test xfail because CI was failing intermittently but quite regularly. There was a proposed change, but with it we only got a different error message.
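
(For context, the marker in question looks roughly like the following; the test name and reason string are placeholders, not the actual coiled-runtime code.)

```python
import pytest

# Placeholder sketch of the kind of xfail/skip marker this PR reverts;
# the test name and reason are illustrative only.
@pytest.mark.xfail(reason="intermittent deadlock, see #166", strict=False)
def test_deadlock():
    ...
```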

@gjoseph92
Contributor

I reran the test manually and watched the dashboard. It failed twice and passed once. The dashboard looks exactly the same as what we were seeing in dask/distributed#6110 (comment):

[Screenshot of the dashboard, taken 2022-08-04 at 4:42:49 PM]

Workers have been unresponsive for multiple minutes. Clicking on their call stacks just hangs.

It's quite clear that this is just dask/distributed#6177. Most likely, the fix in dask/distributed#6177 just reduced the probability of getting into an out-of-memory page-thrashing state, but did not entirely prevent it from happening the way a proper memory limit would. Anecdotally, it did take more iterations than it used to for the failure to trigger.

This is expected. dask/distributed#6189 was just a heuristic; heuristics sometimes fail.
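
(For what it's worth, a hard limit of the sort mentioned above can be approximated with distributed's worker memory thresholds; this is a minimal sketch with illustrative fractions, not recommended values.)

```python
import dask

# Sketch: tighten distributed's worker memory thresholds so workers spill, pause,
# and get restarted by the nanny before the OS starts page-thrashing.
# The config keys exist in distributed; the fractions here are illustrative only.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop running new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})
```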

@gjoseph92 gjoseph92 mentioned this pull request Aug 4, 2022
@jrbourbeau
Member

cc @fjetter for visibility. I believe @gjoseph92 is saying that we previously thought we had fixed this locking behavior, but it turns out we just decreased the likelihood of it failing. Just wanted to put this on your radar so we can figure out how we want to proceed.

@fjetter
Member

fjetter commented Aug 10, 2022

We knew that the fundamental root cause cannot be fixed on the dask side, but we were hoping the likelihood of failure was sufficiently reduced.
If this causes too many failures and is too noisy, we need to wait for a fix on the platform side and can keep skipping the test.

@ntabris
Member

ntabris commented Aug 10, 2022

“wait for a fix on platform side”

Is there more info somewhere about what would need to be fixed on the platform side?

@fjetter
Member

fjetter commented Aug 11, 2022
