
Revert skip of deadlock test [DNM yet] #212

Open · ncclementi wants to merge 1 commit into main
Conversation

ncclementi
Contributor

Reverting skip after #166 (comment)

@ncclementi ncclementi changed the title Revert skip of deadlock test Revert skip of deadlock test [DNM yet] Jul 20, 2022
@ncclementi
Contributor Author

This is still failing, but I'm not sure whether the changes from the PR that fixes this error are available in the nightlies yet. I'll wait until tomorrow and re-trigger CI.

@ncclementi
Contributor Author

So it looks like we are still having intermittent errors, but we moved from a TimeoutError to a CancelledError.

New CI failure: https://github.com/coiled/coiled-runtime/runs/7452563450?check_suite_focus=true

Previous CI failure: https://github.com/coiled/coiled-runtime/runs/7172469972?check_suite_focus=true#step:6:289

@gjoseph92 and @hendrikmakait How would you like to proceed here?

@gjoseph92
Contributor

I'm not sure. Maybe this is a new thing. It would probably be best to just run the test manually and watch what it does. It's hard to tell just from logs.
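
(For reference, a minimal sketch of what "running the test manually" could look like, assuming a configured Coiled account; the cluster size and the workload below are placeholders, not the actual coiled-runtime test.)

```python
# Hypothetical sketch: reproduce the workload by hand and watch the dashboard.
# coiled.Cluster and Client.dashboard_link are real APIs; the array computation
# below is only a stand-in for the deadlock-prone test.
import coiled
import dask.array as da
from distributed import Client

cluster = coiled.Cluster(n_workers=10)  # assumes Coiled credentials are configured
client = Client(cluster)
print(client.dashboard_link)            # open this URL and watch task progress

x = da.random.random((50_000, 50_000), chunks=(5_000, 5_000))
print((x @ x.T).sum().compute())        # placeholder computation
```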

@gjoseph92
Contributor

This might also be the actual deadlock happening again. Briefly looking at the cluster dump, 50% of the stuck tasks are on a worker that had become unresponsive and that the scheduler could no longer talk to, a typical symptom of thrashing from page faults.
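
(For reference, a rough sketch of how such a cluster dump can be captured and loaded for inspection; the scheduler address and filename are placeholders, and the dump layout is not shown beyond the top-level keys.)

```python
# Sketch: capture and load a cluster state dump for post-mortem inspection.
# Client.dump_cluster_state is a real distributed API; the address, the filename,
# and the assumption that the default msgpack output is gzip-compressed are
# illustrative here, not taken from the coiled-runtime test.
import gzip
import msgpack
from distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder scheduler address
client.dump_cluster_state("stuck-cluster")       # msgpack format by default

with gzip.open("stuck-cluster.msgpack.gz", "rb") as f:
    state = msgpack.unpack(f)
print(list(state))  # top-level sections of the dump
```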

@ncclementi
Contributor Author

@fjetter How do you want to go about this?

We marked this test xfail because CI was failing intermittently but quite regularly. There was a proposed change, but with it we only got a different error message.
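
(For context, the marker in question looks roughly like the following; the test name and reason string are placeholders, not the actual coiled-runtime code.)

```python
import pytest

# Placeholder sketch of the kind of xfail/skip marker this PR reverts;
# the test name and reason are illustrative only.
@pytest.mark.xfail(reason="intermittent deadlock, see #166", strict=False)
def test_deadlock():
    ...
```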

@gjoseph92
Contributor

I reran the test manually and watched the dashboard. It failed twice and passed once. The dashboard looks exactly the same as what we were seeing in dask/distributed#6110 (comment):

[Screenshot of the dashboard, taken 2022-08-04 at 4:42:49 PM]

Workers have been unresponsive for multiple minutes. Clicking on their call stacks just hangs.

It's quite clear that this is just dask/distributed#6177. Most likely, the fix in dask/distributed#6177 just reduced the probability of getting into an out-of-memory page-thrashing state, but did not entirely prevent it from happening the way a proper memory limit would. Anecdotally, it did take more iterations than it used to for the failure to trigger.

This is expected. dask/distributed#6189 was just a heuristic; heuristics sometimes fail.
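
(For what it's worth, a hard limit of the sort mentioned above can be approximated with distributed's worker memory thresholds; this is a minimal sketch with illustrative fractions, not recommended values.)

```python
import dask

# Sketch: tighten distributed's worker memory thresholds so workers spill, pause,
# and get restarted by the nanny before the OS starts page-thrashing.
# The config keys exist in distributed; the fractions here are illustrative only.
dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # stop running new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})
```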

@gjoseph92 gjoseph92 mentioned this pull request Aug 4, 2022
@jrbourbeau
Member

cc @fjetter for visibility. I believe @gjoseph92 is saying that we previously thought we had fixed this locking behavior, but it turns out we just decreased the likelihood of it failing. Just wanted to put this on your radar so we can figure out how we want to proceed.

@fjetter
Member

fjetter commented Aug 10, 2022

We knew that the fundamental root cause cannot be fixed on the dask side, but we were hoping the likelihood of failure was sufficiently reduced.
If this causes too many failures and is too noisy, we need to wait for a fix on the platform side and can keep skipping the test.

@ntabris
Member

ntabris commented Aug 10, 2022

“wait for a fix on platform side”

Is there more info somewhere about what would need to be fixed on the platform side?

@fjetter
Member

fjetter commented Aug 11, 2022
