Revert skip of deadlock test [DNM yet] #212
base: main
Conversation
This is still failing, but I'm not sure whether the changes from the PR that fixes this error are available in the nightlies yet. I'll wait until tomorrow and re-trigger CI.
So it looks like we are still having intermittent errors, but we moved from a TimeoutError to a CancelledError.
New CI failure: https://github.com/coiled/coiled-runtime/runs/7452563450?check_suite_focus=true
Previous CI failure: https://github.com/coiled/coiled-runtime/runs/7172469972?check_suite_focus=true#step:6:289
@gjoseph92 and @hendrikmakait, how would you like to proceed here?
I'm not sure. Maybe this is a new thing. It would probably be best to just run the test manually and watch what it does. It's hard to tell just from logs.
This might also be the actual deadlock happening again. Briefly looking at the cluster dump, 50% of the stuck tasks are on a worker that had become unresponsive and that the scheduler couldn't talk to anymore, a typical symptom of thrashing from page faults.
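For reference, a minimal sketch of how such a cluster dump can be captured via the distributed client API; the scheduler address and output filename below are placeholders, not the exact setup used here:

```python
from distributed import Client

# Connect to the cluster under test (placeholder address).
client = Client("tcp://scheduler-address:8786")

# Write scheduler and worker state (tasks, stories, call stacks) to disk
# so a hung run can be inspected offline.
client.dump_cluster_state(filename="deadlock-dump", format="msgpack")
```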
@fjetter How do you want to go about this? We XFAIL this test because CI was failing intermittently but quite regularly; there was a proposed change, but we only got a different error message.
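For context, a rough sketch of the pytest markers being discussed; the test name and reason string are illustrative, not the exact code in coiled-runtime:

```python
import pytest

# Skipping avoids running the test at all; xfail keeps it running in CI
# but records the failure instead of breaking the build.
@pytest.mark.xfail(reason="intermittent deadlock, see coiled-runtime#166", strict=False)
def test_deadlock():
    ...
```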
I reran the test manually and watched the dashboard. It failed twice and passed once. The dashboard looks exactly the same as what we were seeing in dask/distributed#6110 (comment): workers have been unresponsive for multiple minutes, and clicking on their call stacks just hangs. It's quite clear that this is just dask/distributed#6177. Most likely, the fix for dask/distributed#6177 just reduced the probability of getting into an out-of-memory page-thrashing state, but did not entirely prevent it the way a proper memory limit would. Anecdotally, it did take more iterations than before to trigger the failure. That is expected: dask/distributed#6189 was just a heuristic, and heuristics sometimes fail.
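As an illustration of the "proper memory limit" point: a hard per-worker limit lets the nanny restart a worker before the host starts page-thrashing. The values below are placeholders, not the coiled-runtime configuration:

```python
import dask
from distributed import LocalCluster

# Terminate a worker once it exceeds 95% of its memory limit
# (enforced by distributed's nanny), rather than letting the OS swap.
dask.config.set({"distributed.worker.memory.terminate": 0.95})

# Each worker gets an explicit memory budget.
cluster = LocalCluster(n_workers=4, memory_limit="4GiB")
```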
cc @fjetter for visibility. I believe @gjoseph92 is saying that we previously thought we had fixed this locking behavior, but it turns out we just decreased the likelihood of it failing. Just wanted to put this on your radar so we can figure out how we want to proceed.
We knew that the fundamental root cause cannot be fixed on the dask side, but we were hoping the likelihood of failure was sufficiently reduced.
Is there more info somewhere about what would need to be fixed on the platform side?
Reverting skip after #166 (comment)