Adaptive downscaling causes P2P to restart #8579
Labels: adaptive (all things relating to adaptive scaling), enhancement (improve existing functionality or make things work better), shuffle
When adaptive scaling decides to retire a worker that is currently participating in a P2P shuffle, the entire shuffle is restarted. As reported in https://dask.discourse.group/t/shuffle-p2p-unstable-with-adaptive-k8s-operator/2600, this isn't a great UX.
While I'm hesitant to suggest adding even more complexity to P2P, it might make sense to think about a mechanism that lets P2P "block" workers from being retired. I'm not sure whether a hard block is generally desirable, let alone whether we can find a mechanism that allows for loose coupling, so this issue is mostly a discussion starter for now.
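
To make the discussion a bit more concrete, here is a rough sketch of what a loosely coupled "block" could look like. Nothing here is existing `distributed` API: `RetirementGuard`, `register`, `retirable`, and the `active_shuffles` mapping are all hypothetical names made up for illustration. The idea is that whatever picks retirement candidates asks a generic registry of veto callbacks, and P2P merely registers one, so neither side needs to know about the other.

```python
from typing import Callable, Iterable

# Hypothetical: a veto takes a worker address and returns True to block retirement.
Veto = Callable[[str], bool]


class RetirementGuard:
    """Hypothetical registry of veto callbacks; not part of dask.distributed."""

    def __init__(self) -> None:
        self._vetoes: dict[str, Veto] = {}

    def register(self, name: str, veto: Veto) -> None:
        # Any subsystem (P2P or otherwise) can register a veto under a name.
        self._vetoes[name] = veto

    def unregister(self, name: str) -> None:
        self._vetoes.pop(name, None)

    def retirable(self, candidates: Iterable[str]) -> list[str]:
        # Hard block: a worker is dropped from the list if any veto fires.
        return [
            worker
            for worker in candidates
            if not any(veto(worker) for veto in self._vetoes.values())
        ]


# Stand-in for P2P state (assumption): shuffle id -> participating worker addresses.
active_shuffles: dict[str, set[str]] = {
    "shuffle-1": {"tcp://a:1", "tcp://b:1"},
}


def in_active_shuffle(worker: str) -> bool:
    return any(worker in workers for workers in active_shuffles.values())


guard = RetirementGuard()
guard.register("p2p", in_active_shuffle)

# The adaptive side would filter its candidates before retiring anything.
print(guard.retirable(["tcp://a:1", "tcp://c:1"]))
```

With a hard block like this, a long-running shuffle could keep a cluster from scaling down indefinitely, which is part of why I'm unsure it is generally desirable. A softer variant might treat the veto as a preference that expires after some timeout.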