Share your experiences with P2P shuffling #7509
hendrikmakait asked this question in Show and tell
Are you struggling with shuffling workloads that kill your scheduler or run out of memory? Please try out P2P shuffling and let us know how it works for you!
In https://www.coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept, we presented a new approach to DataFrame shuffling that significantly reduced task overhead and allowed shuffling much larger datasets. Since then, we have taken this prototype, called P2P shuffling, made it more stable, and removed potential deadlocks. Now, it is ready to be tried out by the community as an experimental feature.
What is shuffling and when is it used?
Source: https://www.coiled.io/blog/dask-as-a-spark-replacement
Shuffling is a key part of many dataframe workloads. Among others, `set_index`, `sort`, `merge`, `aggregate`, `shuffle`, `groupby(..., split_out>1)`, and `groupby(...).apply` are common operations that are (sometimes) backed by shuffling. Note that, with rechunking, a shuffle-like operation is also used in array workloads, but P2P shuffling is currently limited to dataframes. If you want to follow the implementation effort for array rechunking and further improvements to the existing P2P algorithm, subscribe to #7507.

How do I use P2P shuffling?
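The snippets for both opt-in mechanisms were lost from this page, so here is a minimal sketch. Note that the config key (`dataframe.shuffle.method`) and the `"p2p"` keyword value are assumptions based on recent Dask versions and may differ in yours; the shuffle calls themselves are shown as comments because P2P needs a running distributed cluster:

```python
import dask

# Option 1: (temporarily) enable P2P shuffling via the Dask configuration.
# NOTE: the config key below is an assumption ("dataframe.shuffle.method"
# in recent Dask versions); check the documentation for your version.
with dask.config.set({"dataframe.shuffle.method": "p2p"}):
    assert dask.config.get("dataframe.shuffle.method") == "p2p"
    # Shuffle-backed operations created here would use P2P, e.g.:
    # shuffled = ddf.shuffle("x")

# Option 2: opt in per call via the keyword, e.g.:
# shuffled = ddf.shuffle("x", shuffle="p2p")
```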
To use the P2P shuffle implementation, you can either (temporarily) set the Dask configuration or explicitly pass the `shuffle` keyword. As mentioned above, shuffling is key to many common operations in dataframe workloads such as `set_index`, `sort`, `merge`, `aggregate`, `groupby(..., split_out>1)`, and `groupby(...).apply`. These operations can also benefit from the P2P shuffle implementation and can be configured analogously to the `shuffle` operation, using either the context manager or the `shuffle` keyword argument.

What are the benefits?
P2P shuffling significantly reduces the size of the task graph from O(N²) to O(N), which allows you to run much larger shuffles that would previously have crashed the scheduler. Additionally, it may significantly improve resource usage, resulting in a smaller memory footprint or higher network throughput.
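To put the graph-size reduction in perspective, here is some illustrative arithmetic based only on the O(N²) vs. O(N) scaling stated above (actual task counts depend on the workload):

```python
# Illustrative only: graph sizes implied by the stated scaling
# for a shuffle over N input partitions.
N = 10_000                   # number of input partitions
tasks_based_size = N * N     # O(N^2): on the order of 100,000,000 tasks
p2p_size = N                 # O(N):   on the order of 10,000 tasks

# The gap grows linearly with the number of partitions.
assert tasks_based_size // p2p_size == N
```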
What are the downsides?
Most importantly, P2P shuffling relies on disk storage to store intermediate results. If you do not have sufficient (or any) disk space available, P2P shuffling will not work for you.
Furthermore, P2P shuffling does not support all dtypes available with `tasks`-based shuffling, such as `object`, `complex` numbers, or `sparse` data.

While we have put significant effort into testing P2P shuffling (coiled/coiled-runtime#569) and removing potential deadlocks (#7326), it lacks widespread user testing across a broader set of use cases. Balancing resource usage is a delicate task, and the current implementation may not work for every workload. This is why we would love to hear from you about the performance improvements or degradations you encounter when using P2P shuffling.
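If you are unsure whether your data is affected by the dtype limitations above, a quick pandas-side check might look like this (a sketch; this helper is not part of Dask, and the unsupported set shown covers only the `object` and `complex` cases named above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ok_int": [1, 2],
    "ok_float": [0.5, 1.5],
    "bad_obj": ["a", {"nested": 1}],   # object dtype
    "bad_complex": [1 + 2j, 3 + 4j],   # complex numbers
})

# Flag columns whose dtypes P2P shuffling does not support,
# per the limitations described above.
unsupported = [
    col for col, dtype in df.dtypes.items()
    if dtype == object or np.issubdtype(dtype, np.complexfloating)
]
assert unsupported == ["bad_obj", "bad_complex"]
```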
Apart from that, there are some known caveats:
- When queueing is enabled, `dataframe.merge` may result in higher memory usage than without queueing (#7496). However, it still performs better than `tasks`-based shuffling.

Which version of `distributed` should I use to try P2P shuffling?
To use P2P shuffling, make sure to install `distributed` directly from GitHub (e.g., `pip install git+https://github.com/dask/distributed`) to benefit from the latest improvements. You will also need to install `pyarrow>=7`, which is used under the hood.

I have tried out P2P shuffling, now what?
Please share your experience by replying to this discussion (in a new thread)! We would love to hear what worked and what didn't. To get the most value out of your experience, we ask you to include your cluster size (# workers, worker CPU/memory) as well as your data size (total size, # partitions). Performance reports, memory samples, or dashboard screenshots are also greatly appreciated!