Share your experiences with P2P shuffling #7509

hendrikmakait · 2023-01-27T20:34:27Z

hendrikmakait
Jan 27, 2023
Collaborator

Are you struggling with shuffling workloads that kill your scheduler or run out of memory? Please try out P2P shuffling and let us know how it works for you!

In https://www.coiled.io/blog/better-shuffling-in-dask-a-proof-of-concept, we presented a new approach to DataFrame shuffling that significantly reduced task overhead and allowed shuffling much larger datasets. Since then, we have taken this prototype, called P2P shuffling, made it more stable, and removed potential deadlocks. Now, it is ready to be tried out by the community as an experimental feature.

What is shuffling and when is it used?

When datasets need to be sorted or joined with other very large datasets this requires a full dataset communication. Every input chunk of data needs to contribute a little bit to every output chunk of data. This operation is commonly called a shuffle.

Source: https://www.coiled.io/blog/dask-as-a-spark-replacement

Shuffling is a key part of many dataframe workloads. Among others, set_index, sort, merge, aggregate, shuffle, groupby(..., split_out>1), and groupby(...).apply are common operations that are (sometimes) backed by shuffling.

Note that, with rechunking, a shuffle-like operation is also used in array workloads, but P2P shuffling is currently limited to dataframes. If you want to follow the implementation effort for array rechunking and further improvements to the existing P2P algorithm, subscribe to #7507.

How do I use P2P shuffling?

To use the P2P shuffle implementation, you can use it either by (temporarily) setting Dask configuration

with dask.config.set(shuffle='p2p'):
    ddf.shuffle()

or explicitly setting the shuffle keyword.

ddf.shuffle(shuffle="p2p")

As mentioned above, shuffling is key to many common operations in dataframe workloads such as set_index, sort, merge, aggregate, groupby(.., split_out>1), and groupby(...).apply. These operations can also benefit from using the P2P shuffling implementation and can be configured analogously to the shuffle operation by using the context manager

with dask.config.set(shuffle="p2p"):
    ddf.groupby(...).apply(...)

or the shuffle keyword argument.

ddf.set_index(column, shuffle='p2p')

What are the benefits?

P2P shuffling significantly reduces the size of the task graph from O(N²) to O(N) which allows you to run much larger shuffles that would have previously crashed the scheduler. Additionally, it may significantly improve resource usage resulting in a smaller memory footprint or a higher network throughput:

What are the downsides?

Most importantly, P2P shuffling relies on disk storage to store intermediate results. If you do not have sufficient (or any) disk space available, P2P shuffling will not work for you.

Furthermore, P2P shuffling does not support all dtypes available with tasks-based shuffling such as object, complex numbers, or sparse data.

While we have put significant effort into testing P2P shuffling (coiled/coiled-runtime/#569) and removing potential deadlocks (#7326 ), it lacks widespread user testing in a broader set of use cases. Balancing resource usage is a delicate task, and the current implementation may not work for every task. This is why we would love to hear from you about the performance improvements or degradations you encounter when using P2P shuffling.

Apart from that, there are some known caveats:

When queueing is enabled, dataframe.merge may result in higher memory usage than without queueing (#7496). However, it is still performing better than tasks-based shuffling.
When forgetting a shuffle on the client and re-submitting it too fast, the re-submitted execution may fail due to #7489. This situation should arise in an interactive session and can be fixed by waiting a bit to allow canceled tasks to finish on the worker and submitting the workload yet again.

Which version of `distributed` should I use to try P2P shuffling?

To use P2P shuffling, make sure to install distributed directly from Github using

pip install git+https://github.com/dask/distributed@main

to benefit from the latest improvements. You will also need to install pyarrow>=7 which is used under the hood.

I have tried out P2P shuffling, now what?

Please share your experience by replying to this discussion (in a new thread)! We would love to hear what worked and what didn't. To get the most value out of your experience, we ask you to include your cluster size (# workers, worker CPU/memory) as well as your data size (total size, # partitions). Performance reports, memory samples, or dashboard screenshots are also greatly appreciated!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Share your experiences with P2P shuffling #7509

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Share your experiences with P2P shuffling #7509

hendrikmakait Jan 27, 2023 Collaborator

What is shuffling and when is it used?

How do I use P2P shuffling?

What are the benefits?

What are the downsides?

Which version of distributed should I use to try P2P shuffling?

I have tried out P2P shuffling, now what?

Replies: 0 comments

hendrikmakait
Jan 27, 2023
Collaborator

Which version of `distributed` should I use to try P2P shuffling?