feat: Create datafusion-distributed crate with shuffle reader/writer #11070
@thinkharderdev @Dandandan @avantgardnerio Just FYI. I wanted to get your opinion on whether this is useful for you.
This seems like a good idea, although I'm not sure we would use it directly, as we have some fairly specific customizations we've added to Edit: Adding the distributed scheduler to this crate would be great, though, and something we'd definitely be interested in using and contributing to, especially if it can abstract out the concrete implementation of actually shuffling data between stages.
I am also interested in this type of abstraction. I was thinking along the lines of having the planner insert
FWIW, we (InfluxData) would likely never end up using the shuffle reader / writer, nor a distributed query planner (we would have our own). From my perspective, if there is more than one user of this code (e.g., more than just Ballista), then it makes sense to put it in DataFusion. If there is realistically only one user of this code, it probably doesn't belong here.
That makes sense, but there is also a chicken-and-egg situation with this. Let's see if anyone comments in favor of this, and if not, I will close the PR.
I'm late to the party, but this would be helpful for building distributed services based on different DataFusion variants (e.g., LanceDB vs. vanilla DataFusion).
Another potential option is to make a We could revisit the question of bringing the shuffle into the core once the exact shape and scope of the code is known.
Which issue does this PR close?
N/A
Rationale for this change
DataFusion is a great foundation for distributed systems, so let's make it easier to build distributed systems with DataFusion.
What changes are included in this PR?
New datafusion-distributed crate containing the shuffle reader and writer from Ballista. I would also like to add the distributed query planner into this crate in a future PR. We could also consider making improvements to the shuffle mechanism based on work happening in Comet.
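For readers unfamiliar with the term, here is a minimal sketch of what a shuffle writer and reader do conceptually. It is not the Ballista code moved in this PR: the `Row` type and the functions `partition_for`, `shuffle_write`, and `shuffle_read` are hypothetical names chosen for illustration, and the sketch assumes in-memory tuples and simple hash partitioning rather than Arrow record batches.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical row type for illustration: (key, value).
type Row = (String, i64);

/// Choose the downstream partition for a key by hashing it.
fn partition_for(key: &str, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// "Shuffle write": one upstream task splits its rows into one bucket
/// per downstream partition.
fn shuffle_write(rows: Vec<Row>, num_partitions: usize) -> Vec<Vec<Row>> {
    let mut buckets = vec![Vec::new(); num_partitions];
    for row in rows {
        let p = partition_for(&row.0, num_partitions);
        buckets[p].push(row);
    }
    buckets
}

/// "Shuffle read": one downstream task gathers bucket `partition`
/// from the output of every upstream task.
fn shuffle_read(upstream_outputs: &[Vec<Vec<Row>>], partition: usize) -> Vec<Row> {
    upstream_outputs
        .iter()
        .flat_map(|buckets| buckets[partition].iter().cloned())
        .collect()
}
```

Because every upstream task uses the same hash function, all rows with a given key end up in the same downstream partition, which is what lets a later stage aggregate or join on that key without seeing the rest of the data.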
Are these changes tested?
There will be unit tests once this PR is ready for review.
Are there any user-facing changes?
No