Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Create datafusion-distributed crate with shuffle reader/writer #11070

Closed
wants to merge 2 commits into from

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Jun 22, 2024

Which issue does this PR close?

N/A

Rationale for this change

DataFusion is a great foundation for distributed systems, so let's make it easier to build distributed systems with DataFusion.

What changes are included in this PR?

New datafusion-distributed crate containing the shuffle reader and writer from Ballista. I would also like to add the distributed query planner into this crate in a future PR.

We could also consider making improvements to the shuffle mechanism based on work happening in Comet.

Are these changes tested?

There will be unit tests once this PR is ready for review.

Are there any user-facing changes?

No

@andygrove
Copy link
Member Author

@thinkharderdev @Dandandan @avantgardnerio Just fyi and wanted to get your opinion on whether this is useful for you

@thinkharderdev
Copy link
Contributor

thinkharderdev commented Jun 22, 2024

@thinkharderdev @Dandandan @avantgardnerio Just fyi and wanted to get your opinion on whether this is useful for you

This seems like a good idea although I'm not sure we would use it directly as we have some fairly specific customizations we've added to ShuffleReader/ShuffleWriter internally

Edit: Adding the distributed scheduler to this create would be great though and something we'd definitely be interested in using and contributing to, especially if if can abstract out the concrete implementation of actually shuffling data between stages

@andygrove
Copy link
Member Author

Edit: Adding the distributed scheduler to this create would be great though and something we'd definitely be interested in using and contributing to, especially if if can abstract out the concrete implementation of actually shuffling data between stages

I am also interested in this type of abstraction. I was thinking along the lines of having the planner insert QueryStageExec instead of ShuffleWriterExec. QueryStageExec would not have any execution logic. Responsibility for execution can be left to users of this planner.

@alamb
Copy link
Contributor

alamb commented Jun 23, 2024

FWIW we (InfluxData) would likely (never) end up using the shuffle reader / writer, nor a distributed query planner (we would have our own).

From my perspective if there is more than one user of this code (e.g more than Ballista) then it makes sense to put it in datafusion. If there is realistically only one user of this code it probably doesn't belong here

@andygrove
Copy link
Member Author

From my perspective if there is more than one user of this code (e.g more than Ballista) then it makes sense to put it in datafusion. If there is realistically only one user of this code it probably doesn't belong here

That makes sense, but there is also a chicken and egg situation with this. Let's see if anyone comments in favor of this, and if not, I will close the PR.

@andygrove andygrove closed this Jun 27, 2024
@Epicism
Copy link

Epicism commented Jul 21, 2024

I'm late to the party, but this would be helpful to build distributed services based on different Datafusion variants (e.g., LanceDB vs vanilla Datafusion.).

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

Another potential option is to make a datafusion-shuffle repo in https://github.com/datafusion-contrib that we could prototype / collaborate what this would look like (and is lower overhead / commitment than a new crate in the core)

We could revisit the question of bringing the shuffle into the core once the exact shape and scope of the code was known

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants