Re-implement DataFrame.write_* to use LogicalPlan::Write #5076
Comments
We can also support
properties, i.e support on |
I think we should stay true to the design goal of DataFusion and keep this functionality as modular as possible (aka implemented in terms of traits that can be extended by other systems). Here are some ideas:

Idea: Add a physical plan for LogicalPlan::DML (I think this is what @andygrove is suggesting). This would add a way to create a physical plan for

The upside here is that we already have all the flow and planner, and it would follow the pattern of systems like Spark (e.g. DataWritingCommandExec, thanks to @metesynnada for the link).

The downsides are that such an ExecutionPlan is kind of strange (it makes no output, so most of the methods like "output ordering" are basically useless), as I mentioned on #6049 (review).

Idea: Add specific runner / executor for Inserts / Updates / Deletes. Maybe we could provide a function or struct

Here is how you might run it:

    let runner = Insert::new(context)
        .target(my_table)
        .run(target)?

A benefit here is that only systems that wanted to handle DML would invoke the inserter. A downside is that it would require more code / connections to work.

Maybe @avantgardnerio has some thoughts in this area, as I think he has a system that does DML as well based on DataFusion.
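To make the second idea more concrete, here is a minimal sketch of what a standalone insert runner could look like. Everything in it is hypothetical and not part of DataFusion: the `InsertTarget` trait, the `MemTable` type, and the `Batch` alias (a stand-in for Arrow record batches) are invented for illustration, and the `context` argument from the snippet above is left out to keep the example self-contained.

```rust
/// A batch of rows; a stand-in for Arrow record batches (assumption).
type Batch = Vec<i64>;

/// Hypothetical extension point: anything that can accept inserted rows.
trait InsertTarget {
    /// Appends the batch and returns the number of rows written.
    fn insert(&mut self, batch: Batch) -> Result<usize, String>;
}

/// Toy in-memory table used as the insert target.
#[derive(Default)]
struct MemTable {
    rows: Vec<i64>,
}

impl InsertTarget for MemTable {
    fn insert(&mut self, batch: Batch) -> Result<usize, String> {
        let n = batch.len();
        self.rows.extend(batch);
        Ok(n)
    }
}

/// Hypothetical runner: drives batches into a target outside the
/// ExecutionPlan machinery, so only callers that need DML use it.
struct Insert<'a> {
    target: Option<&'a mut dyn InsertTarget>,
}

impl<'a> Insert<'a> {
    fn new() -> Self {
        Insert { target: None }
    }

    /// Sets the table (or other sink) that will receive the rows.
    fn target(mut self, target: &'a mut dyn InsertTarget) -> Self {
        self.target = Some(target);
        self
    }

    /// Consumes the input batches and returns the total row count.
    fn run(self, input: impl IntoIterator<Item = Batch>) -> Result<usize, String> {
        let target = self.target.ok_or_else(|| "no target set".to_string())?;
        let mut total = 0;
        for batch in input {
            total += target.insert(batch)?;
        }
        Ok(total)
    }
}
```

A caller would write something like `Insert::new().target(&mut table).run(batches)?`. The trade-off mentioned above shows up directly: the runner is easy to ignore for systems that do not need DML, but it sits outside the planner, so the caller has to wire the input and the target together by hand.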
In addition to the points re: sort order, etc., from a scheduling / partitioning standpoint returning an ExecutionPlan is perhaps not ideal.
I'm likely missing something but does this even need to be a separate abstraction, is this not an implementation detail of The only thing I can possibly think of is |
I think something needs to call It is a very reasonable question about how much code that requires and if it should be done in DataFusion or outside
Tbh I'm more concerned that we provide sufficient API extensibility for different Edit: #6109 extracts some common logic to this end |
I find Idea 1 reasonable. If we think of I am curious to hear what @andygrove thinks about @alamb's points, though. |
I think it's useful to have write operations be able to produce outputs. In many use cases you actually want metadata about what was written (file locations, total bytes, etc). This seems like a natural way to, for example, update a catalog with the new table. |
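As a rough illustration of the kind of output a write could produce, the sketch below folds per-file results into a single summary. The `WrittenFile` and `WriteSummary` types and their fields are hypothetical names chosen for this example; they are not DataFusion types.

```rust
/// Hypothetical record of what a single writer task produced.
#[derive(Debug, Clone, PartialEq)]
struct WrittenFile {
    path: String,
    rows: u64,
    bytes: u64,
}

/// Hypothetical aggregate a write operator could hand back to the caller.
#[derive(Debug, Default, PartialEq)]
struct WriteSummary {
    files: Vec<WrittenFile>,
    total_rows: u64,
    total_bytes: u64,
}

impl WriteSummary {
    /// Folds per-partition results into one summary, e.g. as input to a
    /// catalog update.
    fn from_files(files: Vec<WrittenFile>) -> Self {
        let total_rows: u64 = files.iter().map(|f| f.rows).sum();
        let total_bytes: u64 = files.iter().map(|f| f.bytes).sum();
        WriteSummary { files, total_rows, total_bytes }
    }
}
```

With a result shaped like this, the code that registers the new table has the file locations and sizes available without re-listing the output directory.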
I don't have a strong opinion here, since I haven't had much time to look in depth and we fork execution at the LogicalPlan level before it ever gets to the physical plan anyway. I do see how adding support for DML in TableProvider and PhysicalExec would be the next logical step. I would like to weigh in and say if we do that:
I'm possibly too influenced by Spark's approach, but option 1 seems to work well. For example, Ballista has a That said, I have stronger opinions about including the logical plan representation than I do about how this is implemented in the physical plan. |
@andygrove, can you take a look at #6049? It implements an |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We have DataFrame methods such as write_parquet that create a physical plan and execute it, then write the results to disk. This was implemented this way because, at the time, there was no "write" concept in the logical plan.
We now have LogicalPlan::Dml for Insert/Update/Delete, but I think we need a more generic LogicalPlan::Write operation, as well as a physical operator to perform the write. This is a common pattern for ETL tasks that execute a query, perform a transformation, and then write the results to disk.
Describe the solution you'd like
As described.
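The shape of the proposal can be sketched with toy types. The enums and functions below are simplified stand-ins invented for this example (they are not the real DataFusion `LogicalPlan` or `ExecutionPlan`), and an in-memory map plays the role of the disk. The point is only that a write becomes a node in the logical plan, the planner maps it to a physical write operator, and a `write_*`-style helper is a thin wrapper over that path.

```rust
use std::collections::HashMap;

/// Toy logical plan; only the shape matters for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    /// Produces the given rows (stand-in for a scan or query).
    Values(Vec<i64>),
    /// Hypothetical generic write node wrapping an input plan.
    Write { input: Box<LogicalPlan>, path: String },
}

/// Toy physical operators produced by the planner.
#[derive(Debug, PartialEq)]
enum PhysicalPlan {
    Values(Vec<i64>),
    WriteExec { input: Box<PhysicalPlan>, path: String },
}

/// Maps each logical node to a physical operator, so writes go through
/// the same planning path as queries.
fn create_physical_plan(logical: &LogicalPlan) -> PhysicalPlan {
    match logical {
        LogicalPlan::Values(rows) => PhysicalPlan::Values(rows.clone()),
        LogicalPlan::Write { input, path } => PhysicalPlan::WriteExec {
            input: Box::new(create_physical_plan(input)),
            path: path.clone(),
        },
    }
}

/// Executes a plan against an in-memory "disk" keyed by path.
/// A write returns the written row count rather than the rows themselves.
fn execute(plan: &PhysicalPlan, disk: &mut HashMap<String, Vec<i64>>) -> Vec<i64> {
    match plan {
        PhysicalPlan::Values(rows) => rows.clone(),
        PhysicalPlan::WriteExec { input, path } => {
            let rows = execute(input, disk);
            let count = rows.len() as i64;
            disk.insert(path.clone(), rows);
            vec![count]
        }
    }
}

/// Hypothetical DataFrame-style helper: wrap the input in a Write node,
/// then plan and run it like any other query.
fn write_rows(
    input: LogicalPlan,
    path: &str,
    disk: &mut HashMap<String, Vec<i64>>,
) -> Vec<i64> {
    let logical = LogicalPlan::Write {
        input: Box::new(input),
        path: path.to_string(),
    };
    execute(&create_physical_plan(&logical), disk)
}
```

Returning a row count from the write operator is a design choice made for this sketch only; the thread discusses whether a write operator should produce output at all.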
Describe alternatives you've considered
Additional context