-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make TableProvider.scan() and PhysicalPlanner::create_physical_plan() async #1013
Conversation
fn create_initial_plan<'a>( | ||
&'a self, | ||
logical_plan: &'a LogicalPlan, | ||
ctx_state: &'a ExecutionContextState, | ||
) -> BoxFuture<'a, Result<Arc<dyn ExecutionPlan>>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows as a big change but only because the function body needs to be wrapped into a BoxFuture
with async { .. }.boxed()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't this function signature be made async fn create_initial_plan....
? Why does it need a boxed future?
FWIW I found it easier to understand the diffs without whitespace: https://github.com/apache/arrow-datafusion/pull/1013/files?w=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of async's internals: https://rust-lang.github.io/async-book/07_workarounds/04_recursion.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see -- so that it can recursively call itself. 👍
I would recommend adding another rationale for this change which would be to "Support using DataFusion in systems where creating the scan requires network access (to, for example, find a list of files on some object store that match, or making a remote RPC call)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rdettai -- I only skimmed the changes to ballista -- I looked carefully at the changes to DataFusion. I really
like the change to make TableProvider.scan()
async as it opens up a host of ways to plan that can use network / other async resources
I think we should remove the changes in parallelism for Extension nodes and Union nodes from this PR as they appear unrelated and I am less sure they are good change
I really like this change and I think it brings a lot of flexibility to TableProviders
However, given it is a non trivial API change, I think we should get buy in from other parties as well
fn create_initial_plan<'a>( | ||
&'a self, | ||
logical_plan: &'a LogicalPlan, | ||
ctx_state: &'a ExecutionContextState, | ||
) -> BoxFuture<'a, Result<Arc<dyn ExecutionPlan>>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't this function signature be made async fn create_initial_plan....
? Why does it need a boxed future?
FWIW I found it easier to understand the diffs without whitespace: https://github.com/apache/arrow-datafusion/pull/1013/files?w=1
ballista/rust/client/src/context.rs
Outdated
let execution_plan = ctx.create_physical_plan(&plan).await?; | ||
ctx.register_table( | ||
TableReference::Bare { table: name }, | ||
Arc::new(DfTableAdapter::new(plan, execution_plan)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel BallistaContext.sql(&str)
should not become async. Opening #1016 to track that.
This looks like a useful change to me! I think we can benefit from the Tokio runtime for doing remote calls. |
fn create_initial_plan<'a>( | ||
&'a self, | ||
logical_plan: &'a LogicalPlan, | ||
ctx_state: &'a ExecutionContextState, | ||
) -> BoxFuture<'a, Result<Arc<dyn ExecutionPlan>>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see -- so that it can recursively call itself. 👍
It is great to bring async to the planning phase. The changes are neat. I've tried to prototype this kind of staff before and find this easy to follow. Great job @rdettai 👍 |
If we close #1028 first, I can rebase on that and avoid making |
@rdettai you can rebase now :) |
// be written to. As for eventual modifications that would be applied to the | ||
// original state after it has been cloned, they will not be picked up by the | ||
// clone but that is okay, as it is equivalent to postponing the state update | ||
// by keeping the lock until the end of the function scope. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 thanks for the clear write up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rdettai for this great improvement. The change look good to me, good to go after we resolve the conflicts.
Any concerns with this change @Dandandan / @houqp ? If no one has any concerns, I would like to merge this as soon as it gets rebased and CI is passing Thanks again @rdettai |
No concern for me, +1 on merging it after CI is passing. |
Also +1 from me |
Rebase on #1016 allowed to remove async from
|
Thanks again @rdettai -- I think this is a great step forward towards supporting more object store / distributed planning systems with DataFusion ❤️ 🚀 |
Thank you all for your review time! The next step of the journey is #1010. It not the support of the object store yet, but that will also come soon ! 😉 |
Which issue does this PR close?
Closes #1012.
Rationale for this change
What changes are included in this PR?
#[async_trait]
,async fn
and.await
annotationsBoxFuture
for recursive async functionsExecutionContextState
to avoid making itSend
and propagate async to far in the APIAre there any user-facing changes?
API changes are relatively limited:
scan
asyncExecutionContext.create_physical_plan(&LogicalPlan)
is now async and thus needs to be.await
ed-> Avoid creating ExecutionPlan for registered tables in Ballista #1016 fixed thatBallistaContext.sql(&str)
is now async and thus needs to be.await
ed