Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement physical execution of uncorrelated scalar subqueries #3781

Open
Tracked by #822
andygrove opened this issue Oct 10, 2022 · 5 comments
Open
Tracked by #822

Implement physical execution of uncorrelated scalar subqueries #3781

andygrove opened this issue Oct 10, 2022 · 5 comments
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@andygrove
Copy link
Member

andygrove commented Oct 10, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We currently support uncorrelated scalar subqueries by translating them into a cross-join. It would likely be more efficient to execute the subquery and update the original plan with the scalar value.

We also need to do this so we can throw an error if more than 1 row is returned and there is no way to do this in the logical plan.

Describe the solution you'd like

The optimizer will need some kind of trait that execution engines (DataFusion, Ballista, Dask SQL, etc) can implement for resolving scalar values.

trait OptimizerContext {
  fn supports_scalar_subquery_execution() -> bool;
  fn execute_scalar_subquery(plan: &LogicalPlan) -> Result<ScalarValue>;
}

Describe alternatives you've considered

Additional context

@andygrove andygrove added enhancement New feature or request performance Make DataFusion faster labels Oct 10, 2022
@Dandandan
Copy link
Contributor

Dandandan commented Oct 10, 2022

We can also remove the restriction on the conversion that the subquery should contain an aggregate, and perform the check of being scalar in the physical node.

@andygrove
Copy link
Member Author

andygrove commented Oct 10, 2022

We can also remove the restriction on the conversion that the subquery should contain an aggregate, and perform the check of being scalar in the physical node.

Related issue: #3725

@xudong963
Copy link
Member

We can just change it to left join, and add logic to check if more than 1 row is returned in left join.

@Dandandan
Copy link
Contributor

We can just change it to left join, and add logic to check if more than 1 row is returned in left join.

I don't think that's possible. A cross join is used as that doesn't require a join condition. The cross join is less efficient however, as it repeats the value for the scalar as much as there are values on the left side, and then uses that to filter on (which is less efficient than using a scalar in the filter).

@xudong963
Copy link
Member

A cross join is used as that doesn't require a join condition

Oh, my mistake. The tile is uncorrelated scalar subquery.

For correlated scalar subquery, we can change it to left join.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

3 participants