Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't repartition ProjectionExec when it does not compute anything #4968

Closed
alamb opened this issue Jan 18, 2023 · 2 comments · Fixed by #5074
Closed

Don't repartition ProjectionExec when it does not compute anything #4968

alamb opened this issue Jan 18, 2023 · 2 comments · Fixed by #5074
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers optimizer Optimizer rules

Comments

@alamb
Copy link
Contributor

alamb commented Jan 18, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
ProjectionExec can either have computations like (col1 + col2) or it can be used to reorder / rename the columns

The first use case benefits from repartitioning (as then the calculation can be done in multiple cores)

The second use case (ordering) does not benefit from partitioning as it is simply a bookkeeping arrangement

Basically we have a plan like

ProjectionExec: expr=[f@0 as f]
  DeduplicateExec: [tag@1 ASC,time@2 ASC]
    SortPreservingMergeExec: [tag@1 ASC,time@2 ASC]
      UnionExec

That is then optimized by https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_optimizer/repartition.rs to repartition before the projection

ProjectionExec: expr=[f@0 as f]
  RepartitionExec: partitioning=RoundRobinBatch(4) <-- This repartition node is likely worthless
    DeduplicateExec: [tag@1 ASC,time@2 ASC]
      SortPreservingMergeExec: [tag@1 ASC,time@2 ASC]
        UnionExec

Describe the solution you'd like
This I think ProjectionExec should only "benefit from partitioning" when its partition expressions actually have calculations (aka are not just columns / aliases)

This would like defining benefits_from_input_partitioning
https://github.com/apache/arrow-datafusion/blob/906896b7c59ff14d71b3056ec4349274cf6662af/datafusion/core/src/physical_plan/mod.rs#L176-L183

For impl ExecutionPlan for ProjectionExec: https://github.com/apache/arrow-datafusion/blob/906896b7c59ff14d71b3056ec4349274cf6662af/datafusion/core/src/physical_plan/projection.rs#L151

So that it returned true only if there were expressions that had non column references / aliases

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context

I think this is a good first issue as the code and desire is fairly straightforward and this would largely be an exercise in updating tests I suspect

@alamb alamb added enhancement New feature or request good first issue Good for newcomers optimizer Optimizer rules labels Jan 18, 2023
@xiaoyong-z
Copy link
Contributor

I'm intersted on this issue, please assign it to me if no one else is already working on it!

@alamb
Copy link
Contributor Author

alamb commented Jan 20, 2023

Thanks @xiaoyong-z -- I don't know anyone else working on it. I have assigned it to you

A good way to start might be to make the change to ProjectionExec and see what tests in repartition fail

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers optimizer Optimizer rules
Projects
None yet
2 participants