Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize count(col) using table statistics #904

Closed
Dandandan opened this issue Aug 18, 2021 · 13 comments · Fixed by #1063
Closed

Optimize count(col) using table statistics #904

Dandandan opened this issue Aug 18, 2021 · 13 comments · Fixed by #1063
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

If both the number of rows and the number of nulls are available as statistic, we can compute the value based on statistics only when reading.
row_count - null_count(row)

Describe the solution you'd like
Add an optimization rule that matches this count expression with column and use the statistics to replace it with the value.

Describe alternatives you've considered

Additional context

@Dandandan Dandandan added enhancement New feature or request performance Make DataFusion faster labels Aug 18, 2021
@Dandandan Dandandan changed the title Optimize coun(col) using table statistics Optimize count(col) using table statistics Aug 18, 2021
@matthewmturner
Copy link
Contributor

@Dandandan i wanted to look into this but saw that it was mentioned in #962 as having potential merge conflicts. Is it ok for me to start looking into this? Or should I hold off for now?

@Dandandan
Copy link
Contributor Author

Maybe best to consult @rdettai about what is timing-wise the best way to implement this particular optimization.

@rdettai
Copy link
Contributor

rdettai commented Sep 9, 2021

in #965 I have moved that optimization to the physical plan, so if the PR gets accepted, this optimization should definitively be done there instead of the logical plan

@matthewmturner
Copy link
Contributor

@rdettai ok, will wait for that then.

just for my understanding, i thought this was a new optimization rule, so how was it moved to the physical plan already?

@houqp
Copy link
Member

houqp commented Sep 10, 2021

@matthewmturner @rdettai was saying he moved all existing cost based optimizers (not this particular one) into physical plane in that PR :)

@rdettai
Copy link
Contributor

rdettai commented Sep 10, 2021

actually, If I'm not mistaken this optimization would go into the AggregateStatistics rule, which is one of the CBOs that has moved from being an OptimizerRule to a PhysicalOptimizerRule 😃

@matthewmturner
Copy link
Contributor

@rdettai @houqp thx both for info. so my understanding is that this is an optimization that has not yet been implemented, but it will ultimately belong to the AggregateStatistics rule. Once the CBO have been moved to the physical level I can work on this?

@rdettai
Copy link
Contributor

rdettai commented Sep 10, 2021

Exactly! You can already take a try at it from my PR. But you would take the risk that if it does not get merged the work has to be done again on the OptimizerRule version 🙂

@matthewmturner
Copy link
Contributor

@rdettai glad your PR got merged :) now that it is, is it a good time to try working on this?

@rdettai
Copy link
Contributor

rdettai commented Sep 17, 2021

Yes sure!

@matthewmturner
Copy link
Contributor

@Dandandan @rdettai i was going to add this to the existing take_optimizable_count rule as it seems sufficiently similar. do you agree this is right place or is a new rule needed?

@rdettai
Copy link
Contributor

rdettai commented Sep 20, 2021

I would write a separate take_optimizable_non_null_count, or something like that, but if you have a better idea to structure this part of the code feel free to propose it in a PR

@matthewmturner
Copy link
Contributor

@rdettai got it, thanks. Im still just working on getting an understanding of the existing rules. Would you be able to provide some more details on why there is a lit_expr for the count rule and its equal to &ScalarValue::UInt8(Some(1))? My naive expectation was to have something similar to the max_min where a col_expr was made and the num_rows / null_count from the ColumnStatistics could be used.

thx in advance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Make DataFusion faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants