Ballista should serialize Parquet statistics #14

Open
andygrove opened this issue Aug 13, 2021 · 1 comment
Labels
enhancement New feature or request

Comments

@andygrove
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When the Ballista scheduler or executor deserializes a ParquetExec, it collects the Parquet statistics again, which is redundant work. We should serialize the statistics along with the plan so that they do not have to be collected a second time.

Describe the solution you'd like
Add Parquet statistics to the serde module.
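
To make the idea concrete, here is a minimal, dependency-free sketch of round-tripping statistics so the receiving side can reuse them instead of rescanning the files. Everything in it is an assumption for illustration: `ScanStatistics`, `encode_statistics` and `decode_statistics` are hypothetical names, the struct is a simplified stand-in rather than the real DataFusion statistics type, and the hand-rolled byte layout is not the format Ballista's serde module actually uses.

```rust
/// Hypothetical, simplified stand-in for the statistics a Parquet scan caches.
/// Not the actual DataFusion statistics type.
#[derive(Debug, Clone, PartialEq)]
pub struct ScanStatistics {
    pub num_rows: Option<u64>,
    pub total_byte_size: Option<u64>,
}

/// Append one optional value: a presence tag byte, then 8 little-endian bytes.
fn put_opt(out: &mut Vec<u8>, value: Option<u64>) {
    match value {
        Some(v) => {
            out.push(1);
            out.extend_from_slice(&v.to_le_bytes());
        }
        None => out.push(0),
    }
}

/// Read one optional value written by `put_opt`, advancing `pos`.
fn take_opt(input: &[u8], pos: &mut usize) -> Result<Option<u64>, String> {
    let tag = *input.get(*pos).ok_or("missing tag byte")?;
    *pos += 1;
    match tag {
        0 => Ok(None),
        1 => {
            let bytes = input.get(*pos..*pos + 8).ok_or("truncated value")?;
            let mut buf = [0u8; 8];
            buf.copy_from_slice(bytes);
            *pos += 8;
            Ok(Some(u64::from_le_bytes(buf)))
        }
        other => Err(format!("unknown tag byte {}", other)),
    }
}

/// Serialize the statistics so they can travel with the plan.
pub fn encode_statistics(stats: &ScanStatistics) -> Vec<u8> {
    let mut out = Vec::new();
    put_opt(&mut out, stats.num_rows);
    put_opt(&mut out, stats.total_byte_size);
    out
}

/// Rebuild the statistics on the receiving side without touching the files.
pub fn decode_statistics(input: &[u8]) -> Result<ScanStatistics, String> {
    let mut pos = 0;
    let num_rows = take_opt(input, &mut pos)?;
    let total_byte_size = take_opt(input, &mut pos)?;
    Ok(ScanStatistics { num_rows, total_byte_size })
}
```

Keeping each field optional means a plan whose statistics are unknown still round-trips cleanly; a real implementation would more likely add an equivalent message to the existing plan serialization rather than a custom byte format.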

Describe alternatives you've considered
N/A

Additional context
N/A

@rdettai
Contributor

rdettai commented Aug 31, 2021

In apache/datafusion#962 I am considering making the statistics part of the ExecutionPlan trait (and removing them from TableProvider). But I think that not all nodes will have a cached version of the statistics, only those nodes for which fetching them is an expensive operation and that know that they will not change.

We will probably not need the statistics on the executor, because I doubt that any re-optimization will take place there. So it might be an optimization further down the road to optionally leave them out of the serialization in that case.

@andygrove andygrove transferred this issue from apache/datafusion May 19, 2022