Adapt column statistics API #717

Open
Dandandan opened this issue Jul 13, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@Dandandan
Contributor

Dandandan commented Jul 13, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While looking at adding support for more statistics to the Delta Lake TableProvider implementation, I bumped into a limitation in our statistics API.

Currently, column_statistics is an Option<Vec<ColumnStatistics>>.

https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/datasource/datasource.rs#L37

So a provider has to return the statistics at the correct index for each column, regardless of the order of the columns in the files.

Describe the solution you'd like
Either:

  • Return a HashMap<String, ColumnStatistics> rather than an Option<Vec<ColumnStatistics>>
  • Pass a Schema parameter to TableProvider::statistics so the positions of the fields can be calculated.

FWIW, Delta Lake / delta-rs takes the first approach, which seems straightforward to implement and use.
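
To illustrate how the two options relate, here is a minimal sketch of turning name-keyed statistics into a schema-ordered vector. Everything in it is an assumption made for illustration: `ColumnStatistics` is a simplified stand-in rather than the DataFusion struct, `field_names` stands in for the field names of a schema, and `align_to_schema` is a hypothetical helper, not an existing API.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a per-column statistics struct.
/// `None` means the value is unknown.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ColumnStatistics {
    pub null_count: Option<usize>,
    pub distinct_count: Option<usize>,
}

/// Reorders name-keyed statistics into schema order.
///
/// `field_names` stands in for the schema's field names, in schema order.
/// Columns with no entry in `by_name` get `ColumnStatistics::default()`,
/// i.e. all values unknown.
pub fn align_to_schema(
    field_names: &[&str],
    by_name: &HashMap<String, ColumnStatistics>,
) -> Vec<ColumnStatistics> {
    field_names
        .iter()
        .map(|name| by_name.get(*name).cloned().unwrap_or_default())
        .collect()
}
```

With a map keyed by column name, the order in which the files list their columns no longer matters; the position in the resulting vector is derived only from the schema.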

Describe alternatives you've considered

Additional context

@Dandandan Dandandan added the enhancement New feature or request label Jul 13, 2021
@Dandandan
Contributor Author

Closing, seeing this could be done with the schema on table provider instead.

@rdettai
Contributor

rdettai commented Sep 13, 2021

@Dandandan in #965 I used the schema from the ExecutionPlan trait and it worked fine. But I do agree that it might be better to come up with a data structure that helps assert that the column_statistics vector is well aligned with the schema fields vector (same size, same types...). I'm adding this as an item in #997, so if you want to close this for now that's fine by me 😃
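
As a rough sketch of what such an alignment check could look like (same size, same types), assuming simplified stand-in types: `DataType`, `ScalarValue`, `Field`, `ColumnStatistics` and `check_alignment` below are all hypothetical and defined here for illustration only; they are not the Arrow or DataFusion definitions.

```rust
/// Hypothetical, reduced set of column types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int64,
    Utf8,
}

/// Hypothetical scalar used for min/max bounds.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Int64(i64),
    Utf8(String),
}

impl ScalarValue {
    fn data_type(&self) -> DataType {
        match self {
            ScalarValue::Int64(_) => DataType::Int64,
            ScalarValue::Utf8(_) => DataType::Utf8,
        }
    }
}

/// Hypothetical stand-in for a schema field.
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

/// Hypothetical per-column statistics; `None` means unknown.
#[derive(Default)]
pub struct ColumnStatistics {
    pub min_value: Option<ScalarValue>,
    pub max_value: Option<ScalarValue>,
}

/// Checks that `stats` has one entry per field and that every known
/// min/max bound has the same type as the field at the same index.
pub fn check_alignment(fields: &[Field], stats: &[ColumnStatistics]) -> Result<(), String> {
    if fields.len() != stats.len() {
        return Err(format!(
            "expected {} column statistics, got {}",
            fields.len(),
            stats.len()
        ));
    }
    for (field, stat) in fields.iter().zip(stats) {
        // Unknown bounds (`None`) are skipped by `flatten`.
        for bound in [stat.min_value.as_ref(), stat.max_value.as_ref()].iter().flatten() {
            if bound.data_type() != field.data_type {
                return Err(format!(
                    "column '{}' has type {:?} but a bound of type {:?}",
                    field.name,
                    field.data_type,
                    bound.data_type()
                ));
            }
        }
    }
    Ok(())
}
```

A check like this only catches mismatches in length and type; statistics for two columns of the same type that were swapped would still pass, which is one argument for keying statistics by name.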


2 participants