# Can't utilise struct delta stats with delta_table.to_pyarrow_dataset() #653
Thanks for reporting this. It looks like we are reading this in Rust, but the struct stats are held in a separate field on the Add action from the JSON-based stats: Lines 176 to 182 in 2ac3c3e

We currently call the function that reads only the JSON stats. @houqp What would you think of unifying those two functions so both sources are used?
There is no good reason to keep them separate other than being lazy :) @wjones127 what you proposed is the short-term fix we should definitely implement. In the long run, when we switch to a columnar in-memory format for all the actions, we can perform the unification at parse time to convert both of them into the same in-memory data structure.
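The unification proposed above can be sketched in plain Python. This is illustrative only, not delta-rs code: the helper name `get_stats` is hypothetical, and the `stats` (JSON string from commits) and `stats_parsed` (struct stats from checkpoints) fields mirror the Delta Add action.

```python
import json

def get_stats(add_action):
    # Prefer the already-parsed struct stats from a checkpoint.
    parsed = add_action.get("stats_parsed")
    if parsed is not None:
        return parsed
    # Fall back to the JSON-encoded stats from the commit log.
    raw = add_action.get("stats")
    if raw is not None:
        return json.loads(raw)
    return None

# A file whose stats exist only as a JSON string (from a commit):
json_add = {"path": "part-0.parquet",
            "stats": '{"numRecords": 10, "minValues": {"x": 1}, "maxValues": {"x": 5}}'}
# A file whose stats exist only in struct form (from a checkpoint):
struct_add = {"path": "part-1.parquet",
              "stats_parsed": {"numRecords": 7, "minValues": {"x": 6}, "maxValues": {"x": 9}}}

print(get_stats(json_add)["minValues"]["x"])    # 1
print(get_stats(struct_add)["maxValues"]["x"])  # 9
```

Either way the caller sees one stats structure, which is the short-term fix described above; the long-term version would do this conversion once at parse time.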
It looks like we agree on how to resolve this 🙂. I made https://github.com/Tom-Newton/delta-rs/pull/3, which mostly fixes the functionality and proved to me the value of having this feature. I can try to tidy it up and to handle struct and array types etc., but as somebody who is writing Rust code for the first time I don't feel particularly confident in my ability to do that...
# Description

When a delta table's schema is evolved, the struct stat schemas in checkpoints are also evolved. Since the struct stats are stored in a columnar way, adding a single file with the new columns will cause nulls to appear in the struct stats for all other files. This is a significant difference compared to the JSON stats. Unfortunately I overlooked this in #656 for both nullCounts and min/max values. This caused parsed struct stats to have extra columns full of nulls. I don't know if this was actually an issue at all, but it should be fixed even if just for the sake of the warning spam.

```
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
```

# Related Issue(s)

- Relates to #653, but for the most part it's an already solved issue.

# Changes

- Replace the test data with similar test data that includes a schema evolution.
- Add error handling for min/max values to ensure warnings will be logged for other unexpected types (there probably shouldn't be any). As a total Rust noob I originally filled with nulls, but I think that was a mistake.
- Ignore nulls for min/max stats and null count stats, since these are expected after schema evolution and should be ignored without logging a warning.

Usual disclaimer on a PR from me: I don't know what I'm doing writing Rust code. (Thanks to wjones for tidying up my dodgy Rust code 🙂)

Co-authored-by: Will Jones <[email protected]>
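The schema-evolution behaviour and the fix above can be illustrated with a small Python sketch (not delta-rs code; the function `parse_null_counts` and the sample stats are made up for illustration). Because struct stats are columnar, a file written before a column `b` existed ends up with a null `nullCount` entry for `b`, and the fix is to skip such nulls silently rather than warn:

```python
import logging

def parse_null_counts(null_counts):
    """Keep only usable nullCount entries. Nulls introduced by schema
    evolution are expected, so they are skipped without a warning."""
    parsed = {}
    for column, value in null_counts.items():
        if value is None:
            # Expected after schema evolution: ignore silently.
            continue
        if isinstance(value, (int, dict)):
            # int64 counts for leaf columns, nested dicts for structs.
            parsed[column] = value
        else:
            # Anything else is genuinely unexpected, so still warn.
            logging.warning(
                "Expect type of nullCount field to be struct or int64, got: %r",
                value)
    return parsed

# A file written before column `b` was added to the schema: the columnar
# checkpoint layout forces a null entry for `b` in its struct stats.
old_file_null_counts = {"a": 0, "b": None}
print(parse_null_counts(old_file_null_counts))  # {'a': 0}
```

With the earlier behaviour the `None` fell through to the warning branch, producing one log line per file per evolved column, which is exactly the spam shown above.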
# Environment

- Delta-rs version: 0.5.7
- Binding: Python
- Environment:

# Bug

**What happened:**
I want to utilise the cool new data skipping feature implemented by @wjones127 in #525 and #565. However, when using `delta_table.to_pyarrow_dataset()` for a table which only has struct stats, it opens every parquet file unnecessarily.

**What you expected to happen:**

The delta stats provide sufficient information to narrow it down to just one parquet file, so it should only need to open that one file.
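The data-skipping idea can be sketched minimally: with per-file min/max stats, an equality predicate like `x == 42` only needs the files whose `[min, max]` range contains 42. The file names, stats, and the helper `files_for_equality` below are made up for illustration; delta-rs expresses this as pyarrow partition expressions rather than a plain filter.

```python
# Per-file min/max stats, as would come from the Delta log.
files = [
    {"path": "part-0.parquet", "min": {"x": 0},  "max": {"x": 10}},
    {"path": "part-1.parquet", "min": {"x": 11}, "max": {"x": 50}},
    {"path": "part-2.parquet", "min": {"x": 51}, "max": {"x": 99}},
]

def files_for_equality(files, column, value):
    # Keep only files whose stats range could contain the value.
    return [f["path"] for f in files
            if f["min"][column] <= value <= f["max"][column]]

print(files_for_equality(files, "x", 42))  # ['part-1.parquet']
```

When the struct stats are never read, every file's range is effectively unknown, so nothing can be pruned and all three files would be opened.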
**How to reproduce it:**

Script to reproduce: reproduce_struct_stats_issue.zip

Steps this script does:

1. Creates a delta table configured with `"delta.checkpoint.writeStatsAsJson": "false"` and `"delta.checkpoint.writeStatsAsStruct": "true"`.
2. Queries the table with `DeltaTable.to_pyarrow_dataset()`.
I have also created a branch with a unit test to catch this: https://github.com/Tom-Newton/delta-rs/pull/1.
**More details:**

Logging the file and part_expressions returned:

We can see that the part_expression is `None` for all the files added prior to the latest checkpoint. The 2 files after the latest checkpoint have correct part_expressions, since their stats are simply read from the JSON commits.

From looking at the code, I think it never attempts to utilise struct stats in checkpoint files. Therefore, if like me your tables are configured to use only struct stats in checkpoints, you will miss out on this really cool feature.
I might have a go at making a PR for this but with my extremely limited rust knowledge it will be a challenge.