
Can't utilise struct delta stats with delta_table.to_pyarrow_dataset() #653

Closed · Tom-Newton opened this issue Jun 20, 2022 · 3 comments · Fixed by #656
Labels
bug Something isn't working

Comments

@Tom-Newton (Contributor)

Environment

Delta-rs version: 0.5.7

Binding: Python

Environment:

  • Cloud provider: Azure (bug is reproduced locally)
  • OS: Ubuntu 18.04
  • Other: Python 3.8

Bug

What happened:
I want to utilise the cool new data skipping feature implemented by @wjones127 in #525 and #565.
However, when using delta_table.to_pyarrow_dataset() on a table that only has struct stats, it opens every parquet file unnecessarily.

What you expected to happen:
The delta stats provide enough information to narrow the query down to a single parquet file, so only that one file should need to be opened.

How to reproduce it:
Script to reproduce: reproduce_struct_stats_issue.zip

The script performs these steps:

  1. Create a delta table with properties: "delta.checkpoint.writeStatsAsJson": "false" and "delta.checkpoint.writeStatsAsStruct": "true".
  2. Append rows to the table until it creates a checkpoint.
  3. Open the table using DeltaTable.to_pyarrow_dataset()
  4. Get the parquet fragments produced when applying a filter, and check that the number of fragments equals the number of files that could match the filter expression given knowledge of the delta stats.
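The data skipping the steps above exercise can be sketched in plain Python. This is an illustrative model, not delta-rs code: the `FileStats` class and `files_matching` helper are hypothetical names, standing in for the per-file min/max stats and the fragment filtering that `to_pyarrow_dataset()` should perform.

```python
# Hypothetical sketch of stats-based file skipping: given per-file min/max
# stats for column "a", an equality filter should narrow the candidate set
# to only the files whose stats range can contain the value.
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_a: int
    max_a: int

def files_matching(files, value):
    """Keep only files whose [min_a, max_a] range can contain `value`."""
    return [f.path for f in files if f.min_a <= value <= f.max_a]

files = [
    FileStats("part-00000.parquet", 0, 4),
    FileStats("part-00001.parquet", 5, 9),
    FileStats("part-00002.parquet", 10, 10),
]

# With usable stats, a filter a == 10 should open only one file.
print(files_matching(files, 10))  # ['part-00002.parquet']
```

When the struct stats are never parsed, the equivalent of `files_matching` has no min/max information for checkpointed files, so every file survives the filter; that is the behaviour the reproduction script detects.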

I have also created a branch with a unit test to catch this: https://github.com/Tom-Newton/delta-rs/pull/1.

More details:
Logging each file and the part_expression returned:

/tmp/reproduce_stats_issue/part-00000-431a48bb-dc24-438f-8329-60e9eae4ca22-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-1cbe22db-1188-46b9-b52d-e1c4a3ab6b61-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-c84ba96d-a42a-441f-b440-a22c1a48d344-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-beb74db9-8db2-431d-9be0-a437a01abe47-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-3c21a7a2-0405-4d33-8be5-b77286af386a-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-21283900-d6b2-48bc-94d7-88b554e32402-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-2ab94ca6-f2de-4b0f-869f-deccad42990c-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-7ff0882e-072b-4fe8-8642-1e378360395c-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-4140aad9-87f4-4c5a-a47a-6ad4ee4e57bf-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-924efae5-8d48-4e1e-a33f-11877165f0a0-c000.snappy.parquet None
/tmp/reproduce_stats_issue/part-00000-15d64514-967b-4ac4-b241-063a6b2f38e6-c000.snappy.parquet (((a >= 10) and (a <= 10)) and is_valid(a))
/tmp/reproduce_stats_issue/part-00000-25717410-d4c8-4c0f-b209-ff1aa56905e6-c000.snappy.parquet (((a >= 11) and (a <= 11)) and is_valid(a))

We can see that the part_expression is None for all files added prior to the latest checkpoint. The two files added after the latest checkpoint have correct part_expressions, since their stats are simply read from the JSON commits.
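The shape of those logged expressions can be reproduced with a small sketch. The `part_expression` helper below is hypothetical, not the delta-rs implementation; it just shows how a pyarrow-style guard expression follows mechanically from a file's per-column min/max stats, which is why files without parsed stats get None instead.

```python
# Illustrative sketch: build a guard expression from a file's column stats,
# matching the shape of the expressions logged above.
def part_expression(column: str, min_value, max_value) -> str:
    """Derive a pyarrow-style filter string from per-file min/max stats."""
    return (
        f"((({column} >= {min_value}) and ({column} <= {max_value})) "
        f"and is_valid({column}))"
    )

print(part_expression("a", 10, 10))
# (((a >= 10) and (a <= 10)) and is_valid(a))
```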

From looking at the code, I think it never attempts to utilise struct stats in checkpoint files. Therefore, if your tables, like mine, are configured to write only struct stats in checkpoints, you miss out on this really cool feature.

I might have a go at making a PR for this, but with my extremely limited Rust knowledge it will be a challenge.

@Tom-Newton Tom-Newton added the bug Something isn't working label Jun 20, 2022
@wjones127 (Collaborator)

Thanks for reporting this.

It looks like we are reading this in Rust, but the struct stats are held in a field on the Add action separate from the JSON-based stats:

delta-rs/rust/src/action.rs

Lines 176 to 182 in 2ac3c3e

```rust
/// Contains statistics (e.g., count, min/max values for columns) about the data in this file in
/// raw parquet format. This field needs to be written when statistics are available and the
/// table property: delta.checkpoint.writeStatsAsStruct is set to true.
///
/// This field is only available in add action records read from checkpoints
#[serde(skip_serializing, skip_deserializing)]
pub stats_parsed: Option<parquet::record::Row>,
```

We currently call get_stats() on add actions, which only reads the JSON stats. A separate function, get_stats_parsed(), reads the struct stats. The two return different types.

@houqp What would you think of unifying those two functions so get_stats() returns JSON ones or parquet ones depending on what's available? Or is there good reason to keep them separate?
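The unification proposed above can be sketched in Python (the actual fix would be in Rust, and the `Add` class here is a hypothetical stand-in, not the delta-rs type): a single accessor prefers the JSON stats and falls back to the checkpoint's parsed struct stats when the JSON field is absent.

```python
# Sketch of a unified get_stats(): prefer JSON stats from the commit log,
# fall back to struct stats parsed from a checkpoint.
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Add:
    stats: Optional[str] = None           # JSON string, from commit log
    stats_parsed: Optional[dict] = None   # struct stats, from checkpoint

def get_stats(add: Add) -> Optional[dict]:
    """Return stats as one type regardless of which representation is present."""
    if add.stats is not None:
        return json.loads(add.stats)
    return add.stats_parsed  # may be None if no stats were written at all

# Both representations of the same file yield the same result:
json_add = Add(stats='{"numRecords": 2, "minValues": {"a": 10}}')
struct_add = Add(stats_parsed={"numRecords": 2, "minValues": {"a": 10}})
assert get_stats(json_add) == get_stats(struct_add)
```

The key design point is that callers see one return type, so downstream code (such as the pyarrow dataset partition-expression builder) no longer has to know which checkpoint configuration wrote the stats.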

@houqp (Member)

houqp commented Jun 21, 2022

There is no good reason to keep them separate other than being lazy :) @wjones127, what you proposed is the short-term fix we should definitely implement. In the long run, when we switch to a columnar in-memory format for all the actions, we can perform the unification at parse time and convert both into the same in-memory data structure.

@Tom-Newton (Contributor, Author)

Tom-Newton commented Jun 23, 2022

It looks like we agree on how to resolve this 🙂. I made https://github.com/Tom-Newton/delta-rs/pull/3, which mostly fixes the functionality and proved to me the value of having this feature.

I can try to tidy it up and handle struct and array types etc., but as somebody writing Rust code for the first time I don't feel particularly confident in my ability to do that...

wjones127 added a commit that referenced this issue Nov 3, 2022
# Description
When a delta table's schema is evolved, the struct stats schemas in checkpoints are also evolved. Since the struct stats are stored in a columnar way, adding a single file with the new columns causes nulls to appear in the struct stats for all other files. This is a significant difference compared to the JSON stats.

Unfortunately, I overlooked this in #656 for both nullCount and min/max values. This caused parsed struct stats to have extra columns full of nulls. I don't know whether this actually caused problems, but it should be fixed, even if just for the sake of the warning spam.
```
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN  deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
``` 

# Related Issue(s)
- Relates to #653, but for the most part it's an already-solved issue.

# Changes:
- Replace the test data with similar test data that includes a schema
evolution.
- Add error handling for min/max values to ensure warnings will be logged for other unexpected types (there probably shouldn't be any). As a total Rust noob, I originally filled them with nulls, but I think that was a mistake.
- Ignore nulls for min/max stats and null count stats since these are
expected after schema evolution and should be ignored without logging a
warning.

Usual disclaimer on a PR from me: I don't know what I'm doing writing Rust code. (Thanks to wjones for tidying up my dodgy Rust code 🙂)

Co-authored-by: Will Jones <[email protected]>