# Can't utilise struct delta stats with delta_table.to_pyarrow_dataset() #653
Thanks for reporting this. It looks like we are reading this in Rust, but the struct stats are held in a separate field on the Add action from the JSON-based stats: Lines 176 to 182 in 2ac3c3e

We currently call the function that reads only the JSON stats. @houqp What would you think of unifying those two functions so both sources are used?
There is no good reason to keep them separate other than being lazy :) @wjones127 what you proposed is the short-term fix we should definitely implement. In the long run, when we switch to a columnar in-memory format for all the actions, we can perform the unification at parse time to convert both of them into the same in-memory data structure.
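The unification proposed above can be sketched in plain Python. This is illustrative only, not delta-rs code: the helper name `get_stats` is hypothetical, and the `stats` (JSON string from commits) and `stats_parsed` (struct stats from checkpoints) fields mirror the Delta Add action.

```python
import json

def get_stats(add_action):
    # Prefer the already-parsed struct stats from a checkpoint.
    parsed = add_action.get("stats_parsed")
    if parsed is not None:
        return parsed
    # Fall back to the JSON-encoded stats from the commit log.
    raw = add_action.get("stats")
    if raw is not None:
        return json.loads(raw)
    return None

# A file whose stats exist only as a JSON string (from a commit):
json_add = {"path": "part-0.parquet",
            "stats": '{"numRecords": 10, "minValues": {"x": 1}, "maxValues": {"x": 5}}'}
# A file whose stats exist only in struct form (from a checkpoint):
struct_add = {"path": "part-1.parquet",
              "stats_parsed": {"numRecords": 7, "minValues": {"x": 6}, "maxValues": {"x": 9}}}

print(get_stats(json_add)["minValues"]["x"])    # 1
print(get_stats(struct_add)["maxValues"]["x"])  # 9
```

Either way the caller sees one stats structure, which is the short-term fix described above; the long-term version would do this conversion once at parse time.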
It looks like we agree on how to resolve this 🙂. I made https://github.com/Tom-Newton/delta-rs/pull/3, which mostly fixes the functionality and proved to me the value of having this feature. I can try to tidy it up and to handle struct and array types etc., but as somebody who is writing Rust code for the first time I don't feel particularly confident in my ability to do that...
# Description

When a delta table's schema is evolved, the struct stat schemas in checkpoints are also evolved. Since the struct stats are stored in a columnar way, adding a single file with the new columns will cause nulls to appear in the struct stats for all other files. This is a significant difference compared to the JSON stats. Unfortunately I overlooked this in #656 for both nullCounts and min/max values. This caused parsed struct stats to have extra columns full of nulls. I don't know if this was actually an issue at all, but it should be fixed even if just for the sake of the warning spam.

```
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
```

# Related Issue(s)

- Relates to #653, but for the most part it's an already solved issue.

# Changes

- Replace the test data with similar test data that includes a schema evolution.
- Add error handling for min/max values to ensure warnings will be logged for other unexpected types (there probably shouldn't be any). As a total Rust noob I originally filled with nulls, but I think that was a mistake.
- Ignore nulls for min/max stats and null count stats, since these are expected after schema evolution and should be ignored without logging a warning.

Usual disclaimer on a PR from me: I don't know what I'm doing writing Rust code. (Thanks to wjones for tidying up my dodgy Rust code 🙂)

Co-authored-by: Will Jones <[email protected]>
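The schema-evolution behaviour and the fix above can be illustrated with a small Python sketch (not delta-rs code; the function `parse_null_counts` and the sample stats are made up for illustration). Because struct stats are columnar, a file written before a column `b` existed ends up with a null `nullCount` entry for `b`, and the fix is to skip such nulls silently rather than warn:

```python
import logging

def parse_null_counts(null_counts):
    """Keep only usable nullCount entries. Nulls introduced by schema
    evolution are expected, so they are skipped without a warning."""
    parsed = {}
    for column, value in null_counts.items():
        if value is None:
            # Expected after schema evolution: ignore silently.
            continue
        if isinstance(value, (int, dict)):
            # int64 counts for leaf columns, nested dicts for structs.
            parsed[column] = value
        else:
            # Anything else is genuinely unexpected, so still warn.
            logging.warning(
                "Expect type of nullCount field to be struct or int64, got: %r",
                value)
    return parsed

# A file written before column `b` was added to the schema: the columnar
# checkpoint layout forces a null entry for `b` in its struct stats.
old_file_null_counts = {"a": 0, "b": None}
print(parse_null_counts(old_file_null_counts))  # {'a': 0}
```

With the earlier behaviour the `None` fell through to the warning branch, producing one log line per file per evolved column, which is exactly the spam shown above.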
# Environment

- Delta-rs version: 0.5.7
- Binding: Python
- Environment:

# Bug

**What happened:**
I want to utilise the cool new data skipping feature implemented by @wjones127 in #525 and #565. However, when using `delta_table.to_pyarrow_dataset()` for a table which only has struct stats, it opens every parquet file unnecessarily.

**What you expected to happen:**

The delta stats provide sufficient information to narrow it down to just one parquet file, so it should only need to open that one file.
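The data-skipping idea can be sketched minimally: with per-file min/max stats, an equality predicate like `x == 42` only needs the files whose `[min, max]` range contains 42. The file names, stats, and the helper `files_for_equality` below are made up for illustration; delta-rs expresses this as pyarrow partition expressions rather than a plain filter.

```python
# Per-file min/max stats, as would come from the Delta log.
files = [
    {"path": "part-0.parquet", "min": {"x": 0},  "max": {"x": 10}},
    {"path": "part-1.parquet", "min": {"x": 11}, "max": {"x": 50}},
    {"path": "part-2.parquet", "min": {"x": 51}, "max": {"x": 99}},
]

def files_for_equality(files, column, value):
    # Keep only files whose stats range could contain the value.
    return [f["path"] for f in files
            if f["min"][column] <= value <= f["max"][column]]

print(files_for_equality(files, "x", 42))  # ['part-1.parquet']
```

When the struct stats are never read, every file's range is effectively unknown, so nothing can be pruned and all three files would be opened.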
**How to reproduce it:**

Script to reproduce: reproduce_struct_stats_issue.zip

Steps this script does:

1. Creates a delta table configured with `"delta.checkpoint.writeStatsAsJson": "false"` and `"delta.checkpoint.writeStatsAsStruct": "true"`.
2. Queries the table with `DeltaTable.to_pyarrow_dataset()`.
I have also created a branch with a unit test to catch this: https://github.com/Tom-Newton/delta-rs/pull/1.
**More details:**

Logging the file and part_expressions returned:

We can see that the part_expression is `None` for all the files added prior to the latest checkpoint. The 2 files after the latest checkpoint have correct part_expressions, since their stats are simply read from the JSON commits.

From looking at the code, I think it never attempts to utilise struct stats in checkpoint files. Therefore, if like me your tables are configured to use only struct stats in checkpoints, you will miss out on this really cool feature.
I might have a go at making a PR for this but with my extremely limited rust knowledge it will be a challenge.