Utilise struct stats when available #656
Conversation
python/tests/test_table_read.py (outdated review thread)

```diff
@@ -193,16 +193,6 @@ def test_read_table_with_stats():
     # assert data.num_rows == 0


-def test_read_table_with_only_struct_stats():
```
I'm removing this because everything should now be covered by the lower-level rust test I added. This did surface something interesting though: it seems like the `DeltaTable.to_pyarrow_schema()` function cannot handle the struct of array of map type I put in my test data. I think this is probably an issue for another time though.
Thanks for making this. I especially appreciate the example data with a variety of data types.
I left a few code suggestions.
I think we'll need to clean up decimal handling later. The behavior of Spark didn't match my existing expectations.
```rust
Field::Decimal(decimal) => match BigInt::from_signed_bytes_be(decimal.data()).to_f64() {
    Some(int) => json!(int / (10_i64.pow((decimal.scale()).try_into().unwrap()) as f64)),
    _ => serde_json::Value::Null,
},
```
So on the decimal representation, it does seem to be the case that the Spark implementation writes them out as numbers in the JSON file. I can't find the implementation code, but it does seem like they have a special parser that handles decimals. We don't, so I'm worried about having a lossy conversion of Decimal into float.
It's a complicated subject and sort of an edge case, so I'm fine with this for now, but we should do a follow-up to make sure we are using decimal statistics appropriately.
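To make the lossiness concern concrete, here is a minimal standalone sketch. It is not delta-rs code; it only assumes the `num-bigint` and `num-traits` crates already used by the snippet above, and the DECIMAL(38, 2) value is made up for illustration:

```rust
use num_bigint::BigInt;
use num_traits::ToPrimitive;

fn main() {
    // Hypothetical unscaled value for a DECIMAL(38, 2) statistic: 38 significant digits.
    let unscaled: BigInt = "12345678901234567890123456789012345678".parse().unwrap();
    let scale = 2u32;

    // Same conversion shape as the snippet above: BigInt -> f64, then divide by 10^scale.
    // f64 only carries roughly 15-16 significant decimal digits, so the tail of the value is lost.
    let lossy = unscaled.to_f64().unwrap() / 10f64.powi(scale as i32);
    println!("lossy min/max stat: {}", lossy);
}
```

This is why relying on the float value for decimal statistics feels risky, and a follow-up (or simply dropping the conversion) seems safer.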
Yeah I was a bit concerned about this too. I'm tempted to just remove it since that is probably a safer option.
Thanks @Tom-Newton!
I'm afraid I made a mistake in this PR. I just realised while testing #712. I've made a follow-up PR to fix this.
# Description

When a delta table's schema is evolved, the struct stat schemas in checkpoints are also evolved. Since the struct stats are stored in a columnar way, adding a single file with the new columns will cause nulls to appear in the struct stats for all other files. This is a significant difference compared to the json stats. Unfortunately I overlooked this in #656 for both nullCounts and min/max values. This caused parsed struct stats to have extra columns full of nulls. I don't know if this was actually an issue at all, but it should be fixed even if just for the sake of the warning spam.

```
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
[2022-10-24T22:13:22Z WARN deltalake::action::parquet_read] Expect type of nullCount field to be struct or int64, got: null
```

# Related Issue(s)

- Relates to #653, but for the most part it's an already solved issue.

# Changes:

- Replace the test data with similar test data that includes a schema evolution.
- Add error handling for min/max values to ensure warnings will be logged for other unexpected types (there probably shouldn't be any). As a total rust noob I originally filled with nulls, but I think that was a mistake.
- Ignore nulls for min/max stats and null count stats, since these are expected after schema evolution and should be ignored without logging a warning.

Usual disclaimer on a PR from me: I don't know what I'm doing writing rust code. (thanks to wjones for tidying up my dodgy rust code 🙂)

Co-authored-by: Will Jones <[email protected]>
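A rough sketch of the null handling described above. This is illustrative only, not the actual delta-rs `parquet_read` code; the function name `fold_null_count` and the use of `serde_json` values here are assumptions:

```rust
use serde_json::{Map, Value};

// Illustrative only: fold one nullCount entry from a checkpoint's columnar
// struct stats into the per-file stats map.
fn fold_null_count(field_name: &str, value: Value, out: &mut Map<String, Value>) {
    match value {
        // Nulls are expected after schema evolution (older files carry no
        // stats for newly added columns), so skip them without a warning.
        Value::Null => {}
        // Plain counts and nested structs are kept as-is.
        v @ (Value::Number(_) | Value::Object(_)) => {
            out.insert(field_name.to_string(), v);
        }
        // Anything else is genuinely unexpected, so it still warrants a warning.
        other => eprintln!(
            "Expect type of nullCount field to be struct or int64, got: {}",
            other
        ),
    }
}
```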
# Description

Updates `Add.get_stats()` to use parquet struct stats when available by parsing them to the same format the json stats are parsed to. This means delta-rs can still utilise stats on tables where checkpoint files are configured to only store struct stats.

# Related Issue(s)

# Documentation

I did my best but I've never written rust code before...
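For illustration, a hypothetical caller after this change might look like the sketch below. `Add.get_stats()` is the method this PR touches, but the surrounding names (`deltalake::action::Add`, the `path` and `num_records` fields, and the `Result<Option<Stats>, _>` shape) are assumptions rather than confirmed API:

```rust
use deltalake::action::Add;

// Hypothetical usage sketch: stats should now be available whether the
// checkpoint stored them as a json string or only as parquet struct stats.
fn print_record_counts(adds: &[Add]) {
    for add in adds {
        match add.get_stats() {
            Ok(Some(stats)) => println!("{}: {} records", add.path, stats.num_records),
            Ok(None) => println!("{}: no stats recorded", add.path),
            Err(err) => eprintln!("{}: failed to parse stats: {}", add.path, err),
        }
    }
}
```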