Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.get_add_actions() returns wrong column statistics when dataSkippingNumIndexedCols property of the table was changed #1223

Closed
robertkossendey opened this issue Mar 14, 2023 · 2 comments · Fixed by #1342
Labels
bug Something isn't working

Comments

@robertkossendey
Copy link

Environment

Delta-rs version:
0.7.0

Binding:
Python


Bug

What happened:
.get_add_actions() returns wrong column statistics when dataSkippingNumIndexedCols property of the table was changed.
It returns only the stats for columns from the latest file.

What you expected to happen:
It should return all the stats for all columns of all the files.

How to reproduce it:

Had to do it in delta-spark, since I don't think there is a way of setting table properties in delta-rs.

path = f"./latestversion"

data = [
    (1, "a", None),
    (2, "b", "b"),
    (3, "c", "c"),
]
df = spark.createDataFrame(
    data,
    [
        "col1",
        "col2",
        "col3",
    ],
)
    
df.coalesce(1).write.mode("append").format("delta").save(path)

# .get_add_actions() will return min, max and null_count for col1, col2 and col3

spark.sql("ALTER TABLE delta.`FULL_PATH` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 1);")

df.coalesce(1).write.mode("append").format("delta").save(path)

# .get_add_actions() will only return min, max and null_count for col1
@mrjoe7
Copy link
Contributor

mrjoe7 commented May 7, 2023

This is caused by how DeltaTableState::stats_as_batch is producing aggregates here.
In your example above, the values of column max.col3 would look like ["c", None] which

.collect::<Option<Vec<&serde_json::Value>>>()

will convert into None because according to Option::from_iter api docs:

Takes each element in the Iterator: if it is None, no further elements are taken, and the None is returned.

@wjones127
Copy link
Collaborator

It should return all the stats for all columns of all the files.

@robertkossendey do you want it to return statistics for all columns? Or just all columns that at some point collected statistics? I agree the current behavior is wrong, but I'm just not sure we want to return statistics for columns that are always null.

I could see an argument that it's nice to have all the columns represented there. But there may be some users who have tables with very large schemas, where they intentionally don't collect statistics for most columns.

wjones127 added a commit that referenced this issue May 13, 2023
# Description
This is a proposal for how #1223 could be fixed.

# Related Issue(s)
- fixes #1223

# Documentation
The current implementation excludes all columns that lack statistical
information. The proposed fix will generate information for all columns,
with missing statistical values being replaced by 'null' values.
However, it is unclear if this is the correct behavior since the
`stats_as_batch` function lacks documentation.

Co-authored-by: Will Jones <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants