You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
What happened: .get_add_actions() returns wrong column statistics when dataSkippingNumIndexedCols property of the table was changed.
It returns only the stats for columns from the latest file.
What you expected to happen:
It should return all the stats for all columns of all the files.
How to reproduce it:
Had to do it in delta-spark, since I don't think there is a way of setting table properties in delta-rs.
path = f"./latestversion"
data = [
(1, "a", None),
(2, "b", "b"),
(3, "c", "c"),
]
df = spark.createDataFrame(
data,
[
"col1",
"col2",
"col3",
],
)
df.coalesce(1).write.mode("append").format("delta").save(path)
# .get_add_actions() will return min, max and null_count for col1, col2 and col3
spark.sql("ALTER TABLE delta.`FULL_PATH` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 1);")
df.coalesce(1).write.mode("append").format("delta").save(path)
# .get_add_actions() will only return min, max and null_count for col1
The text was updated successfully, but these errors were encountered:
This is caused by how DeltaTableState::stats_as_batch is producing aggregates here.
In your example above, the values of column max.col3 would look like ["c", None] which
.collect::<Option<Vec<&serde_json::Value>>>()
will convert into None because according to Option::from_iter api docs:
Takes each element in the Iterator: if it is None, no further elements are taken, and the None is returned.
It should return all the stats for all columns of all the files.
@robertkossendey do you want it to return statistics for all columns? Or just all columns that at some point collected statistics? I agree the current behavior is wrong, but I'm just not sure we want to return statistics for columns that are always null.
I could see an argument that it's nice to have all the columns represented there. But there may be some users who have tables with very large schemas, where they intentionally don't collect statistics for most columns.
# Description
This is a proposal for how #1223 could be fixed.
# Related Issue(s)
- fixes#1223
# Documentation
The current implementation excludes all columns that lack statistical
information. The proposed fix will generate information for all columns,
with missing statistical values being replaced by 'null' values.
However, it is unclear if this is the correct behavior since the
`stats_as_batch` function lacks documentation.
Co-authored-by: Will Jones <[email protected]>
Environment
Delta-rs version:
0.7.0
Binding:
Python
Bug
What happened:
.get_add_actions()
returns wrong column statistics when dataSkippingNumIndexedCols property of the table was changed.It returns only the stats for columns from the latest file.
What you expected to happen:
It should return all the stats for all columns of all the files.
How to reproduce it:
Had to do it in delta-spark, since I don't think there is a way of setting table properties in delta-rs.
The text was updated successfully, but these errors were encountered: