.get_add_actions() returns wrong column statistics when dataSkippingNumIndexedCols property of the table was changed #1223

robertkossendey · 2023-03-14T11:35:39Z

Environment

Delta-rs version:
0.7.0

Binding:
Python

Bug

What happened:
.get_add_actions() returns wrong column statistics when dataSkippingNumIndexedCols property of the table was changed.
It returns only the stats for columns from the latest file.

What you expected to happen:
It should return all the stats for all columns of all the files.

How to reproduce it:

Had to do it in delta-spark, since I don't think there is a way of setting table properties in delta-rs.

path = "./latestversion"

data = [
    (1, "a", None),
    (2, "b", "b"),
    (3, "c", "c"),
]
df = spark.createDataFrame(
    data,
    [
        "col1",
        "col2",
        "col3",
    ],
)
    
df.coalesce(1).write.mode("append").format("delta").save(path)

# .get_add_actions() will return min, max and null_count for col1, col2 and col3

spark.sql("ALTER TABLE delta.`FULL_PATH` SET TBLPROPERTIES(delta.dataSkippingNumIndexedCols = 1);")

df.coalesce(1).write.mode("append").format("delta").save(path)

# .get_add_actions() will only return min, max and null_count for col1


mrjoe7 · 2023-05-07T23:14:53Z

This is caused by how DeltaTableState::stats_as_batch produces its aggregates.
In your example above, the per-file values of column max.col3 would look like ["c", None], which

.collect::<Option<Vec<&serde_json::Value>>>()

converts into None because, according to the Option::from_iter API docs:

Takes each element in the Iterator: if it is None, no further elements are taken, and the None is returned.
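
To make the short-circuit concrete, here is a minimal sketch that uses only the standard library. Plain `&str` values stand in for `serde_json::Value`, and `collect_all_or_nothing` is a hypothetical helper name, not a function from delta-rs.

```rust
/// Collects the per-file values of one statistics column.
/// Collecting into `Option<Vec<_>>` yields `None` as soon as any
/// single file has no value, so the whole column is lost.
fn collect_all_or_nothing<'a>(per_file: &[Option<&'a str>]) -> Option<Vec<&'a str>> {
    per_file.iter().copied().collect()
}

fn main() {
    // max.col3: the first file has a statistic, the second file does not.
    let max_col3 = [Some("c"), None];
    // max.col1: both files have a statistic (string-typed here for simplicity).
    let max_col1 = [Some("3"), Some("3")];

    println!("max.col3 -> {:?}", collect_all_or_nothing(&max_col3));
    println!("max.col1 -> {:?}", collect_all_or_nothing(&max_col1));
}
```

With this pattern a column survives only if every file carries a value for it, which matches the symptom reported above.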

wjones127 · 2023-05-08T00:08:15Z

"It should return all the stats for all columns of all the files."

@robertkossendey do you want it to return statistics for all columns? Or just all columns that at some point collected statistics? I agree the current behavior is wrong, but I'm just not sure we want to return statistics for columns that are always null.

I could see an argument that it's nice to have all the columns represented there. But there may be some users who have tables with very large schemas, where they intentionally don't collect statistics for most columns.

# Description
This is a proposal for how #1223 could be fixed.

# Related Issue(s)
- fixes #1223

# Documentation
The current implementation excludes all columns that lack statistical information. The proposed fix will generate information for all columns, with missing statistical values being replaced by 'null' values. However, it is unclear if this is the correct behavior since the `stats_as_batch` function lacks documentation.

Co-authored-by: Will Jones <[email protected]>
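
The behavior described in the proposal can be sketched roughly as follows. This is not the code from the linked pull request: `FileStats` and `max_columns` are hypothetical names, statistics are simplified to strings, and only the `max` values are modeled. The idea is to keep one slot per file for every column and leave a `None` (null) where a file recorded no statistic.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for one file's parsed max-value statistics,
/// keyed by column name.
type FileStats = BTreeMap<&'static str, &'static str>;

/// Builds a `max.<column>` entry for every schema column, with one value
/// per file and `None` where that file has no statistic for the column.
fn max_columns(
    schema: &[&'static str],
    files: &[FileStats],
) -> BTreeMap<String, Vec<Option<&'static str>>> {
    schema
        .iter()
        .map(|col| {
            let values: Vec<Option<&'static str>> =
                files.iter().map(|f| f.get(col).copied()).collect();
            (format!("max.{col}"), values)
        })
        .collect()
}

fn main() {
    // First write: statistics for all three columns.
    let first = FileStats::from([("col1", "3"), ("col2", "c"), ("col3", "c")]);
    // Second write, after dataSkippingNumIndexedCols = 1: only col1.
    let second = FileStats::from([("col1", "3")]);

    let columns = max_columns(&["col1", "col2", "col3"], &[first, second]);
    for (name, values) in &columns {
        println!("{name}: {values:?}");
    }
}
```

Because the sketch iterates over the schema, it also emits columns that never had statistics, which is the trade-off raised in the comment above for tables with very large schemas.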

robertkossendey added the bug Something isn't working label Mar 14, 2023

robertkossendey mentioned this issue Mar 14, 2023

Added col stat helpers mrpowers-io/levi#10

Open

mrjoe7 mentioned this issue May 7, 2023

fix: include stats for all columns (#1223) #1342

Merged

wjones127 closed this as completed in #1342 May 13, 2023
