In the `data_file_statistics_from_parquet_metadata` function, we track the `invalidate_col` set, which is largely present to mark when a column's row group does not have statistics set, meaning we cannot know the overall statistics accurately. I have a concern with the current implementation that I would like to hear the opinion of those with more experience in this field.

The current implementation resets the invalidated-column set every row group, while only deleting the invalid columns after all row groups have been processed. I can imagine a scenario where the first row group has no statistics but the last row group does. In that scenario, I think invalid Iceberg metadata would be generated that includes metrics for columns which do not have statistics in all row groups. I think the fix is to hoist the invalid-column set out of the loops to avoid resetting the variable every row group.
Proposal:

```python
def data_file_statistics_from_parquet_metadata():
    invalidate_col: Set[int] = set()  # moved out of the loop, so it now tracks for the whole file
    for r in range(parquet_metadata.num_row_groups):
        for pos in range(parquet_metadata.num_columns):
            if column.is_stats_set:
                ...  # aggregate the column statistics as before
            else:
                invalidate_col.add(field_id)
    for field_id in invalidate_col:
        del col_aggs[field_id]
        del null_value_counts[field_id]
```
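To make the effect concrete, here is a minimal standalone simulation (hypothetical, not the actual pyiceberg code) that contrasts resetting the set per row group with hoisting it out of the loop. Each dict marks, per column `field_id`, whether that row group has statistics:

```python
# Hypothetical model of the loop: two row groups, two columns.
# Column 1 has no stats in row group 0 but does have stats in row group 1.
row_groups = [
    {1: False, 2: True},   # row group 0: column 1 missing stats
    {1: True,  2: True},   # row group 1: all columns have stats
]

def invalid_columns(row_groups, reset_per_row_group):
    invalidate_col = set()
    for rg in row_groups:
        if reset_per_row_group:   # models the current behavior
            invalidate_col = set()
        for field_id, has_stats in rg.items():
            if not has_stats:
                invalidate_col.add(field_id)
    return invalidate_col

print(invalid_columns(row_groups, reset_per_row_group=True))   # set(): column 1 wrongly kept
print(invalid_columns(row_groups, reset_per_row_group=False))  # {1}: column 1 correctly dropped
```

With the per-row-group reset, only the last row group's result survives, so column 1's incomplete stats would be written into the Iceberg metadata.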
Hey @cgbur, that's a great point you're making here. I don't think this is taken into consideration, and it might even lead to invalid data. Did you encounter this in practice, and if so, do you know how to reproduce it?
The statistics are very fundamental to Iceberg, and having no statistics will introduce a lot of inefficiency in query planning and actual data processing (the file will be included in every query plan).
I have not come up with a reproducible example, but I discovered this when I found a bug in my code: pyarrow does not write stats for columns where all the values are null. I suspected that if an entire row group is null it might do the same, but in that case it still writes out stats. I think whether stats should be written for all-null columns is an issue for that library.
I do think it's valid under the Parquet spec to write stats for column A in row group 0 but not in row group 1, etc. I cannot find this in the spec's documentation, but I have no reason to believe it is invalid. In all likelihood, none of the current big writers do this. However, I do think the current behavior of resetting every row group, and thereby only using the results from the last row group, is incorrect and should be changed. Currently, stats are invalidated for a column if the last row group has no statistics, when I think the intention was to invalidate a column's stats if any of the row groups for that column lacks statistics.