Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Are invalidated columns tracked properly during data file creation? #862

Closed
cgbur opened this issue Jun 26, 2024 · 2 comments · Fixed by #911
Closed

Are invalidated columns tracked properly during data file creation? #862

cgbur opened this issue Jun 26, 2024 · 2 comments · Fixed by #911

Comments

@cgbur
Copy link

cgbur commented Jun 26, 2024

Apache Iceberg version

None

Please describe the bug 🐞

In the data_file_statistics_from_parquet_metadata function, we track the invalidate_col set which is largely present to mark when a column row group does not have statistic set, therefore we can not know the overall statistics accurately. I have a concern with the current implementation that I would like to hear those with more experience in this fields opinion on.

The current implementation resets the invalidated columns every row group while only deleting the invalid columns after all row groups have been processed. I can imagine a scenario where the first row group has no statistics, but the last row group does. In this scenario I think invalid iceberg metadata would be generated which includes metrics for columns that do not contain statistics in all groups. I think the fix for this is to hoist the invalid columns set out of the loops to avoid resetting the variable every row group.

currently:

def data_file_statistics_from_parquet_metadata():
    for r in range(parquet_metadata.num_row_groups):
        invalidate_col: Set[int] = set() # reset every row group
        for pos in range(parquet_metadata.num_columns):
            if column.is_stats_set:
            else:
                invalidate_col.add(field_id)


    for field_id in invalidate_col:
        del col_aggs[field_id]
        del null_value_counts[field_id]

proposal:

def data_file_statistics_from_parquet_metadata():
    invalidate_col: Set[int] = set() # moved so now tracks for whole file
    for r in range(parquet_metadata.num_row_groups):
        for pos in range(parquet_metadata.num_columns):
            if column.is_stats_set:
            else:
                invalidate_col.add(field_id)


    for field_id in invalidate_col:
        del col_aggs[field_id]
        del null_value_counts[field_id]
@Fokko
Copy link
Contributor

Fokko commented Jun 26, 2024

Hey @cgbur that's a great point that you're making here. I don't think this is taken into consideration, and might even lead to invalid data. Did you encounter this in practice, if so, do you know how to reproduce this?

The statistics are very fundamental to Iceberg, and having no statistics will introduce a lot of inefficiency in query planning and actual data processing (the file will be included in every query plan).

@cgbur
Copy link
Author

cgbur commented Jun 26, 2024

I have not come up with a reproducible example but discovered this when I found a bug in my code because pyarrow does not write stats for columns where all the values are null. I suspected that if a row group is null then it may do the same, but in that case it still writes out stats. I think whether you should write out stats for all null columns is an issue for that crate.

I do think its valid in the parquet spec to write stats for column A in row group 0 but not in row group 1 etc. I can not find it in the documentation of the spec but have no reason to believe this is invalid. In all likelihood, none of the current big writers do this. However, I do think the current behavior of reseting every row group and only using the results from the last row group is incorrect and should likely be changed. Currently stats are invalidated for a column if the last row group has no statistics, when I think the intention was to invalidate a columns stats if any of the row groups for that column has no statistics.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants