Incorrect statistics extracted from parquet data pages when all values are null #11280

alamb · 2024-07-05T11:05:41Z

Originally posted by @efredine in #10922 (comment)

We always flatten the data page stats iterator - following the pattern from the initial PR: https://github.com/apache/datafusion/pull/10852/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40R582

But I'm wondering if flatten is the right thing to do here?

The min or max values for each page will be None if all the values on the page happen to be null: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/page_index/index.rs#L37-L44

Using flatten in this case will mean that the length of the result will be shorter than the number of data pages? So, is it possible that rather than flatten we instead want to do something like a flat map where the Some values are flattened and None values are mapped to a null value?
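To make the concern concrete, here is a minimal sketch in plain Rust with no parquet or arrow types involved. The function names and the use of `Option<i32>` as a stand-in for a page's min statistic are assumptions for illustration only, not the DataFusion code. It shows that flattening an iterator of `Option` silently drops the `None` entries, while keeping the `Option` leaves one slot per page.

```rust
/// Hypothetical stand-in for per-page min statistics:
/// `None` means every value on that page was null.
/// Flattening drops the `None` entries, so the output can be
/// shorter than the number of pages.
fn page_mins_flattened(pages: &[Option<i32>]) -> Vec<i32> {
    pages.iter().copied().flatten().collect()
}

/// Keeping the `Option` preserves one entry per page; a `None`
/// here is what would become a null in the resulting array.
fn page_mins_preserved(pages: &[Option<i32>]) -> Vec<Option<i32>> {
    pages.iter().copied().collect()
}
```

With three pages where the middle one is entirely null, the flattened version has two entries and no longer lines up with the page indexes, whereas the preserved version still has three.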

Potential user impact:

The code appended nulls for missing values. However, I think in most cases, missing values are simply omitted because all the None values are removed by flattening. So, in general, users of the data page statistics will need to check whether or not the length of the array matches the number of actual data pages? This is different from how the row group statistics are handled - they will instead have a null value for any missing statistics.

Is this difference in behaviour expected or just a side effect of the implementation?

A: I think it is a side effect of the implementation, and not a good one.

alamb · 2024-07-05T11:06:45Z

Ideally what I think we should do is to write up a test case (using your suggestion of a column / page that is entirely null) and verify there is a problem / fix it.

efredine · 2024-07-05T13:22:14Z

Take

efredine · 2024-07-05T16:58:53Z

Ok - quick update - I had some misunderstanding of what was going on. I think there may still be a problem, but it's different from what I originally thought.

My misunderstanding: the top level flatten is because we have an iterator of iterators. The inner iterator is iterating over the page indexes within a ColumnIndex. That inner iterator returns an Option.

And the original PR added an explicit test case for the scenario of a data page with all null values:
https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet/arrow_statistics.rs#L475-L504

However, the tests for all the other data types don't cover this scenario and there are a bunch of places where we are doing a filter_map in the inner loop which should probably just be a map. But I need to write tests to prove this first and then expand the coverage if it turns out to be the case.
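A rough sketch of the distinction, using plain Rust collections rather than the real parquet types. `ColumnIndexMins` and both function names are hypothetical, and each inner `Vec` only models the per-page values inside one column index. In both versions the outer `flatten` just joins the inner iterators; what differs is whether the inner loop drops `None` (`filter_map`) or keeps it (`map`).

```rust
/// Hypothetical model: one inner Vec per column index, one entry per
/// data page (`None` when the page is entirely null).
type ColumnIndexMins = Vec<Option<i64>>;

/// Inner `filter_map` discards all-null pages, so the result has
/// fewer entries than there are data pages.
fn mins_with_filter_map(indexes: &[ColumnIndexMins]) -> Vec<Option<i64>> {
    indexes
        .iter()
        .map(|index| index.iter().filter_map(|min| min.map(Some)))
        .flatten() // joins the inner iterators; drops nothing itself
        .collect()
}

/// Inner `map` keeps every page, leaving `None` where the page had
/// no min value.
fn mins_with_map(indexes: &[ColumnIndexMins]) -> Vec<Option<i64>> {
    indexes
        .iter()
        .map(|index| index.iter().map(|min| *min))
        .flatten()
        .collect()
}
```

For two column indexes with two pages each, one of which is all null in each index, the `filter_map` version yields two entries and the `map` version yields four, one per data page.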

alamb mentioned this issue Jul 5, 2024

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

Closed

23 tasks

alamb mentioned this issue Jul 5, 2024

Further to the performance discussion @alamb - the StringBuilder pattern you suggested in https://github.com/apache/datafusion/pull/11136#discussion_r1657725214 does seem to materially improve performance: #11279

Closed

github-actions bot assigned efredine Jul 5, 2024

efredine mentioned this issue Jul 5, 2024

Fix data page statistics when all rows are null in a data page #11295

Merged

alamb closed this as completed in #11295 Jul 7, 2024
