Fix regression with Incorrect results when reading parquet files with different schemas and statistics #8533

alamb · 2023-12-13T21:41:43Z

Which issue does this PR close?

Rationale for this change

#8294 introduced a regression that causes parquet files to be incorrectly pruned in some cases, leading to incorrect query results (some rows that should be returned are filtered out).

Note: this regression has not been released yet.

What changes are included in this PR?

Fix the bug
Add test coverage (slt)

Are these changes tested?

Yes, new end-to-end .slt coverage is added.

Are there any user-facing changes?

Bug fix (for an unreleased bug)

alamb · 2023-12-13T21:42:32Z

datafusion/core/src/datasource/physical_plan/parquet/mod.rs

@@ -468,8 +468,10 @@ impl FileOpener for ParquetOpener {
                ParquetRecordBatchStreamBuilder::new_with_options(reader, options)
                    .await?;

+            let file_schema = builder.schema().clone();


I thought giving builder.schema() a name made the code clearer.

alamb · 2023-12-13T21:43:34Z

datafusion/core/src/datasource/physical_plan/parquet/mod.rs

@@ -481,8 +483,8 @@ impl FileOpener for ParquetOpener {
            if let Some(predicate) = pushdown_filters.then_some(predicate).flatten() {
                let row_filter = row_filter::build_row_filter(
                    &predicate,
-                    builder.schema().as_ref(),
-                    table_schema.as_ref(),
+                    &file_schema,


Drive-by cleanup -- no functional change.

alamb · 2023-12-13T21:51:06Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

@@ -80,7 +81,7 @@ pub(crate) fn prune_row_groups_by_statistics(
            let pruning_stats = RowGroupPruningStatistics {
                parquet_schema,
                row_group_metadata: metadata,
-                arrow_schema: predicate.schema().as_ref(),
+                arrow_schema,


This is the actual fix -- use the file schema rather than the table schema to look up columns.
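To illustrate why the choice of schema matters, here is a simplified, std-only sketch. It is not the DataFusion implementation: `ColumnStats`, `stats_for`, and `can_prune_eq` are hypothetical names, and the sketch assumes that statistics are stored in file column order and are found by resolving a column name to a position in some schema. Under that assumption, resolving the position against a table schema whose column order differs from the file's reads another column's statistics.

```rust
/// Min/max statistics for one column, stored in *file* column order.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: i64,
    max: i64,
}

/// Resolve `column` to a position in `schema`, then read the statistics at
/// that position. The result is only meaningful when `schema` is the schema
/// of the file the statistics came from.
fn stats_for<'a>(
    schema: &[&str],
    stats: &'a [ColumnStats],
    column: &str,
) -> Option<&'a ColumnStats> {
    let idx = schema.iter().position(|name| *name == column)?;
    stats.get(idx)
}

/// A row group can be skipped for `column = value` only when `value` lies
/// outside [min, max]. Without statistics the row group must be kept.
fn can_prune_eq(schema: &[&str], stats: &[ColumnStats], column: &str, value: i64) -> bool {
    match stats_for(schema, stats, column) {
        Some(s) => value < s.min || value > s.max,
        None => false,
    }
}
```

For example, with a file whose columns are `["b", "a"]` and statistics `b: 100..200`, `a: 1..3`, the predicate `a = 2` is kept when the lookup uses the file schema, but is wrongly pruned when the lookup uses a table schema ordered `["a", "b"]`, because position 0 then points at the statistics for `b`.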

alamb · 2023-12-13T21:51:41Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

@@ -416,11 +417,11 @@ mod tests {
    fn row_group_pruning_predicate_simple_expr() {


Since I updated the signature of prune_row_groups_by_statistics, I also had to update the tests to match.

alamb · 2023-12-13T21:52:49Z

datafusion/sqllogictest/test_files/schema_evolution.slt

+
+# Should see all 7 rows that have 'a=foo'
+query TIR rowsort
+select * from parquet_table where a = 'foo';


This query only returns 3 rows (not all 7) without this fix. You can see the differences here: 5633b43

alamb · 2023-12-13T21:53:35Z

datafusion/sqllogictest/test_files/schema_evolution.slt

+##########
+
+
+statement ok


I am really loving how easy it is to write end to end tests with files after all the work @devinjdangelo did for parallel partitioned writes ❤️

viirya · 2023-12-13T22:29:14Z

datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

@@ -923,6 +944,7 @@ mod tests {
        let metrics = parquet_file_metrics();
        assert_eq!(
            prune_row_groups_by_statistics(
+                &schema,


Looks like these are existing tests? Should we add a test of prune_row_groups_by_statistics in which the file schema differs from the predicate schema?

alamb · 2023-12-14

That is a good idea -- I will do so.

The reason I think an end-to-end test in .slt is also required is that the bug comes from passing the wrong schema into prune_row_groups_by_statistics. I could write a test of prune_row_groups_by_statistics that passes (because the test passes in the right schema), but actual answers would still be wrong because the call site of prune_row_groups_by_statistics would pass the wrong one.

viirya · 2023-12-13T22:33:33Z

datafusion/sqllogictest/test_files/schema_evolution.slt

+foo NULL NULL
+foo NULL NULL


(Just for other reviewers to easily review the results)

File1:

foo 1 NULL
foo 2 NULL
foo 3 NULL

File2:

NULL 10 NULL

File3:

foo NULL NULL
foo NULL NULL

File4:

foo 100 10.5
foo 200 12.6
bzz 300 13.7
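As a cross-check of the expected row count, here is a rough std-only model of the four files listed above. It is not DataFusion code: `Row`, `files`, `a_min_max`, `can_skip`, and `matching_rows` are made-up names, and the min/max pruning rule is a simplified assumption about how statistics are used. It skips a file only when the statistics for column `a` prove that no row can match, then counts the rows with `a = 'foo'` in the remaining files.

```rust
/// One row of the test table: (a, b, c), each nullable.
type Row = (Option<&'static str>, Option<i64>, Option<f64>);

/// The four files from the listing above, one Vec<Row> per file.
fn files() -> Vec<Vec<Row>> {
    vec![
        vec![
            (Some("foo"), Some(1), None),
            (Some("foo"), Some(2), None),
            (Some("foo"), Some(3), None),
        ],
        vec![(None, Some(10), None)],
        vec![(Some("foo"), None, None), (Some("foo"), None, None)],
        vec![
            (Some("foo"), Some(100), Some(10.5)),
            (Some("foo"), Some(200), Some(12.6)),
            (Some("bzz"), Some(300), Some(13.7)),
        ],
    ]
}

/// Min and max of the non-null values of column `a`,
/// or None when every value in the file is null.
fn a_min_max(rows: &[Row]) -> Option<(&'static str, &'static str)> {
    let mut vals = rows.iter().filter_map(|r| r.0);
    let first = vals.next()?;
    Some(vals.fold((first, first), |(lo, hi), v| (lo.min(v), hi.max(v))))
}

/// A file may be skipped for `a = value` only if its statistics
/// prove that no row can match.
fn can_skip(rows: &[Row], value: &str) -> bool {
    match a_min_max(rows) {
        Some((lo, hi)) => value < lo || value > hi,
        // Every value is null, so `a = value` is never true.
        None => true,
    }
}

/// Count rows with `a = value` across all files that survive pruning.
fn matching_rows(value: &'static str) -> usize {
    files()
        .iter()
        .filter(|rows| !can_skip(rows, value))
        .map(|rows| rows.iter().filter(|r| r.0 == Some(value)).count())
        .sum()
}
```

In this model only File2 is skipped (all of its `a` values are NULL), and `matching_rows("foo")` counts 3 rows from File1, 2 from File3, and 2 from File4.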

… different schemas and statistics (#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test

Add test for schema evolution

dac4152

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 13, 2023

alamb force-pushed the alamb/fix_schema_stats branch from de3fc46 to df9d082 Compare December 13, 2023 21:43

alamb added 2 commits December 13, 2023 16:50

Fix reading parquet statistics

a454935

Update tests for fix

5633b43

alamb force-pushed the alamb/fix_schema_stats branch from df9d082 to 5633b43 Compare December 13, 2023 21:50

alamb mentioned this pull request Dec 13, 2023

Extract parquet statistics to its own module, add tests #8294

Merged

alamb marked this pull request as ready for review December 13, 2023 21:56

alamb requested a review from tustvold December 13, 2023 21:57

alamb mentioned this pull request Dec 13, 2023

Regression: Incorrect results when reading parquet files with different schemas and statistics #8532

Closed

viirya reviewed Dec 13, 2023

viirya approved these changes Dec 13, 2023

alamb added 3 commits December 14, 2023 08:51

Add comments to help explain the test

1a43af3

Add another test

6773375

Merge remote-tracking branch 'apache/main' into alamb/fix_schema_stats

181a1b1

tustvold approved these changes Dec 14, 2023

andygrove merged commit 974d49c into apache:main Dec 14, 2023
22 checks passed

appletreeisyellow mentioned this pull request Dec 14, 2023

chore: temporary branch for IOx update (11-30-2023 to 12-09-2023) #8543

Closed

alamb deleted the alamb/fix_schema_stats branch December 14, 2023 17:45

appletreeisyellow mentioned this pull request Jan 2, 2024

chore: temporary branch for IOx update (12-10-2023 to 12-13-2023) #8722

Closed

appletreeisyellow mentioned this pull request Jan 3, 2024

chore: temporary branch for IOx update (12-10-2023 try 2) #8741

Closed

appletreeisyellow mentioned this pull request Jan 8, 2024

chore: temporary branch for IOx update (11-30-2023 to 12-07-2023) appletreeisyellow/datafusion#5

Closed

appletreeisyellow mentioned this pull request Jan 8, 2024

chore: temporary branch for IOx update (12-08-2023 to 12-09-2023) appletreeisyellow/datafusion#6

Closed

matthewgapp mentioned this pull request Jan 11, 2024

matt/feat/recursive ctes/config flag matthewgapp/arrow-datafusion#3

Closed
