Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix regression with Incorrect results when reading parquet files with different schemas and statistics #8533

Merged
merged 6 commits into from
Dec 14, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 13, 2023

Which issue does this PR close?

Fixes #8532

Rationale for this change

#8294 introduced a regression which will cause parquet files to be incorrectly pruned in some cases, leading to incorrect query results (some rows are filtered out incorrectly)

Note this regression was not released yet

What changes are included in this PR?

  1. Fix the bug
  2. Add test coverage (slt)

Are these changes tested?

Yes, new end to end .slt coverage is added

Are there any user-facing changes?

Bug fix (for an unreleased bug)

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Dec 13, 2023
@@ -468,8 +468,10 @@ impl FileOpener for ParquetOpener {
ParquetRecordBatchStreamBuilder::new_with_options(reader, options)
.await?;

let file_schema = builder.schema().clone();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought giving builder.schema() a name made the code clearer.

@@ -481,8 +483,8 @@ impl FileOpener for ParquetOpener {
if let Some(predicate) = pushdown_filters.then_some(predicate).flatten() {
let row_filter = row_filter::build_row_filter(
&predicate,
builder.schema().as_ref(),
table_schema.as_ref(),
&file_schema,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drive by cleanup -- no functioanl change

@@ -80,7 +81,7 @@ pub(crate) fn prune_row_groups_by_statistics(
let pruning_stats = RowGroupPruningStatistics {
parquet_schema,
row_group_metadata: metadata,
arrow_schema: predicate.schema().as_ref(),
arrow_schema,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the actual fix -- to use the file schema rather than the table schema to lookup columns

@@ -416,11 +417,11 @@ mod tests {
fn row_group_pruning_predicate_simple_expr() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I updated the signature of prune_row_groups_by_statistics I also had to update the tests appropriately


# Should see all 7 rows that have 'a=foo'
query TIR rowsort
select * from parquet_table where a = 'foo';
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query only returns 3 rows (not all 7) without this fix. You can see the differences here: 5633b43

##########


statement ok
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am really loving how easy it is to write end to end tests with files after all the work @devinjdangelo did for parallel partitioned writes ❤️

@@ -923,6 +944,7 @@ mod tests {
let metrics = parquet_file_metrics();
assert_eq!(
prune_row_groups_by_statistics(
&schema,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these are existing tests? Should we add one test of prune_row_groups_by_statistics which schema (file schema) is different to predicate schema?

Copy link
Contributor Author

@alamb alamb Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good idea -- I will do so

The reason I think an end to end test in .slt is also required is that the bug is related to passing the wrong schema into prune_row_groups_by_statistics -- so I could write a test in terms of prune_row_groups_by_statistics that passes (passes in the right schema) but actual answers would still be wrong because the callsite of prune_row_groups_by_statistics would pass the wrong one

Comment on lines +74 to +75
foo NULL NULL
foo NULL NULL
Copy link
Member

@viirya viirya Dec 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just for other reviewers to easily review the results)

File1:

foo 1 NULL
foo 2 NULL
foo 3 NULL

File2:

NULL 10 NULL

File3:

foo NULL NULL
foo NULL NULL

File4:

foo 100 10.5
foo 200 12.6
bzz 300 13.7

@andygrove andygrove merged commit 974d49c into apache:main Dec 14, 2023
22 checks passed
andygrove pushed a commit that referenced this pull request Dec 14, 2023
… different schemas and statistics (#8533)

* Add test for schema evolution

* Fix reading parquet statistics

* Update tests for fix

* Add comments to help explain the test

* Add another test
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 14, 2023
… different schemas and statistics (apache#8533)

* Add test for schema evolution

* Fix reading parquet statistics

* Update tests for fix

* Add comments to help explain the test

* Add another test
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 14, 2023
… different schemas and statistics (apache#8533)

* Add test for schema evolution

* Fix reading parquet statistics

* Update tests for fix

* Add comments to help explain the test

* Add another test
@alamb alamb deleted the alamb/fix_schema_stats branch December 14, 2023 17:45
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 2, 2024
… different schemas and statistics (apache#8533)

* Add test for schema evolution

* Fix reading parquet statistics

* Update tests for fix

* Add comments to help explain the test

* Add another test
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 3, 2024
… different schemas and statistics (apache#8533)

* Add test for schema evolution

* Fix reading parquet statistics

* Update tests for fix

* Add comments to help explain the test

* Add another test
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 8, 2024
… different schemas and statistics (apache#8533)

* Add test for schema evolution

* Fix reading parquet statistics

* Update tests for fix

* Add comments to help explain the test

* Add another test
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Jan 8, 2024
… different schemas and statistics (apache#8533)

* Add test for schema evolution

* Fix reading parquet statistics

* Update tests for fix

* Add comments to help explain the test

* Add another test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression: Incorrect results when reading parquet files with different schemas and statistics
4 participants