-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix regression with Incorrect results when reading parquet files with different schemas and statistics #8533
Conversation
de3fc46
to
df9d082
Compare
df9d082
to
5633b43
Compare
@@ -468,8 +468,10 @@ impl FileOpener for ParquetOpener { | |||
ParquetRecordBatchStreamBuilder::new_with_options(reader, options) | |||
.await?; | |||
|
|||
let file_schema = builder.schema().clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought giving builder.schema()
a name made the code clearer.
@@ -481,8 +483,8 @@ impl FileOpener for ParquetOpener { | |||
if let Some(predicate) = pushdown_filters.then_some(predicate).flatten() { | |||
let row_filter = row_filter::build_row_filter( | |||
&predicate, | |||
builder.schema().as_ref(), | |||
table_schema.as_ref(), | |||
&file_schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive by cleanup -- no functioanl change
@@ -80,7 +81,7 @@ pub(crate) fn prune_row_groups_by_statistics( | |||
let pruning_stats = RowGroupPruningStatistics { | |||
parquet_schema, | |||
row_group_metadata: metadata, | |||
arrow_schema: predicate.schema().as_ref(), | |||
arrow_schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the actual fix -- to use the file schema rather than the table schema to lookup columns
@@ -416,11 +417,11 @@ mod tests { | |||
fn row_group_pruning_predicate_simple_expr() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I updated the signature of prune_row_groups_by_statistics
I also had to update the tests appropriately
|
||
# Should see all 7 rows that have 'a=foo' | ||
query TIR rowsort | ||
select * from parquet_table where a = 'foo'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This query only returns 3 rows (not all 7) without this fix. You can see the differences here: 5633b43
########## | ||
|
||
|
||
statement ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really loving how easy it is to write end to end tests with files after all the work @devinjdangelo did for parallel partitioned writes ❤️
@@ -923,6 +944,7 @@ mod tests { | |||
let metrics = parquet_file_metrics(); | |||
assert_eq!( | |||
prune_row_groups_by_statistics( | |||
&schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like these are existing tests? Should we add one test of prune_row_groups_by_statistics
which schema
(file schema) is different to predicate schema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good idea -- I will do so
The reason I think an end to end test in .slt
is also required is that the bug is related to passing the wrong schema into prune_row_groups_by_statistics
-- so I could write a test in terms of prune_row_groups_by_statistics
that passes (passes in the right schema) but actual answers would still be wrong because the callsite of prune_row_groups_by_statistics
would pass the wrong one
foo NULL NULL | ||
foo NULL NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Just for other reviewers to easily review the results)
File1:
foo 1 NULL
foo 2 NULL
foo 3 NULL
File2:
NULL 10 NULL
File3:
foo NULL NULL
foo NULL NULL
File4:
foo 100 10.5
foo 200 12.6
bzz 300 13.7
… different schemas and statistics (#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
… different schemas and statistics (apache#8533) * Add test for schema evolution * Fix reading parquet statistics * Update tests for fix * Add comments to help explain the test * Add another test
Which issue does this PR close?
Fixes #8532
Rationale for this change
#8294 introduced a regression which will cause parquet files to be incorrectly pruned in some cases, leading to incorrect query results (some rows are filtered out incorrectly)
Note this regression was not released yet
What changes are included in this PR?
Are these changes tested?
Yes, new end to end .slt coverage is added
Are there any user-facing changes?
Bug fix (for an unreleased bug)