Don't use parquet file offset for file range pruning #5997

tustvold · 2023-04-13T18:44:20Z

Which issue does this PR close?

Rationale for this change

file_offset is not the offset into the file where the column is located; it is the location of non-inlined metadata (see https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L781). Some writers, such as DuckDB, set it to 0, which causes all row groups to be scheduled into the same partition.
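To make the failure mode concrete, here is a minimal sketch of range-based row group assignment. It is not the code from this PR: the types `ColumnChunkMeta`, `RowGroupMeta`, and `FileRange` and all function names are hypothetical stand-ins rather than the parquet crate's API, and the offsets in `sample_row_groups` are made up. The sketch assumes each partition owns a half-open byte range and that a row group belongs to the partition containing its starting offset, taken here as the first column's dictionary page offset or, if there is none, its first data page offset.

```rust
/// Hypothetical stand-in for a column chunk's metadata (not the parquet crate's type).
#[derive(Debug, Clone)]
struct ColumnChunkMeta {
    /// Offset of the dictionary page, if the column chunk has one.
    dictionary_page_offset: Option<i64>,
    /// Offset of the first data page.
    data_page_offset: i64,
}

/// Hypothetical stand-in for a row group's metadata.
#[derive(Debug, Clone)]
struct RowGroupMeta {
    /// Value of the thrift `file_offset` field; some writers leave it at 0.
    file_offset: i64,
    columns: Vec<ColumnChunkMeta>,
}

/// Half-open byte range [start, end) assigned to one partition.
#[derive(Debug, Clone, Copy)]
struct FileRange {
    start: i64,
    end: i64,
}

/// Where the row group's data starts: the first column's dictionary page,
/// falling back to its first data page. Assumes at least one column.
fn row_group_start(rg: &RowGroupMeta) -> i64 {
    let first = &rg.columns[0];
    first.dictionary_page_offset.unwrap_or(first.data_page_offset)
}

/// Indices of the row groups whose data starts inside `range`.
fn row_groups_in_range(row_groups: &[RowGroupMeta], range: FileRange) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| {
            let offset = row_group_start(rg);
            offset >= range.start && offset < range.end
        })
        .map(|(i, _)| i)
        .collect()
}

/// Same selection keyed on `file_offset`, shown only to illustrate the problem.
fn row_groups_in_range_by_file_offset(
    row_groups: &[RowGroupMeta],
    range: FileRange,
) -> Vec<usize> {
    row_groups
        .iter()
        .enumerate()
        .filter(|(_, rg)| rg.file_offset >= range.start && rg.file_offset < range.end)
        .map(|(i, _)| i)
        .collect()
}

/// Made-up metadata for three row groups written with `file_offset` = 0.
fn sample_row_groups() -> Vec<RowGroupMeta> {
    let rg = |dict: Option<i64>, data: i64| RowGroupMeta {
        file_offset: 0,
        columns: vec![ColumnChunkMeta {
            dictionary_page_offset: dict,
            data_page_offset: data,
        }],
    };
    vec![rg(Some(4), 100), rg(None, 1000), rg(Some(2000), 2100)]
}
```

With the sample metadata and two partitions covering bytes [0, 1500) and [1500, 3000), selecting by page offset spreads the row groups across both partitions, while selecting by file_offset puts all three in the first partition and leaves the second with nothing to scan.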

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb · 2023-04-13T19:15:12Z

I will give this a test on my performance machine

datafusion/core/src/physical_plan/file_format/parquet/row_groups.rs

alamb

I ran this on the reproducer from #5995, and it is indeed much faster now on the DuckDB-generated file.

Thank you @tustvold

+--------------+--------------+----------+-----------------+-------------------+---------------------+--------------------+--------------+----------+-------------+
| l_returnflag | l_linestatus | sum_qty  | sum_base_price  | sum_disc_price    | sum_charge          | avg_qty            | avg_price    | avg_disc | count_order |
+--------------+--------------+----------+-----------------+-------------------+---------------------+--------------------+--------------+----------+-------------+
| A            | F            | 37734107 | 56586554400.73  | 53758257134.8700  | 55909065222.827692  | 25.522005853257337 | 38273.129734 | 0.049985 | 1478493     |
| N            | F            | 991417   | 1487504710.38   | 1413082168.0541   | 1469649223.194375   | 25.516471920522985 | 38284.467760 | 0.050093 | 38854       |
| N            | O            | 74476040 | 111701729697.74 | 106118230307.6056 | 110367043872.497010 | 25.50222676958499  | 38249.117988 | 0.049996 | 2920374     |
| R            | F            | 37719753 | 56568041380.90  | 53741292684.6040  | 55889619119.831932  | 25.50579361269077  | 38250.854626 | 0.050009 | 1478870     |
+--------------+--------------+----------+-----------------+-------------------+---------------------+--------------------+--------------+----------+-------------+
4 rows in set. Query took 0.706 seconds.

Don't use parquet file offset for file range pruning

5fdde54

github-actions bot added the "core" (Core DataFusion crate) label on Apr 13, 2023

This was referenced Apr 13, 2023

datafusion-cli scanning a single large parquet file uses only a single core #5995

Closed

Poor reported performance of DataFusion against DuckDB and Hyper #5942

Closed

alamb reviewed Apr 13, 2023

alamb approved these changes Apr 13, 2023

tustvold and others added 4 commits April 13, 2023 20:27

Update datafusion/core/src/physical_plan/file_format/parquet/row_grou…

f5df27d

…ps.rs Co-authored-by: Andrew Lamb <[email protected]>

Format

b6ced94

Tweak logic

18b81ea

Update test

13082c9

tustvold merged commit 5c025cc into apache:main Apr 13, 2023
