Parquet ColumnIndex for null columns is written even when statistics are disabled #6010

Closed
etseidl opened this issue Jul 5, 2024 · 1 comment · Fixed by #6011
Labels
bug parquet Changes to the parquet crate

Comments

@etseidl
Contributor

etseidl commented Jul 5, 2024

Describe the bug
Inspecting the page index metadata shows that a ColumnIndex is written for columns that contain only nulls, regardless of the setting of GenericColumnWriter::statistics_enabled.

To Reproduce
In the output below, the c_login column has a ColumnIndex entry in addition to the expected OffsetIndex. No other column has a ColumnIndex.

% target/debug/parquet-rewrite -i parquet-testing/data/delta_byte_array.parquet -o test.parquet --statistics-enabled none
% pqmeta -i test.parquet
Rowgroup 0: num_rows:1000
-----------------------------------------------------------

c_customer_id
--------------------------------------------------
OffsetIndex:
  page0: offset:20023 compressed_size:1280 first_row_index:0 var_bytes:16000


c_salutation
--------------------------------------------------
OffsetIndex:
  page0: offset:21404 compressed_size:473 first_row_index:0 var_bytes:3145


c_first_name
--------------------------------------------------
OffsetIndex:
  page0: offset:26992 compressed_size:1325 first_row_index:0 var_bytes:5650


c_last_name
--------------------------------------------------
OffsetIndex:
  page0: offset:35182 compressed_size:1323 first_row_index:0 var_bytes:6011


c_preferred_cust_flag
--------------------------------------------------
OffsetIndex:
  page0: offset:36567 compressed_size:235 first_row_index:0 var_bytes:971


c_birth_country
--------------------------------------------------
OffsetIndex:
  page0: offset:39532 compressed_size:1097 first_row_index:0 var_bytes:8458


c_login
--------------------------------------------------
OffsetIndex:
  page0: offset:40685 compressed_size:26 first_row_index:0 var_bytes:0

ColumnIndex: boundary_order:ASCENDING
  page0: null_page:True min_val: max_val: null_count:1000

c_email_address
--------------------------------------------------
OffsetIndex:
  page0: offset:71199 compressed_size:1334 first_row_index:0 var_bytes:26562


c_last_review_date
--------------------------------------------------
OffsetIndex:
  page0: offset:76368 compressed_size:1200 first_row_index:0 var_bytes:6825
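
For longer dumps, the check can be automated. The following is a minimal sketch, not part of pqmeta or the parquet crate: `columns_with_column_index` is a hypothetical helper, and it assumes the text layout shown in the dump above (a column name on its own line, followed by a line of dashes, then the index entries).

```rust
/// Returns the names of columns whose section contains a `ColumnIndex:` line.
///
/// Assumption: the dump follows the layout shown above, i.e. each column name
/// sits on its own line and is immediately followed by a line of dashes.
/// Header lines such as `Rowgroup 0: num_rows:1000` contain a colon and are
/// therefore not treated as column names.
pub fn columns_with_column_index(dump: &str) -> Vec<String> {
    let lines: Vec<&str> = dump.lines().collect();
    let mut current: Option<String> = None;
    let mut found: Vec<String> = Vec::new();

    for (i, line) in lines.iter().enumerate() {
        let trimmed = line.trim();

        // A "rule" is a line made up only of dashes (at least three).
        let next_is_rule = lines
            .get(i + 1)
            .map(|l| {
                let t = l.trim();
                t.len() >= 3 && t.chars().all(|c| c == '-')
            })
            .unwrap_or(false);

        if !trimmed.is_empty() && next_is_rule && !trimmed.contains(':') {
            // Start of a new column section.
            current = Some(trimmed.to_string());
        } else if trimmed.starts_with("ColumnIndex:") {
            // Record the column that owns this ColumnIndex entry, once.
            if let Some(name) = &current {
                if !found.contains(name) {
                    found.push(name.clone());
                }
            }
        }
    }
    found
}
```

Fed the dump above, this helper should report only c_login; with statistics disabled the list would ideally be empty.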

Expected behavior
Null columns should not have a ColumnIndex present when page statistics are not enabled.
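
The expected rule can be stated as a small predicate. This is only a sketch of the intended behavior, not the writer's actual code: `StatsLevel` is a local stand-in for the crate's statistics setting (none, chunk, page), and `column_index_expected` and `check_column` are hypothetical names used for illustration.

```rust
/// Local stand-in for the writer's statistics setting, defined here so the
/// sketch compiles on its own. Not the parquet crate's type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StatsLevel {
    None,
    Chunk,
    Page,
}

/// Expected behavior from this report: a ColumnIndex should be present only
/// when page-level statistics are enabled, whether or not the column is all
/// nulls.
pub fn column_index_expected(level: StatsLevel) -> bool {
    matches!(level, StatsLevel::Page)
}

/// Checks one column's observed metadata against the expected rule.
/// Returns an error message if a ColumnIndex is present but should not be.
pub fn check_column(
    name: &str,
    level: StatsLevel,
    has_column_index: bool,
) -> Result<(), String> {
    if has_column_index && !column_index_expected(level) {
        Err(format!(
            "column `{}` has a ColumnIndex although the statistics level is {:?}",
            name, level
        ))
    } else {
        Ok(())
    }
}
```

Applied to the output above, `check_column("c_login", StatsLevel::None, true)` returns an error, while the other columns (no ColumnIndex, statistics disabled) pass.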

Additional context

@alamb
Contributor

alamb commented Jul 24, 2024

label_issue.py automatically added labels {'parquet'} from #6011
