feat: improve string statistics display in datafusion-cli parquet_metadata function #8535

Merged (1 commit), Dec 14, 2023

asimsedhain

Which issue does this PR close?

Closes #8464

Rationale for this change

What changes are included in this PR?

Output for the data_index_bloom_encoding_stats.parquet file:
DataFusion: [Screenshot 2023-12-13 at 10 00 09 PM]
DuckDB: [Screenshot 2023-12-13 at 10 00 36 PM]

Are these changes tested?

Yes

Are there any user-facing changes?

Note

One thing I noticed while testing: for the parquet-testing/data/hadoop_lz4_compressed.parquet file, the output was still a byte array.
[Screenshot 2023-12-13 at 10 17 50 PM]

I checked, and the converted type is None for that column, so I'm not sure that blindly converting the byte array into a UTF-8 string would be the right approach. Open to suggestions.
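
To make the trade-off concrete, here is a minimal sketch of a conservative conversion rule. It is an illustration only, not the code in this PR: the helper name `format_byte_array_stat` and the `is_utf8_annotated` flag (standing in for "the column's converted type marks it as UTF-8") are hypothetical. The bytes are shown as text only when the column is annotated as a string and the bytes are valid UTF-8; otherwise the raw byte array is kept.

```rust
/// Hypothetical helper (not taken from the PR): format the raw bytes of a
/// BYTE_ARRAY min/max statistic for display.
///
/// `is_utf8_annotated` is an assumed flag standing in for "the column's
/// converted type says the data is a UTF-8 string".
fn format_byte_array_stat(bytes: &[u8], is_utf8_annotated: bool) -> String {
    if is_utf8_annotated {
        // Only trust the annotation if the bytes really are valid UTF-8.
        if let Ok(text) = std::str::from_utf8(bytes) {
            return text.to_string();
        }
    }
    // No annotation, or invalid UTF-8: show the raw bytes instead of guessing.
    format!("{:?}", bytes)
}
```

Under this rule, a column with a converted type of None keeps its byte-array display, which matches what the hadoop_lz4_compressed.parquet file shows above. Relaxing the rule to try UTF-8 decoding on every BYTE_ARRAY column would make more files readable, at the risk of printing binary data that merely happens to be valid UTF-8.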

alamb changed the title from "feat: improve string statistics display" to "feat: improve string statistics display in datafusion-cli parquet_metadata function" on Dec 14, 2023

alamb left a comment

Thank you for the contribution @asimsedhain -- this looks great. I kicked off the CI run, and if it passes I plan to merge this PR.

I tried it out locally and it looks pretty sweet

select row_group_id, row_group_num_rows, path_in_schema, type, stats_min, stats_max, stats_null_count from parquet_metadata('/Users/andrewlamb/Software/arrow-datafusion/parquet-testing/data/alltypes_tiny_pages.parquet');
+--------------+--------------------+-------------------+------------+-----------+-------------------+------------------+
| row_group_id | row_group_num_rows | path_in_schema    | type       | stats_min | stats_max         | stats_null_count |
+--------------+--------------------+-------------------+------------+-----------+-------------------+------------------+
| 0            | 7300               | "id"              | INT32      | 0         | 7299              | 0                |
| 0            | 7300               | "bool_col"        | BOOLEAN    | false     | true              | 0                |
| 0            | 7300               | "tinyint_col"     | INT32      | 0         | 9                 | 0                |
| 0            | 7300               | "smallint_col"    | INT32      | 0         | 9                 | 0                |
| 0            | 7300               | "int_col"         | INT32      | 0         | 9                 | 0                |
| 0            | 7300               | "bigint_col"      | INT64      | 0         | 90                | 0                |
| 0            | 7300               | "float_col"       | FLOAT      | 0         | 9.9               | 0                |
| 0            | 7300               | "double_col"      | DOUBLE     | 0         | 90.89999999999999 | 0                |
| 0            | 7300               | "date_string_col" | BYTE_ARRAY | 01/01/09  | 12/31/10          | 0                |
| 0            | 7300               | "string_col"      | BYTE_ARRAY | 0         | 9                 | 0                |
| 0            | 7300               | "timestamp_col"   | INT96      |           |                   | 0                |
| 0            | 7300               | "year"            | INT32      | 2009      | 2010              | 0                |
| 0            | 7300               | "month"           | INT32      | 1         | 12                | 0                |
+--------------+--------------------+-------------------+------------+-----------+-------------------+------------------+
13 rows in set. Query took 0.012 seconds.

cc @Veeupup

alamb merged commit 1042095 into apache:main on Dec 14, 2023
23 checks passed

alamb commented Dec 14, 2023

Thanks again @asimsedhain

Veeupup commented Dec 18, 2023

Thanks @asimsedhain! Good job!

Successfully merging this pull request may close these issues.

Improve string statistics display in datafusion-cli parquet_metadata