-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support parquet_metadata for datafusion-cli #8413
Conversation
Signed-off-by: veeupup <[email protected]>
@alamb PTAL : ) how do you like the display format? |
Thank you @Veeupup -- I will check this out later today! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool feature!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Veeupup and @simonvandel -- this is a great start.
I would like to suggest that we follow DuckDB for parquet_metadata
and not reinvent the wheel / formatting here.
That that means one output row per file, per row group
Each row has 23 columns
Here is an example of what duckdb
s output looks like:
D select * from read_parquet('./parquet-testing/data/int64_decimal.parquet');
┌───────────────┐
│ value │
│ decimal(10,2) │
├───────────────┤
│ 1.00 │
│ 2.00 │
│ 3.00 │
│ 4.00 │
│ 5.00 │
│ 6.00 │
│ 7.00 │
│ 8.00 │
│ 9.00 │
│ 10.00 │
│ 11.00 │
│ 12.00 │
│ 13.00 │
│ 14.00 │
│ 15.00 │
│ 16.00 │
│ 17.00 │
│ 18.00 │
│ 19.00 │
│ 20.00 │
│ 21.00 │
│ 22.00 │
│ 23.00 │
│ 24.00 │
├───────────────┤
│ 24 rows │
└───────────────┘
D .mode markdown
D select * from parquet_metadata('./parquet-testing/data/int64_decimal.parquet');
file_name | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type | stats_min | stats_max | stats_null_count | stats_distinct_count | stats_min_value | stats_max_value | compression | encodings | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
./parquet-testing/data/int64_decimal.parquet | 0 | 24 | 1 | 241 | 0 | 24 | value | INT64 | 1.00 | 24.00 | 0 | UNCOMPRESSED | BIT_PACKED, RLE, PLAIN | 4 | 241 | 241 | ||||||
D |
Here is the schema
D describe select * from parquet_metadata('./parquet-testing/data/int64_decimal.parquet');
| column_name | column_type | null | key | default | extra |
|-------------------------|-------------|------|-----|---------|-------|
| file_name | VARCHAR | YES | | | |
| row_group_id | BIGINT | YES | | | |
| row_group_num_rows | BIGINT | YES | | | |
| row_group_num_columns | BIGINT | YES | | | |
| row_group_bytes | BIGINT | YES | | | |
| column_id | BIGINT | YES | | | |
| file_offset | BIGINT | YES | | | |
| num_values | BIGINT | YES | | | |
| path_in_schema | VARCHAR | YES | | | |
| type | VARCHAR | YES | | | |
| stats_min | VARCHAR | YES | | | |
| stats_max | VARCHAR | YES | | | |
| stats_null_count | BIGINT | YES | | | |
| stats_distinct_count | BIGINT | YES | | | |
| stats_min_value | VARCHAR | YES | | | |
| stats_max_value | VARCHAR | YES | | | |
| compression | VARCHAR | YES | | | |
| encodings | VARCHAR | YES | | | |
| index_page_offset | BIGINT | YES | | | |
| dictionary_page_offset | BIGINT | YES | | | |
| data_page_offset | BIGINT | YES | | | |
| total_compressed_size | BIGINT | YES | | | |
| total_uncompressed_size | BIGINT | YES | | | |
datafusion-cli/src/functions.rs
Outdated
.iter() | ||
.map(|rg| { | ||
format!( | ||
"num_columns: {}, num_rows: {}, total_byte_size: {}, sorting_columns: {:?}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
every row group should have the same number of columns, for what it is worth
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb @simonvandel Hi! I made The above is from duckdb, and the below is from datafusion. You can click on the picture to see more clearly! |
Signed-off-by: veeupup <[email protected]>
dfbd127
to
5516189
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this out and it is quite neat. Thank you @Veeupup . This and all your previous PRs have been a joy to review.
I think this could be merged as is and we could make improvements as a follow on, as it is new functionality. However, it would also be neat to fix them prior to merge
I left suggestions about what needed to be fixed
Request 1: Can we make string literal arguments work? Something like
❯ select * from parquet_metadata('/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet');
Error during planning: parquet_metadata requires string argument as its input
Request 2 Do not panic when statistics are not set
❯ select * from parquet_metadata("/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/tests/data/clickbench_hits_10.parquet");
thread 'main' panicked at /Users/andrewlamb/.cargo/registry/src/index.crates.io-6f17d22bba15001f/parquet-49.0.0/src/file/statistics.rs:474:27:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I highlighted the problem I think in comments
Request 3: Add some tests
It would be really nice if you could add some regression tests to make sure we don't break this in the future. For example, could we read and verify the data that is read from the two files (one that has min/max stats set and one that does not?)
Thanks again
@@ -185,6 +186,8 @@ pub async fn main() -> Result<()> { | |||
ctx.state().catalog_list(), | |||
ctx.state_weak_ref(), | |||
))); | |||
// register `parquet_metadata` table function to get metadata from parquet files | |||
ctx.register_udtf("parquet_metadata", Arc::new(ParquetMetadataFunc {})); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥳
d237d66
to
5f7d8c3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it -- thank you @Veeupup -- really nice work
This feature was so cool I started using it, and I found one suggestion for improvement already: #8464 |
* Support parquet_metadata for datafusion-cli Signed-off-by: veeupup <[email protected]> * make tomlfmt happy * display like duckdb Signed-off-by: veeupup <[email protected]> * add test & fix single quote --------- Signed-off-by: veeupup <[email protected]>
Which issue does this PR close?
Closes #8367
Function behaves just like duckdb.
The above is from duckdb, and the below is from datafusion.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?