Improve string statistics display in datafusion-cli parquet_metadata #8464

Closed
alamb opened this issue Dec 7, 2023 · 1 comment · Fixed by #8535
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

alamb commented Dec 7, 2023

Is your feature request related to a problem or challenge?

@Veeupup implemented the great parquet_metadata feature in #8413 ❤️

While trying it out, however, I noticed that string statistics are hard to interpret because they are displayed as raw byte arrays, for example [72, 101, 108, 108, 111].

For example:

andrewlamb@Andrews-MBP:~/Software/arrow-datafusion$ datafusion-cli -c "select * from parquet_metadata('parquet-testing/data/data_index_bloom_encoding_stats.parquet')";
DataFusion CLI v33.0.0
+--------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+--------------------------+--------------------------+------------------+----------------------+--------------------------+--------------------------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
| filename                                                     | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type       | stats_min                | stats_max                | stats_null_count | stats_distinct_count | stats_min_value          | stats_max_value          | compression        | encodings                | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size |
+--------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+--------------------------+--------------------------+------------------+----------------------+--------------------------+--------------------------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
| parquet-testing/data/data_index_bloom_encoding_stats.parquet | 0            | 14                 | 1                     | 163             | 0         | 4           | 14         | "String"       | BYTE_ARRAY | [72, 101, 108, 108, 111] | [116, 111, 100, 97, 121] | 0                |                      | [72, 101, 108, 108, 111] | [116, 111, 100, 97, 121] | GZIP(GzipLevel(6)) | [BIT_PACKED, RLE, PLAIN] |                   |                        | 4                | 152                   | 163                     |
+--------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+--------------------------+--------------------------+------------------+----------------------+--------------------------+--------------------------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
1 row in set. Query took 0.024 seconds.

Describe the solution you'd like

It would be nice if the statistics for string columns were formatted as actual strings, as duckdb does (showing [72, 101, 108, 108, 111] as Hello):

andrewlamb@Andrews-MBP:~/Software/arrow-datafusion$ duckdb -c "select * from parquet_metadata('parquet-testing/data/data_index_bloom_encoding_stats.parquet')";
┌──────────────────────┬──────────────┬────────────────────┬──────────────────────┬─────────────────┬───────────┬─────────────┬────────────┬────────────────┬───┬─────────────────┬─────────────────┬─────────────┬──────────────────────┬───────────────────┬──────────────────────┬──────────────────┬──────────────────────┬──────────────────────┐
│      file_name       │ row_group_id │ row_group_num_rows │ row_group_num_colu…  │ row_group_bytes │ column_id │ file_offset │ num_values │ path_in_schema │ … │ stats_min_value │ stats_max_value │ compression │      encodings       │ index_page_offset │ dictionary_page_of…  │ data_page_offset │ total_compressed_s…  │ total_uncompressed…  │
│       varchar        │    int64     │       int64        │        int64         │      int64      │   int64   │    int64    │   int64    │    varchar     │   │     varchar     │     varchar     │   varchar   │       varchar        │       int64       │        int64         │      int64       │        int64         │        int64         │
├──────────────────────┼──────────────┼────────────────────┼──────────────────────┼─────────────────┼───────────┼─────────────┼────────────┼────────────────┼───┼─────────────────┼─────────────────┼─────────────┼──────────────────────┼───────────────────┼──────────────────────┼──────────────────┼──────────────────────┼──────────────────────┤
│ parquet-testing/da…  │            0 │                 14 │                    1 │             163 │         0 │           4 │         14 │ String         │ … │ Hello           │ today           │ GZIP        │ BIT_PACKED, RLE, P…  │                   │                      │                4 │                  152 │                  163 │
├──────────────────────┴──────────────┴────────────────────┴──────────────────────┴─────────────────┴───────────┴─────────────┴────────────┴────────────────┴───┴─────────────────┴─────────────────┴─────────────┴──────────────────────┴───────────────────┴──────────────────────┴──────────────────┴──────────────────────┴──────────────────────┤
│ 1 rows                                                                                                                                                                                                                                                                                                                       23 columns (18 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
andrewlamb@Andrews-MBP:~/Software/arrow-datafusion$
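
For anyone picking this up, the requested conversion could look roughly like the sketch below. It is only an illustration of the idea, not the actual datafusion-cli code: `format_byte_array_stat` is a hypothetical helper name, and falling back to the current byte-list display for non-UTF-8 data is an assumption about the desired behavior.

```rust
/// Hypothetical helper (not part of datafusion-cli): render the raw bytes of a
/// BYTE_ARRAY statistic for display.
///
/// Assumption: valid UTF-8 should be shown as text, and anything else should
/// keep the existing byte-list formatting so that no information is lost.
fn format_byte_array_stat(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        // Valid UTF-8: show the text itself, e.g. "Hello"
        Ok(s) => s.to_string(),
        // Not valid UTF-8: fall back to the Debug form, e.g. "[255, 254]"
        Err(_) => format!("{:?}", bytes),
    }
}
```

With the values from the example file, `format_byte_array_stat(&[72, 101, 108, 108, 111])` gives `Hello` and `format_byte_array_stat(&[116, 111, 100, 97, 121])` gives `today`, matching the duckdb output above.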

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement label Dec 7, 2023

alamb commented Dec 7, 2023

I think this would be a good first issue: there is already an existing test framework and the code is in place; it just needs to be updated to improve the display.
