Add parquet-layout binary #3269
I will review this tomorrow -- the other one I know of is https://github.com/manojkarthick/pqrs which shows promise
// specific language governing permissions and limitations
// under the License.

//! Binary that prints the physical layout of a parquet file
👍
As a general musing, I think some of these cli helpers are quite nice and maybe eventually we could make them more discoverable / nicer to people who are not working with the parquet source code.
I didn't find anything other than https://github.com/apache/arrow-rs/tree/master/parquet/src/bin for documentation
Perhaps to pqrs or something similar 🤔
let end = start + column.compressed_size() as u64;
while start != end {
    let (header_len, header) = read_page_header(reader, start)?;
this is neat
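For readers following along, the page-walking loop in the snippet above can be illustrated with a minimal sketch. It assumes a toy format in which each page header is just a 4-byte little-endian payload length; real Parquet page headers are Thrift-encoded, so `read_toy_header`, `PageSpan` and `walk_pages` are hypothetical names for illustration and not part of the parquet crate or this PR.

```rust
/// Location of one page inside a column chunk.
/// Toy model for illustration only, not a type from the parquet crate.
#[derive(Debug, PartialEq)]
struct PageSpan {
    offset: u64,
    header_len: u64,
    payload_len: u64,
}

/// Hypothetical stand-in for `read_page_header`: the "header" here is a
/// 4-byte little-endian payload length (real headers are Thrift-encoded).
/// Returns (header length, payload length).
fn read_toy_header(buf: &[u8], start: u64) -> Result<(u64, u64), String> {
    let s = start as usize;
    let bytes = buf
        .get(s..s + 4)
        .ok_or_else(|| format!("truncated header at offset {start}"))?;
    let payload_len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as u64;
    Ok((4, payload_len))
}

/// Walk every page in the byte range [start, start + compressed_size).
fn walk_pages(buf: &[u8], mut start: u64, compressed_size: u64) -> Result<Vec<PageSpan>, String> {
    let end = start + compressed_size;
    let mut pages = Vec::new();
    while start < end {
        let (header_len, payload_len) = read_toy_header(buf, start)?;
        pages.push(PageSpan { offset: start, header_len, payload_len });
        // The next header begins right after this page's header and payload.
        start += header_len + payload_len;
    }
    // A corrupt length could step past `end`; report it instead of reading on.
    if start != end {
        return Err(format!("pages overran the chunk: stopped at {start}, expected {end}"));
    }
    Ok(pages)
}
```

The sketch loops on `start < end` and checks for equality afterwards, which is one way to turn an inconsistent length into an error; whether the binary needs that guard depends on how much it trusts the file metadata.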
    num_values: data_page.num_values,
})
} else if let Some(data_page) = header.data_page_header_v2 {
    let is_compressed = data_page.is_compressed.unwrap_or(true);
is it really compressed by default? I expected unwrap_or(false)
If the column is compressed, and the header doesn't specify, the default is compressed. I think this was a later extension to make it optional
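The rule described in this reply can be written down as a small sketch. `page_is_compressed` is a hypothetical helper, not a function from the parquet crate or this PR; it assumes the caller already knows whether the column chunk uses a compression codec and passes the optional `is_compressed` flag from the V2 data page header.

```rust
/// Hypothetical helper for illustration, not part of the parquet crate.
///
/// `column_has_codec`: whether the column chunk declares a compression codec.
/// `header_flag`: the optional `is_compressed` field of a V2 data page header.
fn page_is_compressed(column_has_codec: bool, header_flag: Option<bool>) -> bool {
    // A missing flag is treated as "compressed", matching `unwrap_or(true)`.
    column_has_codec && header_flag.unwrap_or(true)
}
```

Under these assumptions, a page with no flag in a compressed column counts as compressed, an explicit `Some(false)` opts that page out, and a column without a codec is never treated as compressed.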
Benchmark runs are scheduled for baseline = b155461 and contender = 94d597e. 94d597e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #.
Rationale for this change
Frequently, when debugging an issue, the first port of call is working out the physical layout of the data in the parquet file: what indexes are present, what encodings are being used, how large the pages are, etc.
I have been unable to find such a tool, so I quickly wrote one up to replace the ad-hoc code I keep having to write 😅
parquet-testing/data/nested_lists.snappy.parquet
parquet-testing/data/data_index_bloom_encoding_stats.parquet
parquet-testing/data/alltypes_dictionary.parquet
What changes are included in this PR?
Are there any user-facing changes?