Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose SortingColumn in parquet files #3103

Merged
merged 9 commits into from
Nov 15, 2022
Merged

Expose SortingColumn in parquet files #3103

merged 9 commits into from
Nov 15, 2022

Conversation

askoa
Copy link
Contributor

@askoa askoa commented Nov 13, 2022

Which issue does this PR close?

Closes #3090

Reading from file:

The function footer.rs#decode_metadata read sorting_column from file. However the function RowGroupMetaData::from_thrift was not reading the field into RowGroupMetaData. The function was modified to read the sorting_column into RowGroupMetaData

let t_file_metadata: TFileMetaData = TFileMetaData::read_from_in_protocol(&mut prot)
.map_err(|e| ParquetError::General(format!("Could not parse metadata: {}", e)))?;
let schema = types::from_thrift(&t_file_metadata.schema)?;
let schema_descr = Arc::new(SchemaDescriptor::new(schema));
let mut row_groups = Vec::new();
for rg in t_file_metadata.row_groups {
row_groups.push(RowGroupMetaData::from_thrift(schema_descr.clone(), rg)?);
}
let column_orders = parse_column_orders(t_file_metadata.column_orders, &schema_descr);

Writing into file:

The function format.rs#RowGroup#write_to_out_protocol writes sorting_column to file. However the function metadata.rs#RowGroupMetaData#to_thrift was not writing the field to RowGroup. The function was modified to write sorting_column from RowGroupMetaData to RowGroup

arrow-rs/parquet/src/format.rs

Lines 4155 to 4162 in 430eb84

if let Some(ref fld_var) = self.sorting_columns {
o_prot.write_field_begin(&TFieldIdentifier::new("sorting_columns", TType::List, 4))?;
o_prot.write_list_begin(&TListIdentifier::new(TType::Struct, fld_var.len() as i32))?;
for e in fld_var {
e.write_to_out_protocol(o_prot)?;
}
o_prot.write_list_end()?;
o_prot.write_field_end()?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Nov 13, 2022
@askoa askoa marked this pull request as ready for review November 13, 2022 23:57
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get an end-to-end test of this, i.e. write a parquet file to a Vec<u8> then read it back and verify the sort column was round-tripped correctly

parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
Comment on lines 383 to 368
value: Option<Vec<SortingColumnMetaData>>,
) -> Self {
self.sorting_columns = value;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
value: Option<Vec<SortingColumnMetaData>>,
) -> Self {
self.sorting_columns = value;
value: Vec<SortingColumnMetaData>,
) -> Self {
self.sorting_columns = Some(value);

This is consistent with set_page_offset

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a reverse review before. The reason given by the reviewer was that if we remove Option from signature then the function cannot be used to set None for this field. I am going to keep this as is.

parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
@askoa askoa marked this pull request as draft November 14, 2022 13:57
@askoa askoa marked this pull request as ready for review November 15, 2022 01:26
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting close 😄

parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/tests/parquet_round_trip.rs Outdated Show resolved Hide resolved
parquet/tests/parquet_round_trip.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/tests/parquet_round_trip.rs Outdated Show resolved Hide resolved
parquet/tests/parquet_round_trip.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
parquet/src/file/metadata.rs Outdated Show resolved Hide resolved
@tustvold tustvold mentioned this pull request Nov 15, 2022
@tustvold tustvold merged commit 371ec57 into apache:master Nov 15, 2022
@tustvold
Copy link
Contributor

Thank you 👍

@ursabot
Copy link

ursabot commented Nov 15, 2022

Benchmark runs are scheduled for baseline = 8bb2917 and contender = 371ec57. 371ec57 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

jimexist pushed a commit that referenced this pull request Nov 16, 2022
* Expose SortColumn from parquet file

* fix formatting issues

* empty commit

* fix PR comments

* formatting fix

* add parquet round trip test

* fix clippy error

* update the test based on PR comment

Co-authored-by: askoa <askoa@local>
@askoa askoa deleted the sorting-column branch December 1, 2022 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose SortingColumn when reading and writing parquet metadata
3 participants