Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally coerce names of maps and lists to match Parquet specification #6828

Merged
merged 5 commits into from
Dec 5, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Dec 3, 2024

Which issue does this PR close?

Closes #6213.
Closes #6733.

Rationale for this change

Allow for forcing use of compliant names for Parquet LIST and MAP fields.

What changes are included in this PR?

Uses the coerce_types boolean property added in #6313 to control naming of elements of LIST and MAP structures. Also adds a coerce_types flag to parquet-rewrite.

Are there any user-facing changes?

No API changes, but change to behavior when coerce_types is true.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Dec 3, 2024
@etseidl etseidl changed the title Pptionally coerce names of maps and lists to match Parquet specification Optionally coerce names of maps and lists to match Parquet specification Dec 3, 2024
@etseidl
Copy link
Contributor Author

etseidl commented Dec 3, 2024

Note this would supersede #6814.

/// Writers have the option to coerce these into native Parquet types. Type
/// coercion allows for meaningful representations that do not require
/// downstream readers to consider the embedded Arrow schema. However, type
/// Also, for [`List`] and [`Map`] types, Parquet expects certain schema elements
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should most of this verbiage move to set_coerce_types on the builder?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or set_coerce_types could link to here

Copy link
Contributor Author

@etseidl etseidl Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or set_coerce_types could link to here

I already made that change. My concern is that most of the deep documentation is currently on the setters rather than the getters...this one is an outlier.

Edit: I went ahead and moved the bulk of the documentation to the builder. I think more eyes will find it there.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a test using ArrowWriter to check everything is hooked up correctly might be valuable

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me -- thanks @etseidl -- I'll rebase #6840

/// This type coercion allows for meaningful representations that do not require
/// downstream readers to consider the embedded Arrow schema, and can allow for greater
/// compatibility with other Parquet implementations. However, type
/// coercion also prevents the data from being losslessly round-tripped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this wording too

@alamb
Copy link
Contributor

alamb commented Dec 5, 2024

I just restarted the failed jobs and will try and merge this PR in when complete

@alamb alamb merged commit 30c14ab into apache:main Dec 5, 2024
16 checks passed
@alamb
Copy link
Contributor

alamb commented Dec 5, 2024

Thanks @etseidl

@etseidl etseidl deleted the coerce_names branch December 6, 2024 01:04
@ggreco
Copy link

ggreco commented Dec 6, 2024

Thanks @etseidl for your work!!!

This is gonna be published in the next minor release or it's considered a major change and will be out in v54?

Given the fact it's an optional behaviour (we need to call set_coerce_types before writing the parquet file) and the default is the original behaviour can we expect this to be a minor release?

@alamb
Copy link
Contributor

alamb commented Dec 6, 2024

This is gonna be published in the next minor release or it's considered a major change and will be out in v54?
Given the fact it's an optional behaviour (we need to call set_coerce_types before writing the parquet file) and the default is the original behaviour can we expect this to be a minor release?

The next release is scheduled to be a major one (54) so I expect this to be out in 54 and thus I haven't thought carefully if it would be classified as major/minor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Option To Coerce List Type on Parquet Write Add Option To Coerce Map Type on Parquet Write
4 participants