-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ArrowToParquetSchemaConverter
, deprecate arrow_to_parquet_schema
#6840
Conversation
b0d96be
to
07a754c
Compare
arrow_to_parquet_schema_with_root(schema, "arrow_schema", coerce_types) | ||
} | ||
#[deprecated(since = "54.0.0", note = "Use `ArrowToParquetSchemaConverter` instead")] | ||
pub fn arrow_to_parquet_schema(schema: &Schema) -> Result<SchemaDescriptor> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
/// Convert arrow schema to parquet schema specifying the name of the root schema element | ||
pub fn arrow_to_parquet_schema_with_root( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out this function is not actually exported (it is pub in this module, but not pub exported):
https://docs.rs/parquet/latest/parquet/?search=arrow_to_parquet_schema_with_root
Returns no results
The compiler told me it was unused once I switched everything over to use ArrowToParquetSchemaConverter
parquet/src/file/properties.rs
Outdated
/// | ||
/// Setting this option to `true` will result in parquet files that can be | ||
/// read by more readers, but may lose precision for arrow types such as | ||
/// [`DataType::Date64`] which have no direct corresponding Parquet type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also felt it would help if we described more explicitly what the impact of enabling this option was (and left a link to the longer backstory)
07a754c
to
d1c78c2
Compare
d1c78c2
to
29d2509
Compare
29d2509
to
97d8991
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big improvement, thanks!
parquet/src/arrow/schema/mod.rs
Outdated
@@ -225,27 +225,130 @@ pub(crate) fn add_encoded_arrow_schema_to_metadata(schema: &Schema, props: &mut | |||
} | |||
} | |||
|
|||
/// Convert arrow schema to parquet schema | |||
/// Converter for arrow schema to parquet schema |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docs are pretty inconsistent on whether "arrow" and "parquet" are capitalized or not. Do you have a preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should strive for always capitalizing them Arrow
and Parquet
-- i will fix
parquet/src/file/properties.rs
Outdated
/// Setting this option to `true` will result in parquet files that can be | ||
/// read by more readers, but may lose precision for arrow types such as | ||
/// [`DataType::Date64`] which have no direct corresponding Parquet type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should still mention the naming impacts here. There seem to be a number of users who want Parquet naming, and I think this is the first place they'd look to figure out how to do so. I'm not sure I'd think to follow the link to the converter when I want my lists named properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - I pushed e7b7d20 to try and clarify -- let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
parquet/src/arrow/schema/mod.rs
Outdated
let converted_arrow_schema = arrow_to_parquet_schema(&arrow_schema, true).unwrap(); | ||
let converted_arrow_schema = ArrowToParquetSchemaConverter::new(&arrow_schema) | ||
.with_coerce_types(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love how this makes what's going on more explicit.
Co-authored-by: Ed Seidl <[email protected]>
Co-authored-by: Ed Seidl <[email protected]>
…to alamb/parquet_coverter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review @etseidl
parquet/src/file/properties.rs
Outdated
/// Setting this option to `true` will result in parquet files that can be | ||
/// read by more readers, but may lose precision for arrow types such as | ||
/// [`DataType::Date64`] which have no direct corresponding Parquet type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - I pushed e7b7d20 to try and clarify -- let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the doc CI failure to fix. Otherwise looks ready to go (mod capitalization which can be a follow on 😉). Thanks again @alamb.
Co-authored-by: Ed Seidl <[email protected]>
…to alamb/parquet_coverter
Thank you 🙏 I fixed a few more capitalization inconsistencies and added a link to the date section of the parquet spec too |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did wonder if this should instead be an options struct that gets passed to a method (or has a member function that accepts the actual schema). This would allow using the same config for multiple files, and is perhaps more consistent with other APIs?
Co-authored-by: Raphael Taylor-Davies <[email protected]>
…to alamb/parquet_coverter
This reverts commit bd4e2d5.
I agree, as there is no state held in the converter there is no reason for it to be a consuming builder. I made this change in 0ab4eb4 |
Which issue does this PR close?
Rationale for this change
While working on the upgrade to arrow 54 in DataFusion apache/datafusion#13663 I found it was not clear what the
coerce_types
argument inarrow_to_parquet_schema
meant or what value to pick to keep the existing behavior.I had to go digging on #6313 to find what the default / old behavior was (
true
orfalse
)Turns out the answer is
It is clearly documented on
WriterProperties
but not on the lower level apisThus I would like to propose making a new API for converting schemas where the defaults are clearer and deprecate
arrow_to_parquet_schema
(and revert the unnecessary breaking API change)What changes are included in this PR?
ArrowToParquetSchemaConverter
,arrow_to_parquet_schema
made in feat(parquet)!: coerce_types flag for date64 #6313arrow_to_parquet_schema
with a clear message about what to use instreadAre there any user-facing changes?
A new API and a deprecation of an existing one