Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support writing IntervalMonthDayNanoArray to parquet via Arrow Writer #5849

Closed
marvinlanhenke opened this issue Jun 6, 2024 · 5 comments · Fixed by #5875
Closed

Support writing IntervalMonthDayNanoArray to parquet via Arrow Writer #5849

marvinlanhenke opened this issue Jun 6, 2024 · 5 comments · Fixed by #5875
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@marvinlanhenke
Copy link

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

While working on the support for converting parquet statistics into ArrayRefs in DataFusion (see apache/datafusion#10453).
I noticed that currently the ColumnWriter does not support writing IntervalUnit::MonthDayNano.

This might be the location:

ColumnWriter::FixedLenByteArrayColumnWriter(ref mut typed) => {
let bytes = match column.data_type() {
ArrowDataType::Interval(interval_unit) => match interval_unit {
IntervalUnit::YearMonth => {
let array = column
.as_any()
.downcast_ref::<arrow_array::IntervalYearMonthArray>()
.unwrap();
get_interval_ym_array_slice(array, indices)
}
IntervalUnit::DayTime => {
let array = column
.as_any()
.downcast_ref::<arrow_array::IntervalDayTimeArray>()
.unwrap();
get_interval_dt_array_slice(array, indices)
}
_ => {
return Err(ParquetError::NYI(
format!(
"Attempting to write an Arrow interval type {interval_unit:?} to parquet that is not yet implemented"

Describe the solution you'd like

Support for writing IntervalUnit::MonthDayNano in the ColumnWriter.

Describe alternatives you've considered

Additional context

Related to: #5847

@marvinlanhenke marvinlanhenke added the enhancement Any new improvement worthy of a entry in the changelog label Jun 6, 2024
@alamb alamb changed the title Support for IntervalUnit::MonthDayNano in FixedLenByteArrayColumnWriter Support writing IntervalMonthDayNano arrays to parquet via Arrow Writer Jun 6, 2024
@alamb alamb changed the title Support writing IntervalMonthDayNano arrays to parquet via Arrow Writer Support writing IntervalMonthDayNanoArray to parquet via Arrow Writer Jun 6, 2024
@alamb alamb added the parquet Changes to the parquet crate label Jun 6, 2024
@alamb
Copy link
Contributor

alamb commented Jun 6, 2024

i updated the title of this ticket to reflect the end behavior I think it is addressing

Specifically, I think the gap identified by @marvinlanhenke above is that trying to write an IntervalMonthDayNano array to parquet via https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowWriter.html will not work

TO proceed with this ticket the first thing would probably be to make a small test case to verify that IntervalMonthDayNanoArray can not be written to parquet

@tustvold
Copy link
Contributor

I don't believe the parquet specification allows for supporting nanosecond intervals - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval

I am therefore not sure this ticket is actionable...

@alamb
Copy link
Contributor

alamb commented Jun 11, 2024

I wonder what should the guidance be for people who have IntervalMonthDayNano arrays and want to write the data to Parquet 🤔

Is it "cast the data to an interval type that is supported (IntervalMonthDay)? If so I can add a note to the docs

Another potential option would be "write this type in as a FIXED_LENGTH_BYTE_ARRAY or something (with no parquet logical type) -- which would permit round tripping data written by parquet-rs back to ArrayRef but would not be readable by any other implementation

I dug around in arrow and found some suggestions jave doesn't support it either
https://github.com/apache/arrow/blob/65974672a356f34889ed7b9bfb8b76230c27c7ee/java/dataset/src/test/java/org/apache/arrow/dataset/TestAllTypes.java#L94-L96

@tustvold
Copy link
Contributor

cast the data to an interval type that is supported

Documenting this I think this is the least potentially controversial path forward

@alamb
Copy link
Contributor

alamb commented Jun 12, 2024

Proposed documentation update: #5875

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
3 participants