-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: extract floats for parquet data pages #10972
Conversation
@@ -507,9 +507,84 @@ async fn test_multiple_data_pages_nulls_and_negatives() { | |||
|
|||
/////////////// MORE GENERAL TESTS ////////////////////// | |||
// . Many columns in a file | |||
// . Differnet data types | |||
// . Different data types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive by cleanup 🚀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tshauck
Thanks LGTM. I think only one test case should check both instead of only row_groups?
expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]), | ||
expected_row_counts: Some(UInt64Array::from(vec![5, 5, 5, 5])), | ||
column_name: "f", | ||
check: Check::RowGroup, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be Check::Both
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think you're right, but doing that leads to a test assertion failure 🤔 ... gonna move the PR back draft while I look further into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is due to the difference in handling row_counts for RowGroups and DataPages. #10965
Im working on this - for now you'd need two test cases, one for each check with different expected row_counts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm stretching a here, but doing a bit of looking I think it's due to the physical type for an f16 being a fixed length byte array. I think column_page_index_per_row_group_per_column
is an Index::FIXED_LEN_BYTE_ARRAY
which seemingly isn't playing well with some of the conversion, so the values coming through end up being None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you seem to be right here; I left a comment in the code - where I think you might have missed something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW #10965 is now fixex
Index::FIXED_LEN_BYTE_ARRAY, | ||
f16 | ||
); | ||
make_data_page_stats_iterator!(MinFloat32DataPageStatsIterator, min, Index::FLOAT, f32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Float
(f32) and Double
(f64) would be enough, in terms of naming. This would also be consistent with the row group stats iterators.
@@ -552,6 +552,22 @@ make_data_page_stats_iterator!(MinInt32DataPageStatsIterator, min, Index::INT32, | |||
make_data_page_stats_iterator!(MaxInt32DataPageStatsIterator, max, Index::INT32, i32); | |||
make_data_page_stats_iterator!(MinInt64DataPageStatsIterator, min, Index::INT64, i64); | |||
make_data_page_stats_iterator!(MaxInt64DataPageStatsIterator, max, Index::INT64, i64); | |||
make_data_page_stats_iterator!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tshauck
It seems like you are not using those in the maching down below? that might cause your error on f16. Instead you are using an Float32 iterator which won't match on Index::FIXED_LEN_BYTE_ARRAY and thus return None.
I think you need to switch to the f16 iterator; but perhaps might need to construct an f16 from the bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the confusion. It pushed this as a WIP. I still need to work through the f16 case. I think I will need to construct an f16 from the bytes, but not sure yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we take a friendly look at how f16 for row groups work:
datafusion/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Lines 336 to 340 in 1cb0057
DataType::Float16 => Ok(Arc::new(Float16Array::from_iter( | |
[<$stat_type_prefix FixedLenByteArrayStatsIterator>]::new($iterator).map(|x| x.and_then(|x| { | |
from_bytes_to_f16(x) | |
})), | |
))), |
I suspect the statistics in data pages will be encodd the same
It is also possible that since f16 is a less common type, the parquet writer might write data page statistics for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi there, I haven't noticed this PR and accidentally addressed the same issue in here #10982 . Apologies. I think the f32 and f64 cases we solve the ~same. I got the f16 working, though not in a way I'm particularly happy with (ie, the move issues you mention)
I don't have any other commits in that PR in my queue -- so happy to close it if you come up with any other f16 solution...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooff, sorry @tmi, I should've done take
on the issue. Taking @alamb's pointer to look at row groups, I might be wrong, but I think there's a couple of differences between it and the data page stats that complicate things...
max_bytes
/min_bytes
, which is used for row groups, does not exist onPageIndex
in the parquet crate but does onStatistics
, so using it in the macro isn't possible at this point.- Using
[u8]
as thestat_value_type
is problematic when combined with the iterator macro in thattype Item = Vec<Option<$stat_value_type>>;
can't work with[u8]
because it's unsized. Changing the item def totype Item = Vec<Option<&'a $stat_value_type>>;
works and is similar to the item definition of row group (type Item = Option<&'a $stat_value_type>;
).
f16 is less common, so maybe to have a solution that's similar to Statistics and hopefully avoids cloning, we could:
- Update this PR to do f32 and f64, get it in to support higher priority types and is mostly innocuous
- Update the parquet crate to support
max_bytes
/min_bytes
onPageIndex
- Update the data page stats to use a
&'a $stat_value_type
and add f16 when that's merged
@tmi, maybe you and I could collaborate on 2 & 3 if that seems reasonable? Otherwise, certainly happy to go another route, if there's a better approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like a good plan to me
I think it is perfectly reasonable to leave f16
support as a 'TODO' in datafusion as we add the needed support upstream
Let me know if I can help (e.g. file some tickets, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think #10982 is good to go FWIW
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sounds good to me. Maybe a further improvement down the line to avoid the clone and simplify to use a method vs a function. I'll close this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
avoid the clone
I think the ideal sequence would be
- Tracking Issue for
f16
andf128
float types rust-lang/rust#116909 gets merged - parquet replaces FixedLengthByteArray with genuine f16 type
- we drop the special handling here and elsewhere
and until then the f16 support in datafusion would be as is... Or alternatively, introduce the half crate to parquet rs first. Because dealing with FixedLengthByteArray on the user side is always going to be messy
Which issue does this PR close?
Closes #10951
Rationale for this change
Serves the overall epic #10922
What changes are included in this PR?
get_data_page_statistics
to include f16, f32, and f64null_counts_page_statistics
to handle floatsAre these changes tested?
Yes, added new tests.
Are there any user-facing changes?
No