-
Notifications
You must be signed in to change notification settings - Fork 784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat:Add function for row alignment with page mask #1791
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1791 +/- ##
==========================================
+ Coverage 83.39% 83.42% +0.02%
==========================================
Files 198 199 +1
Lines 56142 56427 +285
==========================================
+ Hits 46821 47075 +254
- Misses 9321 9352 +31
Continue to review full report at Codecov.
|
Co-authored-by: Liang-Chi Hsieh <[email protected]>
Thank you, I will review this tomorrow (GMT) |
} | ||
|
||
/// Return the row ranges `Vec(start, len)` of all the selected pages | ||
pub fn compute_row_ranges( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will this be used?
Do you need to make something like with_predicate
in ReadOptionsBuilder
to take a closure for filtering pages based on the ranges?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, basically add a closure called page_predicates
in ReadOptionsBuilder
like
Box<dyn FnMut(&[pageIndex], &[pageLocation], usize) -> &[bool]>
to generate mask then call this function compute_row_ranges
.
I'm thinking about doing this at the end for testing with datafusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, left some high level comments. Will review more thoroughly tomorrow
@@ -223,6 +223,7 @@ pub struct RowGroupMetaData { | |||
num_rows: i64, | |||
total_byte_size: i64, | |||
schema_descr: SchemaDescPtr, | |||
// Todo add filter result -> row range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about this the more I wonder whether the metadata structs are the right place to put the index information. They're parsed and interpreted separately from the main metadata, and so I think it makes sense for them to be stored separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. page index stored in file-meta level.
My thought is read less pageIndex after rowgroup filter
arrow-rs/parquet/src/file/serialized_reader.rs
Lines 211 to 224 in be38829
let mut filtered_row_groups = Vec::<RowGroupMetaData>::new(); | |
for (i, rg_meta) in row_groups.into_iter().enumerate() { | |
let mut keep = true; | |
for predicate in &mut predicates { | |
if !predicate(&rg_meta, i) { | |
keep = false; | |
break; | |
} | |
} | |
if keep { | |
filtered_row_groups.push(rg_meta); | |
} | |
} | |
arrow-rs/parquet/src/file/serialized_reader.rs
Lines 246 to 249 in be38829
metadata: ParquetMetaData::new( | |
metadata.file_metadata().clone(), | |
filtered_row_groups, | |
), |
So i want to read index here and insert it into RowGroupMetaData.
It was just a simple idea at first, maybe we can find a better way in the process of implementation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
page index stored in file-meta level.
It isn't even file-meta level, it isn't part of the footer but stored as separate pages 😅
It was just a simple idea at first, maybe we can find a better way in the process of implementation
Provided we take care to ensure we keep things pub(crate) so we don't break APIs, this seems like a good strategy 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't even file-meta level, it isn't part of the footer but stored as separate pages 😅
yes separately from RowGroup, before the footer !😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good, I am, however, a bit confused as to how it will be used... I had expected something like the following.
For each not-pruned row group:
- Use
Index
to identify covering set of rows based on predicates - Pass row selection down to RecordReader
- Add a
skip_next_page
toPageReader
- Add a
skip_values
toColumnValueDecoder
- Have RecordReader use a combination of the above to skip pages and rows during decode based on the row selection
This would also naturally extend to #1191
I'm not entirely sure where the datastructure added in this PR would fit into this?
Edit: is the intention to use this for the first step?
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Basically, yes ! 👍
First step contains:
@tustvold I will follow up in PageReader |
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Which issue does this PR close?
Closes #1790.
Rationale for this change
For now row group filter in datafusion pass a closure to arrow-rs
https://github.com/apache/arrow-datafusion/blob/585bc3a629b92ea7a86ebfe8bf762dbef4155710/datafusion/core/src/physical_plan/file_format/parquet.rs#L559-L562
So for page filter in datafusion, define filter_predicate
datafusion will send a
mask
(&[bool]) to arrow-rs,then use mask call
compute_row_ranges
to constructRowRanges
: row ranges in a row-group (one col) if col is sorted vec size will be 1.For multi filter combine:
if there are two filters use
and
connect,useRowRanges::intersection
to get the final rowRange; two filters useor
connect,useRowRanges::union
to get the final rowRange.After this PR: i will working on
column_page_reader
, enable read specific row ranges record.What changes are included in this PR?
Are there any user-facing changes?