-
Notifications
You must be signed in to change notification settings - Fork 843
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Parquet RowFilter API #2335
Conversation
arrow_schema: SchemaRef, | ||
mask: ProjectionMask, | ||
row_groups: Box<dyn RowGroupCollection>, | ||
row_groups: &dyn RowGroupCollection, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by cleanup
@@ -110,8 +110,8 @@ pub trait RowGroupCollection { | |||
} | |||
|
|||
impl RowGroupCollection for Arc<dyn FileReader> { | |||
fn schema(&self) -> Result<SchemaDescPtr> { | |||
Ok(self.metadata().file_metadata().schema_descr_ptr()) | |||
fn schema(&self) -> SchemaDescPtr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by cleanup
use arrow::record_batch::RecordBatch; | ||
|
||
/// A predicate operating on [`RecordBatch`] | ||
pub trait ArrowPredicate: Send + 'static { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to make things more extensible in the long run
@@ -349,22 +323,13 @@ impl RecordBatchReader for ParquetRecordBatchReader { | |||
} | |||
|
|||
impl ParquetRecordBatchReader { | |||
pub fn try_new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This module is not public, and this method was only being used in one place, so we can just remove it
) -> Result<RowSelection> { | ||
let reader = | ||
ParquetRecordBatchReader::new(batch_size, array_reader, selection.clone()); | ||
let mut filters = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could theoretically keep the decoded arrays around, but requires a lot of non-trivial take
+ concat
in order to sync up the yielded batches. It also potentially balloons the memory consumption. I decided it was not worth it
/// Once all predicates have been evaluated, the resulting [`RowSelection`] will be | ||
/// used to return just the desired rows. | ||
/// | ||
/// This design has a couple of implications: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the major change vs #2310, FYI @thinkharderdev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I like this
self, | ||
selection: impl Into<Vec<RowSelection>>, | ||
) -> Self { | ||
/// TODO: Revisit this API, as [`Self`] is provided before the file metadata is available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intend to revisit this as part of the next (21) arrow release, I suspect we can move to a builder and deprecate the current API which is quite clunky
fn into(self) -> VecDeque<RowSelector> { | ||
self.selectors.into() | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file definitely needs some tests prior to merge. The code is largely lifted from #2201
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- I didn't review the logic in detail either as I figured we are just at the "API feedback" design phase
vec![None; row_group_metadata.columns().len()]; | ||
|
||
// TODO: Combine consecutive ranges | ||
let fetch_ranges = (0..column_chunks.len()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is moved into InMemoryRowGroup::fetch
FYI @thinkharderdev @alamb @crepererum @Ted-Jiang I'd appreciate any feedback you might have on this |
row_group | ||
.fetch(&mut self.input, meta, &projection, selection.as_ref()) | ||
.await?; | ||
|
||
let reader = ParquetRecordBatchReader::new( | ||
batch_size, | ||
build_array_reader(self.schema.clone(), projection, &row_group)?, | ||
selection, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would require decoding the filter columns twice (or multiple times in the case where we have multiple predicates) right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a column appears in multiple predicates and/or the final projection, it will need to be decoded multiple times. I don't really see a way around this, keeping the data around and doing take + concat
adds significant complexity, and it is unclear that it would necessarily be faster.
Eventually it might be possible to push simple predicates down to operate directly on the encoded data, which would avoid this. But that is a wee ways off 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think in many cases the difference would be negligible but in the degenerate cases (lots of predicates, filters which don't do much filtering, etc) I think it could potentially add up. The reason I worry about it in general is that we would have to rely on the engine to determine which predicates to apply and in which order. And in a situation where all we have is row group metadata we don't have a ton to go on.
I took a crack at seeing what it might look like preserving the decoded arrays and came up with tustvold#24. It certainly involves a lot of array slicing and dicing but the complexity seems manageable and would help ensure that applying filters doesn't ever come with a significant performance cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think trying to eliminate redundant decoding is a good idea for the reasons that @thinkharderdev
Conveniently, it seems like nothing in the API of this PR requires decoding multiple times, so I think we could also potentially implement the 'use take rather than redundant decode' in a follow on PR as well.
In terms of "Eventually it might be possible to push simple predicates down to operate directly on the encoded data, which would avoid this." I agree it is a ways off however, I think it could fit into this API with something like adding a list of ParquetFilter
s to apply during the decode itself that could be efficiently implemented
/// Filter that is applied during decoding of a single column
/// semantically takes the form <col> <op> <constant>
struct ParquetFilter {
op: ParquetOp
left: ParquetConst,
}
enum ParquetFilterOp { EQ, NEQ }
enum ParqeutConst {
Int64(i64)
Float(f64)
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conveniently, it seems like nothing in the API of this PR requires decoding multiple times, so I think we could also potentially implement the 'use take rather than redundant decode' in a follow on PR as well.
I agree. As @tustvold points out, we need to be careful about memory overhead so maybe the best course is to go with the current approach and tackle avoiding the redundant decoding in a follow up. I prototyped the DataFusion piece (based on my other draft PR but it should be roughly similar) to see how it affected our internal benchmarks and saw a roughly 50% improvement on queries with reasonably selective predicates. We went from being mostly CPU bound with parquet decoding to being mostly IO bound which means I expect there is even more room for improvement once we are using the selection and the page index to avoid IO altogether. That's all to say I'm super excited about this work and think it will be a huge step forward!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will create a follow up ticket to investigate this 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We went from being mostly CPU bound with parquet decoding to being mostly IO bound which means I expect there is even more room for improvement once we are using the selection and the page index to avoid IO altogether. That's all to say I'm super excited about this work and think it will be a huge step forward!
That is terrific news 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also really like this API -- 👍 @tustvold
thanks for the feedback @thinkharderdev . I think this design will be better due to all the community feedback ❤️
fn into(self) -> VecDeque<RowSelector> { | ||
self.selectors.into() | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- I didn't review the logic in detail either as I figured we are just at the "API feedback" design phase
/// Row group filtering is applied prior to this, and rows from skipped | ||
/// row groups should not be included in the [`RowSelection`] | ||
/// | ||
/// TODO: Make public once stable (#1792) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably also a good idea to link to the docs that describe the order of filter application in decoding (RowSelection followed by RowFilter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc links should do this automatically?
row_group | ||
.fetch(&mut self.input, meta, &projection, selection.as_ref()) | ||
.await?; | ||
|
||
let reader = ParquetRecordBatchReader::new( | ||
batch_size, | ||
build_array_reader(self.schema.clone(), projection, &row_group)?, | ||
selection, | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think trying to eliminate redundant decoding is a good idea for the reasons that @thinkharderdev
Conveniently, it seems like nothing in the API of this PR requires decoding multiple times, so I think we could also potentially implement the 'use take rather than redundant decode' in a follow on PR as well.
In terms of "Eventually it might be possible to push simple predicates down to operate directly on the encoded data, which would avoid this." I agree it is a ways off however, I think it could fit into this API with something like adding a list of ParquetFilter
s to apply during the decode itself that could be efficiently implemented
/// Filter that is applied during decoding of a single column
/// semantically takes the form <col> <op> <constant>
struct ParquetFilter {
op: ParquetOp
left: ParquetConst,
}
enum ParquetFilterOp { EQ, NEQ }
enum ParqeutConst {
Int64(i64)
Float(f64)
...
}
async fn read_row_group( | ||
mut self, | ||
row_group_idx: usize, | ||
mut selection: Option<RowSelection>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a situation we need read row groups with previous row group selection? 🤔
Edit: sorry it may come from pageIndex
, forgive me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need the previous each filter rate 😂(just a idea)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previous each filter rate
I'm not sure what you mean?
projection: &ProjectionMask, | ||
_selection: Option<&RowSelection>, | ||
) -> Result<()> { | ||
// TODO: Use OffsetIndex and selection to prune pages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this avoid huge IO work in some situation make pageIndex more useful !
I think it needs takes a lot of testing to decide when use random skip reads
I think this is now good for review, it isn't public (yet) and so doesn't need to be perfect, but I think we can continue to iterate on this. I will file follow up tickets for follow on work tomorrow. |
/// with `true` values in the returned [`BooleanArray`] indicating rows | ||
/// matching the predicate. | ||
/// | ||
/// All row that are `true` in returned [`BooleanArray`] will be returned to the reader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor typo: All row
is probably meant to be All rows
(missing "s")
/// | ||
/// All row that are `true` in returned [`BooleanArray`] will be returned to the reader. | ||
/// Any rows that are `false` or `Null` will not be | ||
fn filter(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be named fn filter_array
instead to better indicate that the result would be a Boolean filter array instead of actually filtering the RecordBatch passed in as the batch parameter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think both names are kind of confusing tbh, I'll rename it to evaluate
as I think that should be clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR looks good to go to me so we can start hooking it up and getting everything ready.
I definitely have some questions about the and
test -- but since this code isn't used yet I don't think it would block merge.
@Ted-Jiang or @thinkharderdev do you have any other comments or suggestions? We can also address any additional changes in a follow on PR as this one is already fairly large
RowSelector::skip(4), | ||
]); | ||
|
||
let mut expected = RowSelection::from(vec![ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The expected answer doesn't make sense to me. When I did it out there seems to be something wrong
N = Skip
Y = Select
a: NNNNNNNNNNNNYYYYYYYYYYYYYYYYYYYYYYNNNYYYYY
b: YYYYYNNNNYYYYYYYYYYYYYYYNNN
What is here:
e: NNNNNNNNNNNNYYYYYNNNNYYYYYYYYYYYYYYNNYNNNN
What I think the answer should be:
e: NNNNNNNNNNNNYYYYYYYYYYYYNNNYYYYYYYNNNYYYYY
Though to be honest I am not sure sure what an AND
and nulls should be
I am probably missing something obvious here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks right to me. The and
should just be giving you the result of applying the filters sequentially. Visually, it makes sense (to me) like
a: NNNNNNNNNNNNYYYYYYYYYYYYYYYYYYYYYYNNNYYYYY
b: YYYYYNNNNYYYYYYYYYYYYY YYNNN
NNNNNNNNNNNNYYYYYNNNNYYYYYYYYYYYYYYNNYNNNN
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps and
is the wrong name for this function? I will incorporate @thinkharderdev 's example as a comment as I think it is helpful 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I think and
is a little confusing since we're dealing with a boolean array and it is not clear from the name that this is essentially a composition operator. Maybe and_then
?
I think this should is good to merge. |
Benchmark runs are scheduled for baseline = 4481993 and contender = 21ba02e. 21ba02e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice 👌
} | ||
|
||
/// Given a [`RowSelection`] computed under `self`, returns the [`RowSelection`] | ||
/// representing their conjunction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think the wording saying this is a conjunction
is misleading, it is more like the conjunction of the subsequent filters only with the rows that were selected previously
Draft as needs a lot more test coverage and general cleanupWhich issue does this PR close?
Closes #2270
Rationale for this change
What changes are included in this PR?
This adds a
RowFilter
API and refines the existingRowSelection
API. There are a couple of things worth highlighting hereRowFilter
is pushed down to the IO level. @crepererum gave a good use case of if this allows eliminating an entire column chunk from consideration, etc...RecordBatchReaderBuilder
as the current API can't really be used for this purposeAre there any user-facing changes?