Changes to ParquetRecordBatchStream to support row filtering in DataFusion #2270
Comments
Could you perhaps expand upon what you mean by streaming filtering? My understanding was that the two major use-cases for the row-skipping functionality were:
As such I'm not entirely sure why you would need to "refine" your selection whilst reading, or even really what this would mean? FWIW I created #2201 to give a basic example of how filter pushdown might work beyond what is supported by the page index.
@Ted-Jiang is working on fixing this in #2237
So basically the implementation in DataFusion is:
Using your example, you would need to decode/filter the entire row group for your filter columns and then apply the predicate to generate the row selection for the remaining columns. So you'd have to buffer the entire row group in memory for those columns. That's not too terrible, but it seems like it would add overhead and make the decision of whether to try to apply a row-level filter higher stakes. I thought that ideally we would be able to do the filtering one batch at a time. That should add very little overhead (aside from computing the filter expression, which has to be done at some point anyway), so we could always apply a row filter if one is available.
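To make the batch-at-a-time idea concrete, here is a minimal sketch in plain Rust, assuming a hypothetical run-length model of a row selection (a simplified stand-in, not the actual arrow-rs RowSelection type): each call splits off just enough skip/select runs to cover one batch, so nothing beyond the current batch needs to be buffered.

```rust
// Hypothetical run-length model of a row selection: each run either
// skips or selects `row_count` consecutive rows.
#[derive(Debug, Clone, PartialEq)]
struct SelectionRun {
    row_count: usize,
    skip: bool,
}

/// Split off the runs covering the next `batch_size` source rows, so a
/// reader can apply the selection one batch at a time instead of
/// materializing the whole row group's selection up front.
fn split_batch(runs: &mut Vec<SelectionRun>, batch_size: usize) -> Vec<SelectionRun> {
    let mut taken = Vec::new();
    let mut remaining = batch_size;
    while remaining > 0 && !runs.is_empty() {
        let mut run = runs.remove(0);
        if run.row_count > remaining {
            // Split the run at the batch boundary; the tail stays queued.
            runs.insert(
                0,
                SelectionRun {
                    row_count: run.row_count - remaining,
                    skip: run.skip,
                },
            );
            run.row_count = remaining;
        }
        remaining -= run.row_count;
        taken.push(run);
    }
    taken
}
```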
I need to think more on this, but some immediate thoughts that may or may not make sense:
Agreed, the ideal scenario would be to use the page index to avoid the IO altogether. I think the tradeoff there will be around when we fetch the data. Using buffered prefetch has been a big performance improvement for us, and avoiding the eager IO in order to possibly fetch less data could be a net loss, depending on how much you can reduce the data fetched.
Yeah, it would be nice to be able to just pass a
Agreed, it would be great to just be able to pass a predicate in an expression language defined in
Interesting, when is this the case? Would there be situations where the decoder couldn't optimize based on the number of values to skip?
It needs to be benchmarked, but with the exception of PLAIN-encoded primitives, skipping over values still requires reading block headers, interpreting them, etc., and so isn't without cost. If you're then reading very small slices, the dispatch overheads will start to add up.
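The dispatch-overhead point can be illustrated with a hypothetical coalescing pass (not something the reader actually does, just a sketch): skip runs shorter than some threshold are folded into the neighboring select runs, trading a few extra decoded rows for fewer skip/read calls against the decoder.

```rust
/// Merge skip runs shorter than `min_skip` into neighboring select runs.
/// Runs are (row_count, skip) pairs; the idea is that decoding a few
/// unwanted rows can be cheaper than dispatching many tiny skip/read
/// calls, each of which must read and interpret block headers.
fn coalesce(runs: &[(usize, bool)], min_skip: usize) -> Vec<(usize, bool)> {
    let mut out: Vec<(usize, bool)> = Vec::new();
    for &(count, skip) in runs {
        // A short skip is demoted to a select: decode and discard instead.
        let skip = skip && count >= min_skip;
        match out.last_mut() {
            // Merge with the previous run when the skip flag matches.
            Some(last) if last.1 == skip => last.0 += count,
            _ => out.push((count, skip)),
        }
    }
    out
}
```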
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The current (non-public) API for skipping records has some shortcomings when trying to implement predicate pushdown from DataFusion. It assumes that we have the entire VecDeque<RowSelection> for an entire row group when constructing the ParquetRecordBatchReader, which makes streaming filtering a challenge.
Describe the solution you'd like
I wrote a prototype of row-level filtering in DataFusion that required a few changes to the API here. Two main things:
1. A method on ParquetRecordBatchStream which mimics poll_next but applies a selection to the next batch.
2. A method on ParquetRecordBatchReader which mimics next but applies a selection and, importantly, emits a single batch containing all selected rows in the selection. The current implementation just emits the next RowSelection that selects and stops.
See https://github.com/coralogix/arrow-datafusion/blob/row-filtering/datafusion/core/src/physical_plan/file_format/parquet.rs for the (very hacky) prototype in DataFusion built on https://github.com/coralogix/arrow-rs/tree/row-filtering.
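A rough sketch of the second proposed method, using stub types in place of the real reader (the name next_selected, the tuple-based selection, and the stub reader are all illustrative, not the actual arrow-rs API): unlike the current behavior, it walks the whole selection and emits one batch containing every selected row.

```rust
// Hypothetical run-length selection: (row_count, skip) pairs.
struct RowSelection {
    runs: Vec<(usize, bool)>,
}

// Stand-in for ParquetRecordBatchReader: a flat column of values and a
// cursor into it, instead of a real Parquet decoder.
struct ParquetRecordBatchReaderStub {
    rows: Vec<i64>,
    pos: usize,
}

impl ParquetRecordBatchReaderStub {
    /// Mimics `next`, but applies `selection` and emits a single batch
    /// containing *all* selected rows, rather than stopping after the
    /// first selected run.
    fn next_selected(&mut self, selection: &RowSelection) -> Option<Vec<i64>> {
        let mut batch = Vec::new();
        for &(count, skip) in &selection.runs {
            let end = (self.pos + count).min(self.rows.len());
            if !skip {
                batch.extend_from_slice(&self.rows[self.pos..end]);
            }
            self.pos = end;
        }
        if batch.is_empty() {
            None
        } else {
            Some(batch)
        }
    }
}
```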
Describe alternatives you've considered
Additional context