for each column, compute the set of pages that fill all intervals
pass pages (and their ranges) to consumers to deserialize to their favorite in-memory format
For c1 these are full pages; for c2 these are slices of pages (slices of p22,p23,p24).
Implementation
Because ColumnIndex has encoded data, we need a trait object Index to describe the different physical layouts, just like we do for statistics, so that consumers do not need to worry about de-serializing the values prior to using them:
    pub trait Index: Send + Sync + std::fmt::Debug {
        fn as_any(&self) -> &dyn Any;

        fn physical_type(&self) -> &PhysicalType;
    }

    /// The index of a page, containing the min and max values of the page.
    pub struct PageIndex<T> {
        /// The minimum value in the page. It is None when all values are null
        pub min: Option<T>,
        /// The maximum value in the page. It is None when all values are null
        pub max: Option<T>,
        /// The number of null values in the page
        pub null_count: Option<i64>,
    }

    /// An index of a column of [`NativeType`] physical representation
    pub struct NativeIndex<T: NativeType> {
        pub indexes: Vec<PageIndex<T>>,
        pub boundary_order: BoundaryOrder,
    }

    impl<T: NativeType> Index for NativeIndex<T> {
        fn as_any(&self) -> &dyn Any {
            self
        }

        fn physical_type(&self) -> &PhysicalType {
            &T::TYPE
        }
    }

    /// An index of a column of bytes physical type
    #[derive(Debug, Clone, PartialEq, Eq, Hash)]
    pub struct ByteIndex {
        pub indexes: Vec<PageIndex<Vec<u8>>>,
        pub boundary_order: BoundaryOrder,
    }
...
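To illustrate how a consumer might use such a trait object, here is a minimal, self-contained sketch. It assumes a simplified `PhysicalType` enum and a hypothetical `Int32Index` standing in for `NativeIndex<i32>`. The helper `pages_maybe_containing` is also hypothetical. The consumer checks the physical type, downcasts via `as_any`, and then reads the already-deserialized min and max values.

```rust
use std::any::Any;
use std::fmt::Debug;

/// Simplified stand-in for the crate's physical type (an assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PhysicalType {
    Int32,
    Int64,
    ByteArray,
}

pub trait Index: Send + Sync + Debug {
    fn as_any(&self) -> &dyn Any;
    fn physical_type(&self) -> &PhysicalType;
}

#[derive(Debug, Clone, PartialEq)]
pub struct PageIndex<T> {
    pub min: Option<T>,
    pub max: Option<T>,
    pub null_count: Option<i64>,
}

/// Hypothetical i32-only index, standing in for `NativeIndex<i32>`.
#[derive(Debug)]
pub struct Int32Index {
    pub indexes: Vec<PageIndex<i32>>,
}

impl Index for Int32Index {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn physical_type(&self) -> &PhysicalType {
        &PhysicalType::Int32
    }
}

/// Returns, per page, whether the page may contain `value`.
/// Returns `None` if the index is not an int32 index.
pub fn pages_maybe_containing(index: &dyn Index, value: i32) -> Option<Vec<bool>> {
    if index.physical_type() != &PhysicalType::Int32 {
        return None;
    }
    let index = index.as_any().downcast_ref::<Int32Index>()?;
    Some(
        index
            .indexes
            .iter()
            .map(|page| match (&page.min, &page.max) {
                (Some(min), Some(max)) => *min <= value && value <= *max,
                // min/max are None when all values are null: nothing to match
                _ => false,
            })
            .collect(),
    )
}
```

The resulting `Vec<bool>` has one entry per page and could serve as the input from which row intervals are derived.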
For step 1, we need to compute the set of intervals that selects the relevant rows from a set of pages:
PageLocation contains the location of the page in the file (in bytes).
Interval contains the start and length (in number of rows) of the elements to retrieve from said page. The invariant is start + length <= page.num_rows().
Multiple-column selectors are computed from the overlap of each column's intervals.
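The overlap of intervals from several columns can be sketched as an intersection of two sorted interval lists. This is only an illustration. The `Interval` layout and the `intersect` helper are assumptions, and the intervals are taken to be expressed in row coordinates of the row group, sorted and non-overlapping within each list.

```rust
/// A run of selected rows (assumed to be in row-group coordinates).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval {
    pub start: usize,
    pub length: usize,
}

impl Interval {
    /// One past the last selected row.
    fn end(&self) -> usize {
        self.start + self.length
    }
}

/// Intersects two lists of sorted, non-overlapping intervals.
pub fn intersect(a: &[Interval], b: &[Interval]) -> Vec<Interval> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        let start = a[i].start.max(b[j].start);
        let end = a[i].end().min(b[j].end());
        if start < end {
            out.push(Interval {
                start,
                length: end - start,
            });
        }
        // advance whichever interval finishes first
        if a[i].end() < b[j].end() {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

Since both lists are walked once, the cost is linear in the total number of intervals.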
For step 2, we need a function to select pages from any column chunk:
Here, FilteredPage is something like:

    /// An enum describing a page that was either selected in a filter pushdown or skipped
    #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
    pub enum FilteredPage {
        Select {
            /// Location of the page in the file in bytes
            start: u64,
            /// the length of the page in bytes
            length: usize,
            /// Location of rows to select in the page
            rows_offset: usize,
            rows_length: usize,
        },
        Skip {
            /// Location of the page in the file
            start: u64,
            /// the length of the page in bytes
            length: usize,
            /// number of rows that are skipped by skipping this page
            num_rows: usize,
        },
    }
With this enum, we have everything we need to implement a reader that skips pages that are not needed. When pages are selected, rows_offset and rows_length are passed along so that consumers know which part of the page they must decode.
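As a rough sketch of step 2, the function below maps a column chunk's pages and a set of row intervals to a list of `FilteredPage`. The `PageRows` descriptor and the `select_pages` name and signature are hypothetical, and intervals are assumed to be in row-group coordinates. As a simplification, a page that overlaps several intervals yields one `Select` entry per overlap.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval {
    pub start: usize,
    pub length: usize,
}

/// Hypothetical page descriptor: byte location plus the rows it covers.
#[derive(Debug, Clone, Copy)]
pub struct PageRows {
    pub start: u64,       // byte offset of the page in the file
    pub length: usize,    // page length in bytes
    pub first_row: usize, // index of the page's first row in the row group
    pub num_rows: usize,  // number of rows in the page
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum FilteredPage {
    Select {
        start: u64,
        length: usize,
        rows_offset: usize,
        rows_length: usize,
    },
    Skip {
        start: u64,
        length: usize,
        num_rows: usize,
    },
}

/// Marks each page as selected (with the row slice to decode) or skipped.
pub fn select_pages(pages: &[PageRows], intervals: &[Interval]) -> Vec<FilteredPage> {
    let mut out = Vec::new();
    for page in pages {
        let page_end = page.first_row + page.num_rows;
        let mut selected = false;
        for interval in intervals {
            // overlap between the interval and the page's row range
            let start = interval.start.max(page.first_row);
            let end = (interval.start + interval.length).min(page_end);
            if start < end {
                selected = true;
                out.push(FilteredPage::Select {
                    start: page.start,
                    length: page.length,
                    // offset relative to the page, so that
                    // rows_offset + rows_length <= page.num_rows
                    rows_offset: start - page.first_row,
                    rows_length: end - start,
                });
            }
        }
        if !selected {
            out.push(FilteredPage::Skip {
                start: page.start,
                length: page.length,
                num_rows: page.num_rows,
            });
        }
    }
    out
}
```

A reader could then seek past every `Skip` entry using its `start` and `length`, and only read and decompress the bytes of `Select` entries.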
Consumers (of pages) that apply this filter must be able to decode pages with an offset (Rust's .skip) and length (Rust's .take).
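A minimal sketch of what this means for a consumer, assuming the page decoder is exposed as an iterator with one item per row (the `decode_selected` helper is hypothetical):

```rust
/// Decodes only the selected rows of a page.
/// `values` is assumed to yield one item per row of the page.
pub fn decode_selected<T, I>(values: I, rows_offset: usize, rows_length: usize) -> Vec<T>
where
    I: Iterator<Item = T>,
{
    values.skip(rows_offset).take(rows_length).collect()
}
```

For nullable or nested columns the offset is expressed in rows rather than in encoded values, so a real decoder would likely also need to walk the definition (and repetition) levels while skipping.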
jorgecarleitao changed the title from "Add support for page-level filter pushdown (indexes)" to "Added support for page-level filter pushdown (indexes)" on Apr 15, 2022
With #100 merged, we can now work on using indexes to support page-level filter pushdown.
This issue describes my initial ideas about the topic:
Design using column indexes
Say we have two columns in a row group, c1 and c2, with the following page structure:
and that we have a filter over c1 that selects it as follows:
the goal is to iterate over c1 and c2 so that we select rows from them accordingly:
For this, 3 steps come to mind:
I_j = (start, len)