Introducing IcebergSplitReader #7362
Conversation
        _length,
        _partitionKeys,
        _tableBucketNumber) {
  // TODO: Deserialize _extraFileInfo to get deleteFiles;
Will the relevant deleteFile information be obtained by deserializing a string? What are the benefits of doing it this way?
I think the best way would be to deserialize the incoming split directly into an IcebergSplit with delete files, so we don't have to deserialize _extraFileInfo again. But this interface is what was decided by Meta.
I see. Thanks.
Recently the Arrow library was partially moved into the Parquet writer. Some of the Parquet reader files still depend on Arrow classes that will be removed in the near future. The namespaces were moved from arrow:: to ::arrow because the code still depends on the third_party arrow_ep installation; this commit fixes that to avoid build failures.
RowContainerSortBenchmark depends on the Parquet reader by default, but Parquet support is controlled by VELOX_ENABLE_PARQUET, which defaults to OFF. This caused build failures because the benchmark tries to include Arrow headers that are not installed when VELOX_ENABLE_PARQUET is off. This commit guards the benchmark by checking VELOX_ENABLE_PARQUET first.
S3InsertTest.cpp used to include ParquetReader.h, which caused build failures when VELOX_ENABLE_PARQUET and VELOX_ENABLE_ARROW were not turned on. However, the include of ParquetReader.h is not needed. This commit cleans up the includes in S3InsertTest.cpp.
The upcoming IcebergSplitReader will need to use fileHandleFactory_, executor_, connectorQueryCtx_, etc., because it needs to create another HiveDataSource to read the delete files. This PR copies these required fields to SplitReader. Moreover, since SplitReader already owns baseReader_, the creation and configuration of ReaderOptions was also moved into SplitReader as a single function, configureReaderOptions(). Previously the configuration of ReaderOptions was scattered across multiple locations in HiveDataSource.
This commit introduces IcebergSplitReader, which supports reading Iceberg splits with positional delete files.
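To illustrate the refactoring idea, here is a minimal standalone sketch of consolidating scattered option assignments into one configureReaderOptions() function. The types here (ReaderOptions, HiveConfig) are simplified stand-ins, not the real Velox classes, which carry many more fields.

```cpp
#include <cstdint>
#include <string>

// Simplified stand-ins for velox's dwio::common::ReaderOptions and the
// hive connector config; field names here are illustrative only.
struct ReaderOptions {
  int64_t loadQuantum = 0;
  int64_t maxCoalesceBytes = 0;
  std::string fileFormat;
};

struct HiveConfig {
  int64_t loadQuantum = 8 << 20;
  int64_t maxCoalesceBytes = 128 << 20;
};

// One place that builds ReaderOptions from the connector config, instead
// of scattering the assignments across multiple call sites.
ReaderOptions configureReaderOptions(
    const HiveConfig& config,
    const std::string& format) {
  ReaderOptions options;
  options.loadQuantum = config.loadQuantum;
  options.maxCoalesceBytes = config.maxCoalesceBytes;
  options.fileFormat = format;
  return options;
}
```

Centralizing the configuration this way lets both HiveDataSource and the new IcebergSplitReader build identical reader options without duplicating the logic.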
  ConnectorQueryCtx* const connectorQueryCtx_;
  uint64_t splitOffset_;

  std::unique_ptr<HiveDataSource> deleteDataSource_;
@yingsu00 Please use a dwio::common::Reader instead of the whole data source here. There is a lot of stuff you don't need in a delta file reader: column manipulation, remaining filter, split prefetch, etc.
@Yuhta Thanks for reviewing. Yes I can do that. I'll move the util functions like makeScanSpec, getColumnName, etc to a new HiveReaderUtil class, so that they can be shared by the HiveDataSource and the DeleteFileReader.
That sounds good. Curious why you need makeScanSpec though? You just need a scan spec to read all rows and all columns, no?
@Yuhta For positional deletes, we will need to build subfield filters on the path column, since an Iceberg positional delete file can contain deleted rows for multiple base files. We will also add a filter on the delete row position column: since it's sorted and the lower/upper bounds are known, this improves performance.
For equality deletes, not every column needs to be read in some cases.
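The pruning described above can be sketched in standalone form. This is not Velox code; PositionalDelete and deletesForSplit are hypothetical names, and the real implementation would push these predicates down as subfield filters rather than filter in a loop.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for one row of an Iceberg positional delete file:
// a (file_path, pos) pair. Rows are sorted by pos within each file_path.
struct PositionalDelete {
  std::string filePath;
  int64_t pos;
};

// Collect the deleted row positions that apply to one base file and fall
// inside the split's row range [splitStart, splitEnd). This mirrors the
// filters described above: an equality filter on the path column plus a
// range filter on the (sorted) position column.
std::vector<int64_t> deletesForSplit(
    const std::vector<PositionalDelete>& deletes,
    const std::string& baseFilePath,
    int64_t splitStart,
    int64_t splitEnd) {
  std::vector<int64_t> out;
  for (const auto& d : deletes) {
    if (d.filePath == baseFilePath && d.pos >= splitStart &&
        d.pos < splitEnd) {
      out.push_back(d.pos);
    }
  }
  return out;
}
```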
You don't need makeScanSpec for these; just construct the ScanSpec yourself (see the code example below). It is very small and controllable. I did that for metalake. makeScanSpec is more for translating the SubfieldFilters and prunings from the Presto representation.
auto spec = std::make_shared<common::ScanSpec>("<root>");
spec->addAllChildFields(*rowType);
auto* pathSpec = spec->childByName("path");
pathSpec->setFilter(std::make_unique<common::BytesValues>(std::vector<std::string>{path}, false));
pathSpec->setProjectOut(false);
  uint64_t recordCount;
  uint64_t fileSizeInBytes;
  // The field ids for the delete columns for equality delete files
  std::vector<int32_t> equalityFieldIds;
Why mix up the contents of delete files of different types? Can we provide different DeleteFileReaders that return the corresponding data structure for each delete type, like a bitmap for positional deletes or maybe a hash table for equality deletes? And maybe this can be made common with other data sources like Delta Lake?
I like this proposal. Looking at the code, I don't see why we need a separate IcebergSplitReader, etc.
We could add a DeleteFileReader to the SplitReader and implement an IcebergDeleteFileReader.
The DeleteFileReader would return positional or equality deletes in a format independent of the table format.
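A minimal sketch of the abstraction proposed here, with hypothetical names (DeleteResult, DeleteFileReader, IcebergDeleteFileReader): the SplitReader sees only table-format-independent delete results, and each table format (Iceberg, Delta Lake, ...) plugs in its own reader.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Table-format-independent result of reading delete files. A production
// version might use a bitmap for positional deletes and a hash table for
// equality deletes, as suggested above; a sorted vector keeps this sketch
// short.
struct DeleteResult {
  std::vector<int64_t> deletedRows; // row numbers in the base file to skip
};

class DeleteFileReader {
 public:
  virtual ~DeleteFileReader() = default;
  virtual DeleteResult read() = 0;
};

// Iceberg-specific implementation; a real one would open the delete files
// listed in the split and decode them.
class IcebergDeleteFileReader : public DeleteFileReader {
 public:
  explicit IcebergDeleteFileReader(std::vector<int64_t> rows)
      : rows_(std::move(rows)) {}

  DeleteResult read() override {
    return DeleteResult{rows_};
  }

 private:
  std::vector<int64_t> rows_;
};
```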
I think this is a good idea, so only the delta file readers are implementation specific and the SplitReader logic is shared.
Why mixing up the contents of delete files with different types?
@yma11 It's not mixing different types. This IcebergDeleteFile is just what Presto passes as the metadata for a delete file, not the file or the reader itself. It contains information for both types and is passed to Velox through the split. There will be separate delete file reader classes. The current DeleteFileReader is for positional deletes, and there will be an EqualityDeleteFileReader class that reads equality delete files. I named it DeleteFileReader because I want to subclass it into PositionalDeleteFileReader and EqualityDeleteFileReader in the next PR, with the hope of reusing the logic for creating Readers and RowReaders. Or we can extract the common logic and make them two separate classes.
namespace facebook::velox::connector::hive::iceberg {

enum class FileContent {
  kData,
Will there be any place for FileContent to be used for normal data files? I think we may not need it if it's only for delete file type identification.
Not yet. I kept it there for future use. I can remove it for now.
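For context on the enum discussed above: the Iceberg table spec defines file content codes as 0 = data, 1 = position deletes, 2 = equality deletes. A sketch of how the truncated enum might look when completed (enumerator names beyond kData are guesses here, not taken from the PR):

```cpp
// Per the Iceberg table spec, a manifest entry records its file content as:
//   0 = data file, 1 = position delete file, 2 = equality delete file.
enum class FileContent {
  kData = 0,
  kPositionalDeletes = 1,
  kEqualityDeletes = 2,
};

// Convenience check distinguishing delete files from base data files.
bool isDeleteFile(FileContent content) {
  return content == FileContent::kPositionalDeletes ||
      content == FileContent::kEqualityDeletes;
}
```

Keeping kData in the enum lets the same type describe any file the split metadata references, even if the split reader currently only consumes the delete variants.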
In this PR we introduce IcebergSplitReader, which supports reading Iceberg splits with positional delete files.
This PR is to replace #5897
TODO: