
Introducing IcebergSplitReader #7362

Closed

Conversation

@yingsu00 (Collaborator) commented Nov 1, 2023:

This PR introduces IcebergSplitReader, which supports reading Iceberg splits with positional delete files.

This PR replaces #5897.

TODO:

  1. Delete file deserialization
  2. Add filter on base file path when reading the delete file
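For context on the mechanism (not part of the PR text): per the Iceberg spec, a positional delete file stores (file_path, pos) pairs identifying deleted row positions in specific base files. Below is a minimal standalone sketch of how sorted delete positions can be applied to a scan batch; all names are hypothetical.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Given the sorted deleted row positions for the current base file, mark
    // which rows of the batch [batchStart, batchStart + batchSize) survive.
    std::vector<bool> liveRowsInBatch(
        const std::vector<int64_t>& deletedPositions, // sorted ascending
        int64_t batchStart,
        int64_t batchSize) {
      std::vector<bool> live(batchSize, true);
      // Binary-search to the first deleted position inside this batch.
      auto it = std::lower_bound(
          deletedPositions.begin(), deletedPositions.end(), batchStart);
      for (; it != deletedPositions.end() && *it < batchStart + batchSize;
           ++it) {
        live[*it - batchStart] = false;
      }
      return live;
    }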

@yingsu00 force-pushed the iceberg7 branch 2 times, most recently from 801e2ce to e8b35ab on Nov 2, 2023.
    _length,
    _partitionKeys,
    _tableBucketNumber) {
    // TODO: Deserialize _extraFileInfo to get deleteFiles;
Contributor:

Will the relevant deleteFile information be obtained through string deserialization? What are the benefits of doing it this way?

@yingsu00 (Collaborator, Author):

I think the best way is to deserialize the incoming split directly into an IcebergSplit with delete files, so we don't have to deserialize from _extraFileInfo again. But this interface is what was decided by Meta.
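To illustrate the shape being argued for here (a sketch only; the names are hypothetical, and as noted the real interface was decided by Meta): the split itself would carry the already-deserialized delete-file metadata, so the reader never re-parses _extraFileInfo.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical metadata for one delete file, carried by the split.
    struct IcebergDeleteFileSketch {
      std::string filePath;  // location of the delete file
      int64_t recordCount;   // number of delete records in it
    };

    // Hypothetical Iceberg split: delete files arrive resolved, not serialized.
    struct IcebergSplitSketch {
      std::string baseFilePath;                          // data file to scan
      uint64_t start;                                    // split byte offset
      uint64_t length;                                   // split byte length
      std::vector<IcebergDeleteFileSketch> deleteFiles;  // ready to use
    };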

Contributor:

I see. Thanks.

Recently the Arrow lib was partially moved into the Parquet writer. Some of the Parquet reader files still depend on Arrow classes that will be removed in the near future. The namespaces are now moved from arrow:: to ::arrow because the reader still depends on the third_party arrow_ep installation; this commit fixes that to avoid build failures.

RowContainerSortBenchmark depends on the Parquet reader by default, but Parquet support is controlled by VELOX_ENABLE_PARQUET, which defaults to OFF. This caused build failures because the benchmark tries to include Arrow headers that are not installed when VELOX_ENABLE_PARQUET is off. This commit guards the benchmark by checking VELOX_ENABLE_PARQUET first.

S3InsertTest.cpp used to include ParquetReader.h, which caused build failures if VELOX_ENABLE_PARQUET and VELOX_ENABLE_ARROW were not turned on. However, the include of ParquetReader.h is not needed. This commit cleans up the includes of S3InsertTest.cpp.

The upcoming IcebergSplitReader will need to use fileHandleFactory_, executor_, connectorQueryCtx_, etc., because it needs to create another HiveDataSource to read the delete files. This PR copies these required fields to SplitReader. Moreover, since SplitReader already owns baseReader_, the creation and configuration of ReaderOptions was also moved to SplitReader in a single function, configureReaderOptions(). Previously the configuration of ReaderOptions was scattered across multiple locations in HiveDataSource.

This commit introduces IcebergSplitReader, which supports reading Iceberg splits with positional delete files.
    ConnectorQueryCtx* const connectorQueryCtx_;
    uint64_t splitOffset_;

    std::unique_ptr<HiveDataSource> deleteDataSource_;
@Yuhta (Contributor):

@yingsu00 Please use a dwio::common::Reader instead of the whole data source here. There is a lot of stuff you don't need in a delta file reader: column manipulation, remaining filter, split prefetch, etc.
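For reference, a minimal sketch of reading the delete file through a bare Reader/RowReader as suggested, assuming Velox's dwio::common reader-factory API of the time. The variables pool, readFile, and scanSpec are assumed to exist, includes are abbreviated, and the exact calls are an approximation rather than the PR's code.

    #include "velox/dwio/common/BufferedInput.h"
    #include "velox/dwio/common/ReaderFactory.h"

    using namespace facebook::velox;

    // Create a format-specific Reader directly, skipping HiveDataSource.
    dwio::common::ReaderOptions readerOpts(pool); // pool: memory::MemoryPool*
    readerOpts.setFileFormat(dwio::common::FileFormat::PARQUET);
    auto input =
        std::make_unique<dwio::common::BufferedInput>(readFile, *pool);
    auto reader = dwio::common::getReaderFactory(readerOpts.getFileFormat())
                      ->createReader(std::move(input), readerOpts);

    // A RowReader scoped to the delete file's columns via the ScanSpec.
    dwio::common::RowReaderOptions rowReaderOpts;
    rowReaderOpts.setScanSpec(scanSpec);
    auto rowReader = reader->createRowReader(rowReaderOpts);

    VectorPtr batch;
    while (rowReader->next(1024, batch) > 0) {
      // Collect deleted row positions from `batch` here.
    }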

@yingsu00 (Collaborator, Author):

@Yuhta Thanks for reviewing. Yes, I can do that. I'll move the util functions like makeScanSpec, getColumnName, etc. to a new HiveReaderUtil class so that they can be shared by HiveDataSource and DeleteFileReader.

@Yuhta (Contributor), Nov 20, 2023:

That sounds good. Curious why you need makeScanSpec, though? You just need a scan spec that reads all rows and all columns, no?

@yingsu00 (Collaborator, Author), Nov 25, 2023:

@Yuhta For positional deletes, we will need to build subfield filters on the path column, because an Iceberg positional delete file can contain deleted rows for multiple base files. We will also add a filter on the delete row position column: since it's sorted and the lower/upper bounds are known, performance gets better.
For equality deletes, sometimes not every column needs to be read.

@Yuhta (Contributor), Dec 1, 2023:

You don't need makeScanSpec for these; just construct the ScanSpec yourself (see the code example below). It is very small and controllable. I did that for metalake. makeScanSpec is more for translating the SubfieldFilters and prunings from the Presto representation.

    // Build a ScanSpec over all columns of the delete file schema, then
    // attach a filter on the "path" column so only rows for the current
    // base file are read.
    auto spec = std::make_shared<common::ScanSpec>("<root>");
    spec->addAllChildFields(*rowType);
    auto* pathSpec = spec->childByName("path");
    pathSpec->setFilter(std::make_unique<common::BytesValues>(
        std::vector<std::string>{path}, /*nullAllowed=*/false));
    pathSpec->setProjectOut(false);
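And the row-position filter yingsu00 mentioned above could follow the same pattern (a sketch; the column name "pos" and the bound variables are assumptions, not code from this PR):

    // Assumed: the delete file's row-position column is named "pos" and the
    // split's lower/upper position bounds are already known.
    auto* posSpec = spec->childByName("pos");
    posSpec->setFilter(std::make_unique<common::BigintRange>(
        lowerBound, upperBound, /*nullAllowed=*/false));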

    uint64_t recordCount;
    uint64_t fileSizeInBytes;
    // The field ids for the delete columns for equality delete files
    std::vector<int32_t> equalityFieldIds;
@yma11 (Contributor):

Why mix up the contents of delete files of different types? Can we provide different DeleteFileReaders that return the corresponding data structure for the deletes, like a bitmap for positional deletes or maybe a hash table for equality deletes? And maybe this could be made common for other data sources like Delta Lake?

@majetideepak (Collaborator), Nov 27, 2023:

I like this proposal. Looking at the code, I don't see why we need a separate IcebergSplitReader, etc.
We need to add a DeleteFileReader to the SplitReader and implement an IcebergDeleteFileReader.
The DeleteFileReader will return positional or equality deletes in a format independent of the table format.
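A minimal sketch of that proposal, with all names hypothetical: a table-format-agnostic interface that returns deletes in a common representation, with one implementation per table format.

    #include <cstdint>
    #include <vector>

    // Hypothetical format-independent result; a bitmap would work equally
    // well for dense positional deletes, a hash table for equality deletes.
    struct RowDeletes {
      std::vector<int64_t> deletedPositions; // sorted row positions
    };

    // Hypothetical interface: SplitReader stays format-agnostic and only
    // the implementations know about Iceberg, Delta Lake, etc.
    class DeleteFileReader {
     public:
      virtual ~DeleteFileReader() = default;
      virtual RowDeletes readDeletes() = 0;
    };

    class IcebergDeleteFileReader : public DeleteFileReader {
     public:
      RowDeletes readDeletes() override {
        // Parse the Iceberg positional delete file and collect positions...
        return {};
      }
    };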

Contributor:

I think this is a good idea, so only the delta file readers are implementation-specific and the SplitReader logic is shared.

@yingsu00 (Collaborator, Author):

> Why mix up the contents of delete files of different types?

@yma11 It's not mixing different types. This IcebergDeleteFile is just the metadata Presto passes for a delete file, not the file or the reader itself. It contains information for both types and is passed to Velox through the split. There will be separate delete file reader classes: the current DeleteFileReader is for positional deletes, and there will be an EqualityDeleteFileReader class that reads equality delete files. I named it DeleteFileReader because I want to subclass it into PositionalDeleteFileReader and EqualityDeleteFileReader in the next PR, hoping to reuse the logic of creating the Reader and RowReaders. Or we can extract the common logic and make them two separate classes.

    namespace facebook::velox::connector::hive::iceberg {

    enum class FileContent {
      kData,
Contributor:

Will there be any place where FileContent is used for a normal data file? I think we may not need it if it's only for identifying the delete file type.

@yingsu00 (Collaborator, Author):

Not yet. I kept it there for future use. I can remove it for now.

@yingsu00 (Collaborator, Author):

@Yuhta #7847 is the new PR that uses RowReader instead of HiveDataSource. Can you please review that one? I kept both open just for your reference. Thanks!

@Yuhta closed this on Feb 29, 2024.