Iceberg/Comet integration POC #9841

huaxingao · 2024-03-01T02:34:25Z

This PR shows how I plan to integrate Comet with Iceberg. It doesn't compile yet because Comet hasn't been released, but it shows how we intend to change the Iceberg code to integrate Comet. Comet also doesn't support Spark 3.5 yet, so I am doing this on 3.4; we will add 3.5 support in Comet.

In VectorizedSparkParquetReaders.buildReader, if the Comet library is available, a CometIcebergColumnarBatchReader is created, which uses the Comet batch reader to read data. We can also add a property later to control whether Comet is used.
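The selection logic described here could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code in this PR: `ReaderSelection`, `isClassPresent`, `choose`, and `ParquetReaderType.fromString` are hypothetical names, and the enum is a guess at how a reader-type property might be modeled.

```java
import java.util.Locale;

// Hypothetical enum modeling the "which reader" property; its exact shape is assumed.
enum ParquetReaderType {
  ICEBERG,
  COMET;

  // Parses a user-supplied property value case-insensitively.
  static ParquetReaderType fromString(String name) {
    if (name == null) {
      throw new IllegalArgumentException("reader type must not be null");
    }
    return valueOf(name.trim().toUpperCase(Locale.ROOT));
  }
}

// Hypothetical helper; not part of Iceberg or Comet.
final class ReaderSelection {
  private ReaderSelection() {}

  // Returns true if the named class can be located without initializing it.
  static boolean isClassPresent(String className) {
    try {
      Class.forName(className, false, ReaderSelection.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  // Comet is used only when it is both requested and available on the classpath;
  // otherwise the built-in Iceberg reader is the fallback.
  static ParquetReaderType choose(ParquetReaderType requested, boolean cometAvailable) {
    if (requested == ParquetReaderType.COMET && cometAvailable) {
      return ParquetReaderType.COMET;
    }
    return ParquetReaderType.ICEBERG;
  }
}
```

A caller would pass the result of `isClassPresent` for some class shipped in the Comet jar as `cometAvailable`, so that a missing library degrades to the Iceberg reader instead of failing at read time.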

The logic in CometIcebergVectorizedReaderBuilder is very similar to VectorizedReaderBuilder. It builds Comet column readers instead of Iceberg column readers.

The delete logic in CometIcebergColumnarBatchReader is exactly the same as the one in ColumnarBatchReader. I will extract the common code into a base class.

The main motivation of this PR is to improve performance using native execution. Comet's Parquet reader is a hybrid implementation: IO and decompression are done in the JVM while decoding is done natively. There is some performance gain from native decoding, but the gain is not much. However, by switching to the Comet Parquet reader, Comet will recognize that this is a Comet scan and will convert the Spark physical plan into a Comet plan for native execution. The major performance gain will be from this native execution.

huaxingao · 2024-03-01T02:41:26Z

cc @aokolnychyi @sunchao

...k/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergColumnReader.java

aokolnychyi

I think this is the right direction to take. I did an initial high-level pass. Looking forward to having a Comet release soon.

...k/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergColumnReader.java

aokolnychyi · 2024-04-16T03:57:07Z

spark/v3.4/build.gradle

    }

+    compileOnly "org.apache.comet:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.1.0-SNAPSHOT"


I assume this library will only contain the reader, not the operators.

Right. This only contains the reader.

Does it need to be Spark Version Dependent? Just wondering

We are currently doing some experiments to see if we can provide a Spark Version independent jar.

+1 for exploring that.

...ain/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergColumnarBatchReader.java

...rk/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java

...in/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergPositionColumnReader.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkConfParser.java

...v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/BaseColumnBatchLoader.java

....4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometColumnReader.java

api/src/main/java/org/apache/iceberg/ReaderType.java

aokolnychyi · 2024-04-22T22:27:03Z

build.gradle

@@ -45,6 +45,7 @@ buildscript {
  }
 }

+String sparkMajorVersion = '3.4'


I hope we can soon have a snapshot for Comet jar independent of Spark to clean up deps here.
We can't have parquet module depend on a jar with any Spark deps.

spark/v3.4/build.gradle

aokolnychyi · 2024-04-22T22:27:57Z

spark/v3.4/build.gradle

    }

+    compileOnly "org.apache.comet:comet-spark-spark${sparkMajorVersion}_${scalaVersion}:0.1.0-SNAPSHOT"


+1 for exploring that.

gradle.properties

aokolnychyi · 2024-04-23T00:54:35Z

...v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/BaseColumnBatchLoader.java

+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+
+@SuppressWarnings("checkstyle:VisibilityModifier")


These changes would require a bit more time to review. I'll do that tomorrow. I think we would want to restructure the original implementation a bit. Not a concern for now.

We would want to structure this a bit differently. Let me think more.

...rk/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkColumnarReaderFactory.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java

huaxingao · 2024-04-29T16:59:05Z

@aokolnychyi I have addressed the comments. Could you please take one more look when you have a moment? Thanks a lot!

aokolnychyi · 2024-04-30T17:27:41Z

Will check today.

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/ParquetReaderType.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java

aokolnychyi · 2024-04-30T19:04:36Z

...v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/BaseColumnBatchLoader.java

+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarBatch;
+
+@SuppressWarnings("checkstyle:VisibilityModifier")


We would want to structure this a bit differently. Let me think more.

...k/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java

....4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometColumnReader.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkColumnarReaderFactory.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/BatchReadConf.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkSQLProperties.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/CometColumnReader.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/BaseBatchReader.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/source/SparkBatch.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/BatchReadConf.java

cornelcreanga · 2024-06-20T14:09:01Z

@huaxingao Hi, does the Comet Parquet reader support page skipping (using page indexes)? For example, see #193, the initial issue for the Iceberg Parquet reader.

huaxingao · 2024-06-20T15:41:53Z

@cornelcreanga The Comet Parquet reader doesn't support page skipping yet.

PaulLiang1 · 2024-09-04T04:13:51Z

Hey @huaxingao,
we are really interested in this feature. Is there anything we can do to help get this integrated?

huaxingao · 2024-09-04T04:25:24Z

@PaulLiang1 Thank you for your interest! We are currently working on a binary release of DataFusion Comet. Once the binary release is available, I will proceed with this PR.

PaulLiang1 · 2024-09-04T04:53:39Z

@huaxingao
I think we have an internal build of DataFusion Comet and publish a JAR internally.
Is there anything we can help with on that front?

Thanks

huaxingao · 2024-09-04T05:24:49Z

@PaulLiang1 Thanks! I'll check with my colleague tomorrow to find out where we are in the binary release process.

huaxingao · 2024-09-05T04:50:30Z

@PaulLiang1 We are pretty close to this and will have a binary release for Comet soon.

PaulLiang1 · 2024-09-05T05:00:49Z

> @PaulLiang1 Thanks! I'll check with my colleague tomorrow to find out where we are in the binary release process.

Got it, thanks for letting me know. Please feel free to let us know if there is anything we can help with. Thanks!

findepi · 2024-10-03T08:28:39Z

That's an interesting proposal for sure!
The PR description very nicely describes the intent of integrating Comet, thank you!
It would be helpful if it also explained the benefits of doing so.
From #9841 (comment) I infer that the main value driver for the change is performance. With that in mind, would you mind doing some comparative benchmarking between the current Iceberg reader, the Comet reader, and Trino's Iceberg reader? It would be good to quantify the gains of going native.

cc @raunaqmorarka @sopel39

huaxingao · 2024-10-04T02:43:17Z

@findepi We have TPCH benchmark results for Parquet files (without iceberg) on the DataFusion Comet website. We are still working on improving performance. We don't have benchmarks for Iceberg yet. We will post the benchmarks once they are available.

findepi · 2024-10-04T11:17:08Z

Query-level benchmarks like TPCH aren't needed, I guess. It would be good to know how the different Parquet reader implementations compare against each other, so that we know what we gain by embracing native.

huaxingao · 2024-10-05T21:08:43Z

@findepi Comet's Parquet reader is a hybrid implementation: IO and decompression are done in the JVM while decoding is done natively. There is some performance gain from native decoding, but the gain is not substantial. However, by switching to the Comet Parquet reader, Comet will recognize that this is a Comet scan and will convert the Spark physical plan into a Comet plan for native execution. The major performance gain will be from this native execution.

findepi · 2024-10-12T18:30:09Z

thanks @huaxingao for this additional explanation. This makes sense. Can you please replicate that information in the PR description as well? thanks!

huaxingao · 2024-10-21T21:18:47Z

@aokolnychyi We finally have a Comet binary release, and I've updated the PR to use it. Could you please take a look at the PR when you have time? Thanks a lot!

bmorck · 2024-10-28T00:55:22Z

@huaxingao Very interested in this work, and thanks a bunch for taking this on! Wanted to see if this PR covers all the changes needed on the Iceberg side to integrate Comet with Iceberg. For example, do we also need to do things such as:

Have SparkBatchQueryScan implement the org.apache.comet.parquet.SupportsComet interface?

Wanted to also see if you think this is at a place where we can try to port this change into our iceberg fork? Also wanted to see if there is anything we can do to help out here!

huaxingao · 2024-10-29T06:12:16Z

@bmorck Thanks for your interest! Currently, this PR only enables the CometBatchReader for batch reading; it does not yet turn on Comet's native operators. In the next step, I will make SparkBatchQueryScan implement the SupportsComet interface to enable all of Comet's native operators. I will proceed with the second step as soon as the review is completed.
Yes, you can port this change into your Iceberg fork, or wait a couple of weeks until my second step is complete if you want the native operators to be activated.
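To make the second step more concrete, here is a minimal sketch of a scan opting in through a marker-style interface. Everything in it is an assumption for illustration: the local `SupportsComet` is a stand-in for org.apache.comet.parquet.SupportsComet (its single `isCometEnabled()` method is assumed, not confirmed by this thread), and `SketchBatchQueryScan` and `PlannerSketch` are hypothetical simplifications rather than the real SparkBatchQueryScan or Comet's planner rules.

```java
// Stand-in for org.apache.comet.parquet.SupportsComet; the method shown is an assumption.
interface SupportsComet {
  boolean isCometEnabled();
}

// Hypothetical, heavily simplified scan; the real SparkBatchQueryScan does much more.
final class SketchBatchQueryScan implements SupportsComet {
  private final boolean cometReaderSelected;

  SketchBatchQueryScan(boolean cometReaderSelected) {
    this.cometReaderSelected = cometReaderSelected;
  }

  // Reports Comet as enabled only when the Comet batch reader was chosen for this scan.
  @Override
  public boolean isCometEnabled() {
    return cometReaderSelected;
  }
}

// Hypothetical engine-side check for whether a scan may be planned natively.
final class PlannerSketch {
  private PlannerSketch() {}

  // A scan qualifies only if it implements the interface and reports Comet as enabled.
  static boolean eligibleForNative(Object scan) {
    return scan instanceof SupportsComet && ((SupportsComet) scan).isCometEnabled();
  }
}
```

The point of the interface check is that scans which never opt in are left untouched, so enabling native operators stays an explicit decision on the Iceberg side.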

github-actions · 2024-11-29T00:15:50Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the Iceberg dev mailing list. Thank you for your contributions.

github-actions bot added spark build labels Mar 1, 2024

huaxingao mentioned this pull request Mar 1, 2024

Dynamically support Spark native engine in Iceberg #9826

Closed

huaxingao mentioned this pull request Mar 5, 2024

Dynamically support Spark native engine in Iceberg #9721

Closed

sunchao mentioned this pull request Mar 7, 2024

Explore integration with Delta Lake apache/datafusion-comet#174

Open

RussellSpitzer reviewed Apr 2, 2024

...k/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometIcebergColumnReader.java

aokolnychyi reviewed Apr 16, 2024

github-actions bot added the API label Apr 18, 2024

RussellSpitzer reviewed Apr 18, 2024

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkConfParser.java

RussellSpitzer reviewed Apr 18, 2024

...v3.4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/BaseColumnBatchLoader.java

RussellSpitzer reviewed Apr 18, 2024

....4/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/comet/CometColumnReader.java

aokolnychyi reviewed Apr 23, 2024

github-actions bot removed the API label Apr 26, 2024

aokolnychyi reviewed Apr 30, 2024

aokolnychyi reviewed May 3, 2024

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/SparkReadConf.java

aokolnychyi reviewed May 9, 2024

spark/v3.4/spark/src/main/java/org/apache/iceberg/spark/BatchReadConf.java

huaxingao closed this Jun 20, 2024

huaxingao reopened this Jun 20, 2024

huaxingao mentioned this pull request Sep 28, 2024

Spark partial limit push down #10943

Open

Huaxin Gao and others added 11 commits October 17, 2024 10:06

Iceberg/Comet integration

295f4e7

address comments

deff716

address comments

1aa9a33

remove unnecessary code

0841d89

address comments

b22216a

address comments

06ac803

remove unnecessary public

374cbe8

address comments

775b368

address comments

4120143

minor changes

fa5ccc8

update to use comet 0.3.0

5cd07bd

huaxingao force-pushed the comet3 branch from 7fb5579 to 5cd07bd Compare October 21, 2024 17:02

use the new Comet Utils.getColumnReader method

7d27a1d

github-actions bot added the INFRA label Oct 21, 2024

huaxingao added 2 commits October 21, 2024 12:02

change PARQUET_READER_TYPE_DEFAULT to Comet to test CometReader

925f51e

Ignore SmokeTest#testGettingStarted for now

3e96fd5

andygrove mentioned this pull request Oct 22, 2024

Add support for Iceberg apache/datafusion-comet#1028

Open

This was referenced Nov 14, 2024

Spark: Remove extra columns for ColumnBatch #11551

Open

Missing ColumnarToRow when using CometSparkToColumnar apache/datafusion-comet#1092

Open

github-actions bot added the stale label Nov 29, 2024
