
Spark: Remove extra columns for ColumnBatch #11551

Open
wants to merge 4 commits into base: main

Conversation

huaxingao
Contributor

For Equality Delete, we build a ColumnarBatchReader that includes the equality delete filter columns so we can read their values and determine which rows are deleted. If these filter columns are not among the requested columns, they are extra and should be removed before the ColumnarBatch is returned to Spark.

Suppose the table schema includes C1, C2, C3, C4, C5, the query is SELECT C5 FROM table, and the equality delete filter is on C3 and C4.

We read the values of C3 and C4 to identify which rows are deleted, but we do not want to include those values in the ColumnarBatch that we return to Spark.
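To make the removal step concrete, here is a minimal Java sketch of trimming a ColumnarBatch down to the requested columns. The class and method names are illustrative only (they are not the PR's actual code), and it assumes, as discussed later in this thread, that the delete-filter columns are appended at the end of the vector array:

import java.util.Arrays;

import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

class ExtraColumnTrimmer {
  // Keep only the first expectedColumnCount vectors; any delete-filter columns
  // appended after them are dropped before the batch is handed to Spark.
  static ColumnarBatch trim(ColumnVector[] vectors, int numRows, int expectedColumnCount) {
    if (vectors.length <= expectedColumnCount) {
      return new ColumnarBatch(vectors, numRows); // nothing extra to remove
    }
    ColumnVector[] kept = Arrays.copyOf(vectors, expectedColumnCount);
    return new ColumnarBatch(kept, numRows);
  }
}

In the C1..C5 example above, vectors would hold C5, C3, C4 and expectedColumnCount would be 1, so only C5 is returned to Spark.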

@@ -622,6 +624,41 @@ public void testPosDeletesOnParquetFileWithMultipleRowGroups() throws IOExceptio
assertThat(rowSet(tblName, tbl, "*")).hasSize(193);
}

@TestTemplate
public void testEqualityDeleteWithDifferentScanAndDeleteColumns() throws IOException {
Contributor Author

This test passes even without the fix in this PR, because the extra columns returned to Spark currently do not cause any problems. However, with Comet native execution, Comet allocates arrays in a pre-allocated list and relies on the requested schema to determine the number of columns in the batch, so this test would fail without the fix proposed in this PR.

Contributor

Is it possible to check the intermediate results instead, for example the ColumnarBatch returned to Spark? That way we could avoid taking Comet as a dependency for the test.

Contributor Author

I have changed the test to check the number of columns in the ColumnarBatch.

Comment on lines 56 to 57
// is 2. Since when creating the DeleteFilter, we append these extra columns in the end of the
// requested schema, we can just remove them from the end of the ColumnVector.
Contributor

@singhpk234 singhpk234 Nov 14, 2024

[doubt] Would it be possible to fix this at the place where these extra columns are appended to the end of the requested schema? That would probably help us avoid the extra memory in the first place, as well as the expensive copy of the ColumnarBatch.

Contributor Author

Thanks for your comment!

The extra columns are appended to the requested schema in DeleteFilter.fileProjection. The values of these extra columns are read in ColumnarBatchReader and used to identify which rows are deleted in applyEqDelete. I remove the extra columns right after calling applyEqDelete.

Contributor

@singhpk234 singhpk234 Nov 18, 2024

Thank you for the response!

[doubt] Considering that applyEqDelete already does another projection on top of the schema returned from DeleteFilter.fileProjection:

Schema deleteSchema = TypeUtil.select(requiredSchema, ids);

could we instead add another parameter to fileProjection, like we did here, to include the additional fields based on a boolean flag? That way we would request only the columns we actually need in the first place, and avoid removing extra columns after filter evaluation.
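For context, here is a small self-contained illustration of the TypeUtil.select call quoted above. The schema and field ids mirror the C1..C5 example from the description and are assumptions made for illustration; this is not code from the PR:

import java.util.Set;

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.TypeUtil;
import org.apache.iceberg.types.Types;

public class DeleteSchemaExample {
  public static void main(String[] args) {
    // Required schema after fileProjection: the requested column C5 plus the
    // equality delete columns C3 and C4 appended at the end.
    Schema requiredSchema =
        new Schema(
            Types.NestedField.required(5, "C5", Types.IntegerType.get()),
            Types.NestedField.required(3, "C3", Types.IntegerType.get()),
            Types.NestedField.required(4, "C4", Types.IntegerType.get()));

    // Field ids of the equality delete columns.
    Set<Integer> ids = Set.of(3, 4);

    // Projects only the columns needed to evaluate the equality delete.
    Schema deleteSchema = TypeUtil.select(requiredSchema, ids);
    System.out.println(deleteSchema);
  }
}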

Contributor Author

Thanks for taking a look!
We traverse the schema and build a VectorizedReader for each column in VectorizedReaderBuilder, and this happens before DeleteFilter.fileProjection.

@huaxingao
Contributor Author

cc @flyrain @szehon-ho @viirya

Contributor

@flyrain flyrain left a comment

Thanks for working on it, @huaxingao! Left some comments.

@@ -245,5 +259,16 @@ void applyEqDelete(ColumnarBatch columnarBatch) {

columnarBatch.setNumRows(currentRowId);
}

ColumnarBatch removeExtraColumnsFromColumnarBatch(
Contributor

Minor suggestion: Simplify to removeExtraColumns?

Contributor Author

Done. Thanks


@@ -125,4 +126,25 @@ protected VectorizedReader<?> vectorizedReader(List<VectorizedReader<?>> reorder
return reader;
}
}

private static int numOfExtraColumns(DeleteFilter deleteFilter) {
Contributor

minor: DeleteFilter -> DeleteFilter<InternalRow>

Contributor Author

Fixed. Thanks

Comment on lines 133 to 141
// For Equality Delete, the requiredColumns and expectedColumns may not be the
// same. For example, suppose the table schema is C1, C2, C3, C4, C5, the query is
// SELECT C5 FROM table, and the equality delete filter is on C3, C4. Then
// the requestedSchema is C5, and the required schema is C5, C3 and C4. The
// vectorized reader also needs to read the C3 and C4 columns to figure out
// which rows are deleted. However, after figuring out the deleted rows, the
// extra column values do not need to be returned to Spark.
// We compute numOfExtraColumns so we can remove these extra columns
// from the ColumnarBatch later.
Contributor

Can we move the comment to the method's Javadoc?

Contributor Author

Moved. Thanks

// from ColumnBatch later.
List<Types.NestedField> requiredColumns = deleteFilter.requiredSchema().columns();
List<Types.NestedField> expectedColumns = deleteFilter.requestedSchema().columns();
return requiredColumns.size() - expectedColumns.size();
Contributor

Are we sure the extra columns are consistently appended to the end of the array?

Contributor Author

Yes, because the extra columns are appended to the end of requestedSchema in DeleteFilter.fileProjection.
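For readers of the thread, a hedged sketch of that ordering guarantee (illustrative only, not the actual DeleteFilter.fileProjection code): the requested columns come first and the columns needed only for delete evaluation are appended after them, which is why trimming from the end of the batch recovers the requested projection.

import java.util.ArrayList;
import java.util.List;

import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;

class FileProjectionSketch {
  // Conceptual only: append the columns that are missing from the requested
  // schema (e.g. equality delete columns) after the requested columns, so the
  // extra columns always occupy the tail of the resulting schema and batch.
  static Schema appendMissingColumns(Schema requested, List<Types.NestedField> missing) {
    List<Types.NestedField> columns = new ArrayList<>(requested.columns());
    columns.addAll(missing);
    return new Schema(columns);
  }
}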

@@ -73,6 +74,7 @@ protected DeleteFilter(
boolean needRowPosCol) {
this.filePath = filePath;
this.counter = counter;
this.requestedSchema = requestedSchema;
Contributor

Can we be consistent with the name? requestedSchema or expectedSchema? I guess expectedSchema is more commonly used.

Contributor Author

Changed to expectedSchema


public ColumnarBatchReader(List<VectorizedReader<?>> readers) {
public ColumnarBatchReader(List<VectorizedReader<?>> readers, int numExtraCol) {
Contributor

Can we use the DeleteFilter (the deletes field) within the class, so that no extra parameter is required? We could move the numOfExtraColumns method into this class in that case.

Contributor Author

Changed. Thanks

Contributor

@flyrain flyrain left a comment

Thanks, @huaxingao, for the update! The changes look great overall. I've left a minor refactoring suggestion for consideration. 😊

Comment on lines +663 to +664
int numOfCols = columnarBatch.numCols();
assertThat(numOfCols).as("Number of columns").isEqualTo(1);
Contributor

Minor: also check the column type to make sure dt is removed, like the following?

          // only the expected column(id) is kept
          assertThat(columnarBatch.numCols()).as("Number of columns").isEqualTo(1);
          assertThat(columnarBatch.column(0).dataType()).as("Column type").isEqualTo(IntegerType);

Contributor Author

Added. Thanks!

Comment on lines 275 to 286
int numOfExtraColumns = numOfExtraColumns(deletes);
if (numOfExtraColumns > 0) {
int newLength = arrowColumnVectors.length - numOfExtraColumns;
// In DeleteFilter.fileProjection, the columns for missingIds (the columns required
// for equality delete or ROW_POSITION) are appended to the end of the expectedSchema.
// Therefore, these extra columns can be removed from the end of arrowColumnVectors.
ColumnVector[] newColumns = Arrays.copyOf(arrowColumnVectors, newLength);
return new ColumnarBatch(newColumns, columnarBatch.numRows());
} else {
return columnarBatch;
}
}
Contributor

How about refactoring it like this, so that the numOfExtraColumns method is not needed? Can we also move all related comments to this method's Javadoc?

      int expectedColumnSize = deletes.expectedSchema().columns().size();
      if (arrowColumnVectors.length > expectedColumnSize) {
        ColumnVector[] newColumns = Arrays.copyOf(arrowColumnVectors, expectedColumnSize);
        return new ColumnarBatch(newColumns, columnarBatch.numRows());
      } else {
        return columnarBatch;
      }

Contributor Author

Good suggestion! Changed.
