Core: add JSON serialization for BaseFilesTable.ManifestReadTask, AllManifestsTable.ManifestListReadTask, and BaseEntriesTable.ManifestReadTask #10735

stevenzwu · 2024-07-21T04:50:52Z

This completes the JSON parser for scan tasks. The scan tasks added here are for metadata tables.

This would unblock Flink from switching to the FLIP-27 source as the default. Flink unit tests pass with FLIP-27 as the default, except for one test with a limit clause (which I will follow up on separately).

Closes issue #9597.

stevenzwu · 2024-07-21T05:24:34Z

core/src/main/java/org/apache/iceberg/AllManifestsTable.java

@@ -158,12 +168,14 @@ static class ManifestListReadTask implements DataTask {
    private DataFile lazyDataFile = null;

    ManifestListReadTask(
+        Schema dataTableSchema,


Needs the table schema to be able to parse the partition spec.

The other schema arg is for MANIFEST_FILE_SCHEMA.

stevenzwu · 2024-07-21T05:25:19Z

core/src/main/java/org/apache/iceberg/BaseEntriesTable.java

@@ -304,11 +306,6 @@ static class ManifestReadTask extends BaseFileScanTask implements DataTask {
              : new Schema();
    }

-    @VisibleForTesting


Not just for testing anymore; the JSON parser needs this getter too.

stevenzwu · 2024-07-21T05:27:51Z

core/src/main/java/org/apache/iceberg/BaseEntriesTable.java

    ManifestReadTask(
-        Table table,


Did a bit of refactoring. This constructor can be used by the JSON parser; the other constructor above keeps its previous purpose in the planFiles method.
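As a rough illustration of the two-constructor split described here, the sketch below has a planning-time constructor that pulls fields out of a table and a second constructor that accepts those fields directly, which is what a parser can call after deserialization. All names (`ManifestReadTaskSketch`, `TableStub`, `ReadTask`, the field names) are hypothetical stand-ins, and the delegation from one constructor to the other is an assumption, not necessarily how BaseEntriesTable.ManifestReadTask is written.

```java
import java.util.Collections;
import java.util.Map;

/** Sketch only: every type here is a hypothetical stand-in, not an Iceberg class. */
class ManifestReadTaskSketch {
  /** Stand-in for a table handle that is only available while planning. */
  static class TableStub {
    final String ioImpl;
    final Map<Integer, String> specsById;

    TableStub(String ioImpl, Map<Integer, String> specsById) {
      this.ioImpl = ioImpl;
      this.specsById = specsById;
    }
  }

  static class ReadTask {
    final String ioImpl;
    final Map<Integer, String> specsById;
    final String manifestPath;

    // Planning path: extract what the task needs from the table.
    ReadTask(TableStub table, String manifestPath) {
      this(table.ioImpl, table.specsById, manifestPath);
    }

    // Parser path: fields arrive already deserialized, with no table in scope.
    ReadTask(String ioImpl, Map<Integer, String> specsById, String manifestPath) {
      this.ioImpl = ioImpl;
      this.specsById = specsById;
      this.manifestPath = manifestPath;
    }
  }

  public static void main(String[] args) {
    Map<Integer, String> specs = Collections.singletonMap(0, "unpartitioned");
    ReadTask planned = new ReadTask(new TableStub("hadoop", specs), "m0.avro");
    ReadTask parsed = new ReadTask("hadoop", specs, "m0.avro");
    // Both paths end up with the same field values.
    System.out.println(planned.manifestPath.equals(parsed.manifestPath));
  }
}
```

Keeping the field-level constructor as the single place where state is assigned means the planning path and the parser path cannot drift apart.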

stevenzwu · 2024-07-21T05:29:55Z

core/src/main/java/org/apache/iceberg/GenericManifestFile.java

@@ -105,6 +105,42 @@ public GenericManifestFile(Schema avroSchema) {
    this.keyMetadata = null;
  }

+  /** Adjust the arg order to avoid conflict with the public constructor below */
+  GenericManifestFile(


Used by the ManifestFileParser introduced in this PR.
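To show the general idea behind reordering arguments, here is a minimal sketch with a hypothetical class. It assumes the simplest form of conflict, two constructors whose parameter types would otherwise be identical; the real GenericManifestFile signatures are not shown in this thread and may differ.

```java
/** Sketch only: a hypothetical class, not GenericManifestFile's real signatures. */
class ConstructorOrderSketch {
  final String path;
  final long length;
  final Long snapshotId;
  final String createdBy;

  // Stand-in for an existing public constructor: (String, long, Long).
  public ConstructorOrderSketch(String path, long length, Long snapshotId) {
    this.path = path;
    this.length = length;
    this.snapshotId = snapshotId;
    this.createdBy = "public";
  }

  // A second (String, long, Long) constructor would be a compile error,
  // so the parser-facing one takes the same values in a different order.
  ConstructorOrderSketch(Long snapshotId, String path, long length) {
    this.path = path;
    this.length = length;
    this.snapshotId = snapshotId;
    this.createdBy = "parser";
  }

  public static void main(String[] args) {
    ConstructorOrderSketch a = new ConstructorOrderSketch("m0.avro", 10L, 1L);
    ConstructorOrderSketch b = new ConstructorOrderSketch(1L, "m0.avro", 10L);
    System.out.println(a.createdBy + " / " + b.createdBy);
  }
}
```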

stevenzwu · 2024-07-21T05:32:17Z

core/src/main/java/org/apache/iceberg/ScanTaskParser.java

+    DATA_TASK("data-task"),
+    FILES_TABLE_TASK("files-table-task"),
+    ALL_MANIFESTS_TABLE_TASK("all-manifests-table-task"),
+    MANIFEST_ENTRIES_TABLE_TASK("manifest-entries-task");


This is for the BaseEntriesTable task; BaseEntriesTable is the base class for ManifestEntriesTable and AllEntriesTable. I didn't want to call it base-entries-task, as it wouldn't be clear what kind of task it is.
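For illustration, a minimal sketch of how such type names might map back to enum constants when parsing. The four constants and their strings are copied from the diff above; the `ScanTaskTypeSketch` wrapper class, the `typeName()` accessor, and the `fromTypeName` lookup are assumptions made for this example and are not necessarily what ScanTaskParser does.

```java
/** Sketch only: constants copied from the diff; everything else is hypothetical. */
class ScanTaskTypeSketch {
  enum TaskType {
    DATA_TASK("data-task"),
    FILES_TABLE_TASK("files-table-task"),
    ALL_MANIFESTS_TABLE_TASK("all-manifests-table-task"),
    MANIFEST_ENTRIES_TABLE_TASK("manifest-entries-task");

    private final String typeName;

    TaskType(String typeName) {
      this.typeName = typeName;
    }

    String typeName() {
      return typeName;
    }

    // Hypothetical lookup: resolve the serialized type string to a constant.
    static TaskType fromTypeName(String value) {
      for (TaskType type : values()) {
        if (type.typeName.equals(value)) {
          return type;
        }
      }
      throw new IllegalArgumentException("Unknown task type: " + value);
    }
  }

  public static void main(String[] args) {
    // One serialized name covers tasks from both entries tables.
    TaskType type = TaskType.fromTypeName("manifest-entries-task");
    System.out.println(type + " <- " + type.typeName());
  }
}
```

Because the serialized string is decoupled from the Java constant name, the wire name can describe the task to readers of the JSON while the constant follows the class hierarchy.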

stevenzwu · 2024-07-21T05:34:10Z

core/src/main/java/org/apache/iceberg/hadoop/HadoopFileIO.java

@@ -71,6 +72,13 @@ public HadoopFileIO(Configuration hadoopConf) {

  public HadoopFileIO(SerializableSupplier<Configuration> hadoopConf) {
    this.hadoopConf = hadoopConf;
+    Map<String, String> props = Maps.newHashMapWithExpectedSize(hadoopConf.get().size());


These bugs were discovered during a serialization test (I don't remember whether it was in the core module or the Flink module).

Can we extract this into a separate PR (with separate tests)?

#10926

Will revert the change in this PR.

RussellSpitzer · 2024-07-22T16:17:23Z

.palantir/revapi.yml

@@ -874,6 +874,10 @@ acceptedBreaks:
      justification: "Static utility class - should not have public constructor"
  "1.4.0":
    org.apache.iceberg:iceberg-core:
+    - code: "java.class.defaultSerializationChanged"


This is in the wrong version, isn't it?

Great catch. The initial change and the gradle revapiAcceptBreak task were run a few months ago. I will need to re-run it, maybe after the 1.16 release.

@stevenzwu can you please rebase the PR and make sure to run the revapi task, which will automatically add this to the correct version in this file?

The diff in the file is still messed up.

Fixed. I first copied the revapi.yml file from the latest main branch, then applied revapiAcceptBreak. There were a couple of additional diffs from the revapi action.

➜ iceberg git:(issue-9597-manifest-task) ✗ ./gradlew :iceberg-core:revapiAcceptBreak --justification "Serialization across versions is not supported" \
    --code "java.class.defaultSerializationChanged" \
    --old "class org.apache.iceberg.GenericManifestFile" \
    --new "class org.apache.iceberg.GenericManifestFile"

RussellSpitzer · 2024-07-22T16:23:27Z