
Core: fix NPE with HadoopFileIO because FileIOParser doesn't serialize Hadoop configuration #10926

Merged
5 commits merged into apache:main from fix-hadoop-fileio-initialize on Oct 17, 2024

Conversation


@stevenzwu stevenzwu commented Aug 13, 2024

This can happen in the FileIOParser serialization and deserialization scenario. FileIOParser doesn't serialize and deserialize the Hadoop configuration, so the deserialized io object is not valid because hadoopConf is null. An NPE would be thrown if newInputFile is called on the deserialized object.

This is a blocker for Flink to switch to the JSON parser for scan tasks of metadata tables (like manifests).
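For illustration, here is a minimal sketch of how the failure surfaces. The String-returning toJson overload and the example path are assumptions for this sketch:

    // Round-trip a HadoopFileIO through the JSON parser.
    HadoopFileIO io = new HadoopFileIO(new Configuration());

    // toJson writes only the impl class name and the string properties;
    // the Hadoop Configuration is dropped from the JSON.
    String json = FileIOParser.toJson(io);

    // fromJson(String) passes a null conf through, so the deserialized
    // instance is left with a null hadoopConf supplier.
    FileIO deserialized = FileIOParser.fromJson(json);

    // NPE here: newInputFile() calls hadoopConf.get() on the null supplier.
    deserialized.newInputFile("file:/tmp/metadata/v1.metadata.json");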

@stevenzwu stevenzwu requested a review from nastra August 13, 2024 04:12
@github-actions github-actions bot added the core label Aug 13, 2024
public HadoopFileIO() {
  // Create a default hadoopConf as it is required for the object to be valid.
  // E.g. newInputFile would throw NPE with hadoopConf.get() otherwise.
  this.hadoopConf = new SerializableConfiguration(new Configuration())::get;
}
stevenzwu (Contributor, Author) commented on the diff:

FileIOParser doesn't serialize and deserialize the Hadoop configuration, so the deserialized io object is not valid because hadoopConf is null. An NPE would be thrown if newInputFile is called on the deserialized object.

I'm not sure whether we want to change FileIOParser to serialize the Hadoop Configuration object or not. Regardless, this change seems reasonable, as this class requires hadoopConf to not be null.

nastra (Contributor) replied:

I thought about this problem a bit more. The one downside I see with this change is that HadoopFileIO will typically be instantiated via CatalogUtil.loadFileIO, which will then set the conf. Also, the javadoc of the constructor mentions that the conf must be set by calling setConf().

Instead of serializing the conf in the parser, maybe the conf should be initialized in FileIOParser.fromJson(String) instead of passing a null conf in? I think that would solve the issue while also not instantiating a new conf every time a new instance of HadoopFileIO is created.

stevenzwu (Contributor, Author) replied on Aug 15, 2024:

If the FileIO is Configurable, create a default Configuration object and call setConf()? Is that what you mean? That would add a Hadoop class dependency to FileIOParser, though.

I have also thought about implementing the serialization of the Hadoop Configuration in FileIOParser, but it would add some complexity to FileIOParser and we likely wouldn't achieve 100% fidelity with the original object. E.g., we may only serialize the key-value string pairs from the configuration as a JSON object. It would also have the problem of adding a Hadoop class dependency to FileIOParser.

I also thought about using this method from SerializationUtil, but it is Java serialization (not JSON serialization):

  /**
   * Serialize an object to bytes. If the object implements {@link HadoopConfigurable}, its Hadoop
   * configuration will be serialized into a {@link SerializableConfiguration}.
   *
   * @param obj object to serialize
   * @return serialized bytes
   */
  public static byte[] serializeToBytes(Object obj) {
    return serializeToBytes(obj, conf -> new SerializableConfiguration(conf)::get);
  }
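For comparison, the binary round trip does preserve the conf. A sketch, assuming deserializeFromBytes is the matching counterpart in SerializationUtil:

    // Java (not JSON) serialization round trip: the conf survives because
    // HadoopFileIO implements HadoopConfigurable, so its conf is wrapped in a
    // SerializableConfiguration before writing.
    byte[] bytes = SerializationUtil.serializeToBytes(hadoopFileIO);
    FileIO copy = SerializationUtil.deserializeFromBytes(bytes);
    copy.newInputFile("file:/tmp/metadata/v1.metadata.json"); // no NPE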

nastra (Contributor) replied:

Ah, you're right that this will add a Hadoop dependency, and we don't want that. Rather than setting a conf in the constructor, maybe we should have some handling in conf(), such as:

   public Configuration conf() {
-    return hadoopConf.get();
+    return Optional.ofNullable(hadoopConf).map(Supplier::get).orElseGet(Configuration::new);
   }

We would need to adjust all the places that call hadoopConf.get() to call conf() after this change. That way we wouldn't need to unnecessarily instantiate a conf in the constructor when we don't have to. Wdyt?

stevenzwu (Contributor, Author) replied:

The conf() method approach is fine.

FileIOParser doesn't faithfully serialize and deserialize HadoopFileIO (the Hadoop configuration is not carried over). The deserialized HadoopFileIO may be missing important configs, which can be a problem.

Looking at CatalogUtil, there are three components defining a FileIO, and FileIOParser is missing the conf:

FileIO loadFileIO(String impl, Map<String, String> properties, Object hadoopConf)

I am wondering if we should change FileIOParser to serialize and deserialize the Hadoop Configuration when the FileIO is HadoopConfigurable. We could probably serialize only the key-value string pairs from the Configuration as a JSON object (a kind of read-only copy).
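A rough sketch of that idea (the hadoop-conf field name and the helper are hypothetical, not part of FileIOParser today):

    // Hypothetical helper: write only the Configuration's key-value pairs as
    // a JSON object; Configuration is Iterable<Map.Entry<String, String>>.
    private static void writeHadoopConf(Configuration conf, JsonGenerator generator)
        throws IOException {
      generator.writeObjectFieldStart("hadoop-conf"); // illustrative field name
      for (Map.Entry<String, String> entry : conf) {
        generator.writeStringField(entry.getKey(), entry.getValue());
      }
      generator.writeEndObject();
    }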

BTW, ResolvingFileIO and HadoopConfigurable also have a Hadoop class dependency. There was a discussion of potentially moving HadoopCatalog to a separate iceberg-hadoop module; I guess we can't move ResolvingFileIO and HadoopConfigurable then.

stevenzwu (Contributor, Author) replied on Aug 29, 2024:

> I have seen that we use new Configuration(false) in the code, so we allow the user to provide a trimmed configuration; in this case the serialized config is quite small for the binary serialization. We might have to do something similar for the JSON serialization.

Are you saying we should trim out the key-value pairs that match the default configuration, as with new Configuration(false)? We can potentially do that, but it still has the implication that it depends on the runtime env: if the other side (deserialization) has a different env (default config), the result can differ.
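For reference, new Configuration(false) skips loading the default resources, so only explicitly set entries survive. A small illustrative check:

    Configuration trimmed = new Configuration(false); // no core-default.xml / core-site.xml
    trimmed.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical override
    System.out.println(trimmed.size());             // 1: only the explicit entry
    System.out.println(new Configuration().size()); // hundreds of host defaults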

> The ManifestListReadTask.rows() and ManifestListReadTask.file() are using the io to get the new input file, like io.newInputFile(manifestListLocation).

FileIO is used to read from manifest files. ManifestFiles.read is a widely used API:

    private CloseableIterable<? extends ContentFile<?>> files(Schema fileProjection) {
      switch (manifest.content()) {
        case DATA:
          return ManifestFiles.read(manifest, io, specsById).project(fileProjection);
        case DELETES:
          return ManifestFiles.readDeleteManifest(manifest, io, specsById).project(fileProjection);
...
    }

> FileIO serialization will become even more complicated when manifest file encryption arrives here. We will need to apply the encryption for the FileIO.

From the usage, EncryptingFileIO seems to be used only as a derived object; it is not part of the table ops state. It shouldn't need to be serialized: only the original FileIO and EncryptionManager need to be serialized.

  protected EncryptedOutputFile newManifestOutputFile() {
    String manifestFileLocation =
        ops.metadataFileLocation(
            FileFormat.AVRO.addExtension(commitUUID + "-m" + manifestCount.getAndIncrement()));
    return EncryptingFileIO.combine(ops.io(), ops.encryption())
        .newEncryptingOutputFile(manifestFileLocation);
  }

pvary (Contributor) replied:

> but it still has the implication that it depends on the runtime env: if the other side (deserialization) has a different env (default config), the result can differ.

Yes, but I think (not tested) we would suffer from the same issue if the configuration provided through the Catalog was created as Configuration(false).

> FileIO is used to read from manifest files. ManifestFiles.read is a widely used API.

Yes, but in this case the reader provides the FileIO independently from the task.

> only the original FileIO and EncryptionManager need to be serialized.

Yes. Encryption should be handled.

stevenzwu (Contributor, Author) replied:

@pvary if I understand correctly, your point is that FileIO probably shouldn't be part of the task state for BaseEntriesTable.ManifestReadTask and AllManifestsTable.ManifestListReadTask? Then we wouldn't need to serialize FileIO for those manifest-related tasks. That is a fair question, but it seems like a large/challenging refactoring. Looking for other folks' take on this.

Regardless, the issue remains that FileIOParser doesn't serialize HadoopFileIO faithfully. I don't know if the REST catalog will have any need to use it to JSON-serialize FileIO in the future.

pvary (Contributor) replied:

@stevenzwu:

> if I understand correctly, your point is that FileIO probably shouldn't be part of the task state for BaseEntriesTable.ManifestReadTask and AllManifestsTable.ManifestListReadTask? [..] It seems like a large/challenging refactoring. Looking for other folks' take on this.

Yes, this seems strange and problematic to me. I have not been able to find an easy solution yet. I was hoping others with better knowledge might have some ideas.

> Regardless, the issue remains that FileIOParser doesn't serialize HadoopFileIO faithfully. I don't know if the REST catalog will have any need to use it to JSON-serialize FileIO in the future.

My point here is that, since we use Configuration(false) in some cases, and given how the current serialization works, we already don't serialize HadoopFileIO faithfully. So if we don't find a solution for getting rid of the FileIO, we might as well write our own "unfaithful" serialization which mimics how the current serialization works.

stevenzwu (Contributor, Author) replied:

@pvary coming back to the size concern of serializing the Hadoop Configuration: that is already the case with Java serialization for the manifest tasks, so implementing JSON serialization for the Hadoop configuration would not make things worse.

Whether we can avoid the need for FileIO in the manifest tasks can be a separate, larger discussion.

@stevenzwu stevenzwu force-pushed the fix-hadoop-fileio-initialize branch 2 times, most recently from 3618660 to 8b0bfbe Compare August 13, 2024 17:56
@stevenzwu stevenzwu changed the title Core: create an empty Hadoop config if not provided in constructor Core: create an default Hadoop config if not provided in constructor Aug 13, 2024
@stevenzwu stevenzwu changed the title Core: create an default Hadoop config if not provided in constructor Core: create a default Hadoop config if not provided in constructor Aug 13, 2024
@stevenzwu stevenzwu changed the title Core: create a default Hadoop config if not provided in constructor Core: fix NPE with HadoopFileIO with Hadoop conf is not set Aug 16, 2024
@stevenzwu stevenzwu changed the title Core: fix NPE with HadoopFileIO with Hadoop conf is not set Core: fix NPE with HadoopFileIO because FileIOParser doesn't serialize Hadoop configuration Aug 28, 2024
@stevenzwu stevenzwu added this to the Iceberg 1.7.0 milestone Aug 29, 2024
rdblue (Contributor) commented Sep 27, 2024:

I don't think that we should change how this works. A Hadoop Configuration MUST be provided externally. FileIO serialization is not intended to send the entire Hadoop Configuration and should remain separate.

stevenzwu (Contributor, Author) commented Sep 27, 2024:

> I don't think that we should change how this works. A Hadoop Configuration MUST be provided externally.

This makes sense. We already have some consensus that the current PR needs to be updated; we just don't have consensus on how to fix the NPE problem where the deserialized HadoopFileIO has a null Hadoop Configuration object.

> FileIO serialization is not intended to send the entire Hadoop Configuration and should remain separate.

This is not super clear to me. What if users loaded the FileIO with customized/overridden properties in the Hadoop Configuration object? A default Hadoop Configuration object loaded on the receiving host won't contain those overrides.

CatalogUtil.loadFileIO(impl, properties, conf)

@pvary had a related concern about the size of the Hadoop Configuration entries if we serialize all the properties (most of them are default properties loaded from the host). What if we just serialize the overridden properties (assuming the sender and receiver sides have the same host-level Hadoop conf XML)?
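A sketch of that trimming idea (hypothetical; it assumes both sides load the same host defaults):

    // Hypothetical: keep only the entries that differ from the host defaults.
    Configuration defaults = new Configuration(); // host-level defaults
    Map<String, String> overrides = Maps.newHashMap();
    for (Map.Entry<String, String> entry : conf) {
      if (!entry.getValue().equals(defaults.get(entry.getKey()))) {
        overrides.put(entry.getKey(), entry.getValue());
      }
    }
    // overrides could then be written as a small JSON object next to properties.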

Anyway, the current NPE problem with the deserialized HadoopFileIO needs to be fixed somehow.

@rdblue, should we adopt @nastra's suggestion of loading the Hadoop Configuration in FileIOParser? The only downside is the class dependency on Hadoop jars, which could be a problem if we want to move all Hadoop-related classes to a separate iceberg-hadoop module; then FileIOParser couldn't live in iceberg-core anymore.

  public static FileIO fromJson(String json) {
-    return fromJson(json, null);
+    return fromJson(json, new Configuration());
  }

rdblue (Contributor) commented Sep 30, 2024:

There is already a way to pass the Configuration to the FileIOParser. It is passed as Object to avoid creating a dependency on Hadoop classes.

stevenzwu (Contributor, Author) commented:

> There is already a way to pass the Configuration to the FileIOParser. It is passed as Object to avoid creating a dependency on Hadoop classes.

Then we would need to load the Hadoop configuration from three other JSON parsers for the manifest and file tasks. Those classes are also in iceberg-core.
[screenshot omitted]

These two methods are the pair that callers use. It is probably not reasonable for callers to load the Hadoop configuration when calling fromJson:

public static void toJson(FileIO io, JsonGenerator generator) throws IOException
public static FileIO fromJson(String json)

Ideally, this method should load the Hadoop Configuration object if the impl is a HadoopFileIO and the conf is null.

public static FileIO fromJson(JsonNode json, Object conf) {
    Preconditions.checkArgument(json.isObject(), "Cannot parse FileIO from non-object: %s", json);
    String impl = JsonUtil.getString(FILE_IO_IMPL, json);
    Map<String, String> properties = JsonUtil.getStringMap(PROPERTIES, json);

    if (conf == null && impl.equals(HadoopFileIO.class.getCanonicalName())) {
      conf = new Configuration();   // load the Hadoop conf from host
    }

    return CatalogUtil.loadFileIO(impl, properties, conf);
  }

stevenzwu (Contributor, Author) commented:

Let me recap the problem and the potential options.

Problem: FileIOParser doesn't serialize the Hadoop Configuration for HadoopFileIO. As a result, the deserialized HadoopFileIO hits an NPE.

Option 1: FileIOParser.fromJson loads the Hadoop Configuration from the host if the input arg is null:

public static FileIO fromJson(JsonNode json, Object conf) {
    Preconditions.checkArgument(json.isObject(), "Cannot parse FileIO from non-object: %s", json);
    String impl = JsonUtil.getString(FILE_IO_IMPL, json);
    Map<String, String> properties = JsonUtil.getStringMap(PROPERTIES, json);

    if (conf == null && impl.equals(HadoopFileIO.class.getCanonicalName())) {
      conf = new Configuration();   // load the Hadoop conf from host
    }

    return CatalogUtil.loadFileIO(impl, properties, conf);
  }
  • pro: this patch requires minimal change and doesn't introduce any behavior change.
  • con: it would introduce Hadoop class dependency in FileIOParser, which could be problematic down the road for removing Hadoop dependency from iceberg-core.

However, I feel the Hadoop class dependency is inevitable today for two reasons: (1) HadoopFileIO uses both a Hadoop Configuration and Map<String, String> properties, while other FileIO implementations like S3FileIO depend only on Map<String, String> properties; (2) FileIOParser is designed to be able to serialize and deserialize all FileIO implementations (including HadoopFileIO).

Personally, I would prefer to go forward with this stop-gap solution for now while we gather more consensus on the long-term solutions (one listed below).

Option 2: change HadoopFileIO to use Map<String, String> properties as the only configuration input.

Internally, HadoopFileIO would load the default Hadoop Configuration from the host and apply the Map<String, String> properties as overrides; see the sketch after the pros and cons below.

  • pro: this would standardize the FileIO behavior and make FileIOParser JSON serialization uniform across all FileIO implementations. Note that the FileIO interface only defines a getter for string properties: Map<String, String> properties().
  • con: it is a breaking change that modifies the HadoopFileIO construction behavior.

Option 2 is a long-term solution IMO.
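A sketch of what Option 2 could look like inside HadoopFileIO (hypothetical; FileIO already defines initialize(Map<String, String>)):

    // Hypothetical Option 2: string properties become the only config input.
    @Override
    public void initialize(Map<String, String> properties) {
      Configuration conf = new Configuration(); // load host defaults
      properties.forEach(conf::set);            // apply properties as overrides
      this.hadoopConf = new SerializableConfiguration(conf)::get;
      this.properties = SerializableMap.copyOf(properties); // serializable copy
    }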

nastra (Contributor) commented Oct 8, 2024:

> Let me recap the problem and the potential options.
>
> Problem: FileIOParser doesn't serialize the Hadoop Configuration for HadoopFileIO. As a result, the deserialized HadoopFileIO hits an NPE.
>
> Option 1: FileIOParser.fromJson loads the Hadoop Configuration from the host if the input arg is null.

@stevenzwu for Option 1, wouldn't https://github.com/apache/iceberg/pull/10926/files#r1718243019 also solve the issue with the NPE without introducing a Hadoop dependency on the Parser itself?

stevenzwu (Contributor, Author) commented Oct 9, 2024:

> @stevenzwu for Option 1, wouldn't https://github.com/apache/iceberg/pull/10926/files#r1718243019 also solve the issue with the NPE without introducing a Hadoop dependency on the Parser itself?

Agreed, it is better to avoid the Hadoop dependency on FileIOParser.

I will make a small change to your proposal: cache the loaded Hadoop Configuration object in hadoopConf to avoid repeated loading.

@stevenzwu stevenzwu force-pushed the fix-hadoop-fileio-initialize branch from 8b0bfbe to 1dfcf72 Compare October 9, 2024 16:50
@@ -74,7 +74,7 @@ public HadoopFileIO(SerializableSupplier<Configuration> hadoopConf) {
   }

   public Configuration conf() {
-    return hadoopConf.get();
+    return getConf();
   }
nastra (Contributor) commented on the diff:

I don't think it's enough to just update this call. You'd basically need to update every place in this class that uses hadoopConf.get() to call conf() or getConf().

stevenzwu (Contributor, Author) replied:

You are absolutely correct; sorry that I missed this earlier. Updated the PR.

nastra (Contributor) left a review:

LGTM with one suggestion

Commit: …java

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
ashvina left a review:

I have a minor comment; otherwise, LGTM.

// Create a default hadoopConf as it is required for the object to be valid.
// E.g. newInputFile would throw NPE with getConf() otherwise.
if (hadoopConf == null) {
  this.hadoopConf = new SerializableConfiguration(new Configuration())::get;
}
ashvina commented on the diff:

Is there a potential for a race condition here? Do you see a need to add synchronization to protect this instantiation?

stevenzwu (Contributor, Author) replied:

Yeah, FileIO can potentially be used by multiple threads. Let me add synchronization.
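For reference, a sketch of the resulting lazy initialization with double-checked locking (details may differ slightly from the merged commit):

    @Override
    public Configuration getConf() {
      // Lazily create a default conf so a deserialized instance stays valid;
      // e.g. newInputFile would otherwise throw an NPE.
      if (hadoopConf == null) {
        synchronized (this) {
          if (hadoopConf == null) {
            this.hadoopConf = new SerializableConfiguration(new Configuration())::get;
          }
        }
      }

      return hadoopConf.get();
    }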

@stevenzwu stevenzwu merged commit f4ffe13 into apache:main Oct 17, 2024
49 checks passed
@stevenzwu stevenzwu deleted the fix-hadoop-fileio-initialize branch October 17, 2024 21:31
stevenzwu (Contributor, Author) commented:

Thanks @nastra, @pvary, @Fokko, @rdblue, and @ashvina for the review!

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
…opFileIO because FileIOParser doesn't serialize Hadoop configuration (apache#10926)

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>