
Core: fix NPE with HadoopFileIO because FileIOParser doesn't serialize Hadoop configuration #10926

Merged
5 commits merged into apache:main from fix-hadoop-fileio-initialize on Oct 17, 2024

Conversation


@stevenzwu stevenzwu commented Aug 13, 2024

This can happen in the FileIOParser serialization and deserialization scenario. FileIOParser doesn't serialize and deserialize the Hadoop configuration, so the deserialized io object is not valid because hadoopConf is null. An NPE would be thrown if newInputFile is called on the deserialized object.

This is a blocker for Flink to switch to the JSON parser for scan tasks of metadata tables (like manifests).
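For illustration, here is a minimal sketch of how the failure surfaces. The String-returning toJson overload and the example path are assumptions for this sketch:

    // Round-trip a HadoopFileIO through the JSON parser.
    HadoopFileIO io = new HadoopFileIO(new Configuration());

    // toJson writes only the impl class name and the string properties;
    // the Hadoop Configuration is dropped from the JSON.
    String json = FileIOParser.toJson(io);

    // fromJson(String) passes a null conf through, so the deserialized
    // instance is left with a null hadoopConf supplier.
    FileIO deserialized = FileIOParser.fromJson(json);

    // NPE here: newInputFile() calls hadoopConf.get() on the null supplier.
    deserialized.newInputFile("file:/tmp/metadata/v1.metadata.json");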

@stevenzwu stevenzwu requested a review from nastra August 13, 2024 04:12
@github-actions github-actions bot added the core label Aug 13, 2024
public HadoopFileIO() {
  // Create a default hadoopConf as it is required for the object to be valid.
  // E.g. newInputFile would throw NPE with hadoopConf.get() otherwise.
  this.hadoopConf = new SerializableConfiguration(new Configuration())::get;
}
stevenzwu (Contributor, Author) commented on the diff:

FileIOParser doesn't serialize and deserialize the Hadoop configuration, so the deserialized io object is not valid because hadoopConf is null. An NPE would be thrown if newInputFile is called on the deserialized object.

I'm not sure whether we want to change FileIOParser to serialize the Hadoop Configuration object or not. Regardless, this change seems reasonable, as this class requires hadoopConf to not be null.

nastra (Contributor) replied:

I thought about this problem a bit more. The one downside I see with this change is that HadoopFileIO will typically be instantiated via CatalogUtil.loadFileIO, which will then set the conf. Also, the javadoc of the constructor mentions that the conf must be set by calling setConf().

Instead of serializing the conf in the parser, maybe the conf should be initialized in FileIOParser.fromJson(String) instead of passing a null conf in? I think that would solve the issue while also not instantiating a new conf every time a new instance of HadoopFileIO is created.

stevenzwu (Contributor, Author) replied on Aug 15, 2024:

If the FileIO is Configurable, create a default Configuration object and call setConf()? Is that what you mean? That would add a Hadoop class dependency to FileIOParser, though.

I have also thought about implementing the serialization of the Hadoop Configuration in FileIOParser, but it would add some complexity to FileIOParser and we likely wouldn't achieve 100% fidelity with the original object. E.g., we may only serialize the key-value string pairs from the configuration as a JSON object. It would also have the problem of adding a Hadoop class dependency to FileIOParser.

I also thought about using this method from SerializationUtil, but it is Java serialization (not JSON serialization):

  /**
   * Serialize an object to bytes. If the object implements {@link HadoopConfigurable}, its Hadoop
   * configuration will be serialized into a {@link SerializableConfiguration}.
   *
   * @param obj object to serialize
   * @return serialized bytes
   */
  public static byte[] serializeToBytes(Object obj) {
    return serializeToBytes(obj, conf -> new SerializableConfiguration(conf)::get);
  }
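For comparison, the binary round trip does preserve the conf. A sketch, assuming deserializeFromBytes is the matching counterpart in SerializationUtil:

    // Java (not JSON) serialization round trip: the conf survives because
    // HadoopFileIO implements HadoopConfigurable, so its conf is wrapped in a
    // SerializableConfiguration before writing.
    byte[] bytes = SerializationUtil.serializeToBytes(hadoopFileIO);
    FileIO copy = SerializationUtil.deserializeFromBytes(bytes);
    copy.newInputFile("file:/tmp/metadata/v1.metadata.json"); // no NPE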

nastra (Contributor) replied:

Ah, you're right that this will add a Hadoop dependency, and we don't want that. Rather than setting a conf in the constructor, maybe we should have some handling in conf(), such as:

   public Configuration conf() {
-    return hadoopConf.get();
+    return Optional.ofNullable(hadoopConf).map(Supplier::get).orElseGet(Configuration::new);
   }

We would need to adjust all the places that call hadoopConf.get() to call conf() after this change. That way we wouldn't need to unnecessarily instantiate a conf in the constructor when we don't have to. Wdyt?

stevenzwu (Contributor, Author) replied:

The conf() method approach is fine.

FileIOParser doesn't faithfully serialize and deserialize HadoopFileIO (the Hadoop configuration is not carried over). The deserialized HadoopFileIO may be missing important configs, which can be a problem.

Looking at CatalogUtil, there are three components defining a FileIO, and FileIOParser is missing the conf:

FileIO loadFileIO(String impl, Map<String, String> properties, Object hadoopConf)

I am wondering if we should change FileIOParser to serialize and deserialize the Hadoop Configuration when the FileIO is HadoopConfigurable. We could probably serialize only the key-value string pairs from the Configuration as a JSON object (a kind of read-only copy).
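A rough sketch of that idea (the hadoop-conf field name and the helper are hypothetical, not part of FileIOParser today):

    // Hypothetical helper: write only the Configuration's key-value pairs as
    // a JSON object; Configuration is Iterable<Map.Entry<String, String>>.
    private static void writeHadoopConf(Configuration conf, JsonGenerator generator)
        throws IOException {
      generator.writeObjectFieldStart("hadoop-conf"); // illustrative field name
      for (Map.Entry<String, String> entry : conf) {
        generator.writeStringField(entry.getKey(), entry.getValue());
      }
      generator.writeEndObject();
    }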

BTW, ResolvingFileIO and HadoopConfigurable also have a Hadoop class dependency. There was a discussion of potentially moving HadoopCatalog to a separate iceberg-hadoop module; I guess we can't move ResolvingFileIO and HadoopConfigurable then.

stevenzwu (Contributor, Author) replied on Aug 29, 2024:

> I have seen that we use new Configuration(false) in the code, so we allow the user to provide a trimmed configuration; in this case the serialized config is quite small for the binary serialization. We might have to do something similar for the JSON serialization.

Are you saying we should trim out the key-value pairs that match the default configuration, as with new Configuration(false)? We can potentially do that, but it still has the implication that it depends on the runtime env: if the other side (deserialization) has a different env (default config), the result can differ.
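For reference, new Configuration(false) skips loading the default resources, so only explicitly set entries survive. A small illustrative check:

    Configuration trimmed = new Configuration(false); // no core-default.xml / core-site.xml
    trimmed.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical override
    System.out.println(trimmed.size());             // 1: only the explicit entry
    System.out.println(new Configuration().size()); // hundreds of host defaults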

> The ManifestListReadTask.rows() and ManifestListReadTask.file() are using the io to get the new input file, like io.newInputFile(manifestListLocation).

FileIO is used to read from manifest files. ManifestFiles.read is a widely used API:

    private CloseableIterable<? extends ContentFile<?>> files(Schema fileProjection) {
      switch (manifest.content()) {
        case DATA:
          return ManifestFiles.read(manifest, io, specsById).project(fileProjection);
        case DELETES:
          return ManifestFiles.readDeleteManifest(manifest, io, specsById).project(fileProjection);
...
    }

> FileIO serialization will become even more complicated when manifest file encryption arrives here. We will need to apply the encryption for the FileIO.

From the usage, EncryptingFileIO seems to be used only as a derived object; it is not part of the table ops state. It shouldn't need to be serialized: only the original FileIO and EncryptionManager need to be serialized.

  protected EncryptedOutputFile newManifestOutputFile() {
    String manifestFileLocation =
        ops.metadataFileLocation(
            FileFormat.AVRO.addExtension(commitUUID + "-m" + manifestCount.getAndIncrement()));
    return EncryptingFileIO.combine(ops.io(), ops.encryption())
        .newEncryptingOutputFile(manifestFileLocation);
  }

pvary (Contributor) replied:

> but it still has the implication that it depends on the runtime env: if the other side (deserialization) has a different env (default config), the result can differ.

Yes, but I think (not tested) we would suffer from the same issue if the configuration provided through the Catalog was created as Configuration(false).

> FileIO is used to read from manifest files. ManifestFiles.read is a widely used API.

Yes, but in this case the reader provides the FileIO independently from the task.

> only the original FileIO and EncryptionManager need to be serialized.

Yes. Encryption should be handled.

stevenzwu (Contributor, Author) replied:

@pvary if I understand correctly, your point is that FileIO probably shouldn't be part of the task state for BaseEntriesTable.ManifestReadTask and AllManifestsTable.ManifestListReadTask? Then we wouldn't need to serialize FileIO for those manifest-related tasks. That is a fair question, but it seems like a large/challenging refactoring. Looking for other folks' take on this.

Regardless, the issue remains that FileIOParser doesn't serialize HadoopFileIO faithfully. I don't know if the REST catalog will have any need to use it to JSON-serialize FileIO in the future.

pvary (Contributor) replied:

@stevenzwu:

> if I understand correctly, your point is that FileIO probably shouldn't be part of the task state for BaseEntriesTable.ManifestReadTask and AllManifestsTable.ManifestListReadTask? [..] It seems like a large/challenging refactoring. Looking for other folks' take on this.

Yes, this seems strange and problematic to me. I have not been able to find an easy solution yet. I was hoping others with better knowledge might have some ideas.

> Regardless, the issue remains that FileIOParser doesn't serialize HadoopFileIO faithfully. I don't know if the REST catalog will have any need to use it to JSON-serialize FileIO in the future.

My point here is that, since we use Configuration(false) in some cases, and given how the current serialization works, we already don't serialize HadoopFileIO faithfully. So if we don't find a solution for getting rid of the FileIO, we might as well write our own "unfaithful" serialization which mimics how the current serialization works.

stevenzwu (Contributor, Author) replied:

@pvary coming back to the size concern of serializing the Hadoop Configuration: that is already the case with Java serialization for the manifest tasks, so implementing JSON serialization for the Hadoop configuration would not make things worse.

Whether we can avoid the need for FileIO in the manifest tasks can be a separate, larger discussion.

@stevenzwu stevenzwu force-pushed the fix-hadoop-fileio-initialize branch 2 times, most recently from 3618660 to 8b0bfbe Compare August 13, 2024 17:56
@stevenzwu stevenzwu changed the title Core: create an empty Hadoop config if not provided in constructor Core: create an default Hadoop config if not provided in constructor Aug 13, 2024
@stevenzwu stevenzwu changed the title Core: create an default Hadoop config if not provided in constructor Core: create a default Hadoop config if not provided in constructor Aug 13, 2024
@stevenzwu stevenzwu changed the title Core: create a default Hadoop config if not provided in constructor Core: fix NPE with HadoopFileIO with Hadoop conf is not set Aug 16, 2024
@stevenzwu stevenzwu changed the title Core: fix NPE with HadoopFileIO with Hadoop conf is not set Core: fix NPE with HadoopFileIO because FileIOParser doesn't serialize Hadoop configuration Aug 28, 2024
@stevenzwu stevenzwu added this to the Iceberg 1.7.0 milestone Aug 29, 2024
rdblue (Contributor) commented Sep 27, 2024:

I don't think that we should change how this works. A Hadoop Configuration MUST be provided externally. FileIO serialization is not intended to send the entire Hadoop Configuration and should remain separate.

stevenzwu (Contributor, Author) commented Sep 27, 2024:

> I don't think that we should change how this works. A Hadoop Configuration MUST be provided externally.

This makes sense. We already have some consensus that the current PR needs to be updated; we just don't have consensus on how to fix the NPE problem where the deserialized HadoopFileIO has a null Hadoop Configuration object.

> FileIO serialization is not intended to send the entire Hadoop Configuration and should remain separate.

This is not super clear to me. What if users loaded the FileIO with customized/overridden properties in the Hadoop Configuration object? A default Hadoop Configuration object loaded on the receiving host won't contain those overrides.

CatalogUtil.loadFileIO(impl, properties, conf)

@pvary had a related concern about the size of the Hadoop Configuration entries if we serialize all the properties (most of them are default properties loaded from the host). What if we just serialize the overridden properties (assuming the sender and receiver sides have the same host-level Hadoop conf XML)?
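A sketch of that trimming idea (hypothetical; it assumes both sides load the same host defaults):

    // Hypothetical: keep only the entries that differ from the host defaults.
    Configuration defaults = new Configuration(); // host-level defaults
    Map<String, String> overrides = Maps.newHashMap();
    for (Map.Entry<String, String> entry : conf) {
      if (!entry.getValue().equals(defaults.get(entry.getKey()))) {
        overrides.put(entry.getKey(), entry.getValue());
      }
    }
    // overrides could then be written as a small JSON object next to properties.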

Anyway, the current NPE problem with the deserialized HadoopFileIO needs to be fixed somehow.

@rdblue, should we adopt @nastra's suggestion of loading the Hadoop Configuration in FileIOParser? The only downside is the class dependency on Hadoop jars, which could be a problem if we want to move all Hadoop-related classes to a separate iceberg-hadoop module; then FileIOParser couldn't live in iceberg-core anymore.

  public static FileIO fromJson(String json) {
-    return fromJson(json, null);
+    return fromJson(json, new Configuration());
  }

rdblue (Contributor) commented Sep 30, 2024:

There is already a way to pass the Configuration to the FileIOParser. It is passed as Object to avoid creating a dependency on Hadoop classes.

stevenzwu (Contributor, Author) commented:

> There is already a way to pass the Configuration to the FileIOParser. It is passed as Object to avoid creating a dependency on Hadoop classes.

Then we would need to load the Hadoop configuration from three other JSON parsers for the manifest and file tasks. Those classes are also in iceberg-core.
[screenshot omitted]

These two methods are the pair that callers use. It is probably not reasonable for callers to load the Hadoop configuration when calling fromJson:

public static void toJson(FileIO io, JsonGenerator generator) throws IOException
public static FileIO fromJson(String json)

Ideally, this method should load the Hadoop Configuration object if the impl is a HadoopFileIO and the conf is null.

public static FileIO fromJson(JsonNode json, Object conf) {
    Preconditions.checkArgument(json.isObject(), "Cannot parse FileIO from non-object: %s", json);
    String impl = JsonUtil.getString(FILE_IO_IMPL, json);
    Map<String, String> properties = JsonUtil.getStringMap(PROPERTIES, json);

    if (conf == null && impl.equals(HadoopFileIO.class.getCanonicalName())) {
      conf = new Configuration();   // load the Hadoop conf from host
    }

    return CatalogUtil.loadFileIO(impl, properties, conf);
  }

stevenzwu (Contributor, Author) commented:

Let me recap the problem and the potential options.

Problem: FileIOParser doesn't serialize the Hadoop Configuration for HadoopFileIO. As a result, the deserialized HadoopFileIO hits an NPE.

Option 1: FileIOParser.fromJson loads the Hadoop Configuration from the host if the input arg is null:

public static FileIO fromJson(JsonNode json, Object conf) {
    Preconditions.checkArgument(json.isObject(), "Cannot parse FileIO from non-object: %s", json);
    String impl = JsonUtil.getString(FILE_IO_IMPL, json);
    Map<String, String> properties = JsonUtil.getStringMap(PROPERTIES, json);

    if (conf == null && impl.equals(HadoopFileIO.class.getCanonicalName())) {
      conf = new Configuration();   // load the Hadoop conf from host
    }

    return CatalogUtil.loadFileIO(impl, properties, conf);
  }
  • pro: this patch requires minimal change and doesn't introduce any behavior change.
  • con: it would introduce Hadoop class dependency in FileIOParser, which could be problematic down the road for removing Hadoop dependency from iceberg-core.

However, I feel the Hadoop class dependency is inevitable today for two reasons: (1) HadoopFileIO uses both a Hadoop Configuration and Map<String, String> properties, while other FileIO implementations like S3FileIO depend only on Map<String, String> properties; (2) FileIOParser is designed to be able to serialize and deserialize all FileIO implementations (including HadoopFileIO).

Personally, I would prefer to go forward with this stop-gap solution for now while we gather more consensus on the long-term solutions (one listed below).

Option 2: change HadoopFileIO to use Map<String, String> properties as the only configuration input.

Internally, HadoopFileIO would load the default Hadoop Configuration from the host and apply the Map<String, String> properties as overrides; see the sketch after the pros and cons below.

  • pro: this would standardize the FileIO behavior and make FileIOParser JSON serialization uniform across all FileIO implementations. Note that the FileIO interface only defines a getter for string properties: Map<String, String> properties().
  • con: it is a breaking change that modifies the HadoopFileIO construction behavior.

Option 2 is a long-term solution IMO.
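A sketch of what Option 2 could look like inside HadoopFileIO (hypothetical; FileIO already defines initialize(Map<String, String>)):

    // Hypothetical Option 2: string properties become the only config input.
    @Override
    public void initialize(Map<String, String> properties) {
      Configuration conf = new Configuration(); // load host defaults
      properties.forEach(conf::set);            // apply properties as overrides
      this.hadoopConf = new SerializableConfiguration(conf)::get;
      this.properties = SerializableMap.copyOf(properties); // serializable copy
    }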

nastra (Contributor) commented Oct 8, 2024:

> Let me recap the problem and the potential options.
>
> Problem: FileIOParser doesn't serialize the Hadoop Configuration for HadoopFileIO. As a result, the deserialized HadoopFileIO hits an NPE.
>
> Option 1: FileIOParser.fromJson loads the Hadoop Configuration from the host if the input arg is null.

@stevenzwu for Option 1, wouldn't https://github.com/apache/iceberg/pull/10926/files#r1718243019 also solve the issue with the NPE without introducing a Hadoop dependency on the Parser itself?

stevenzwu (Contributor, Author) commented Oct 9, 2024:

> @stevenzwu for Option 1, wouldn't https://github.com/apache/iceberg/pull/10926/files#r1718243019 also solve the issue with the NPE without introducing a Hadoop dependency on the Parser itself?

Agreed, it is better to avoid the Hadoop dependency on FileIOParser.

I will make a small change to your proposal: cache the loaded Hadoop Configuration object in hadoopConf to avoid repeated loading.

@stevenzwu stevenzwu force-pushed the fix-hadoop-fileio-initialize branch from 8b0bfbe to 1dfcf72 Compare October 9, 2024 16:50
@@ -74,7 +74,7 @@ public HadoopFileIO(SerializableSupplier<Configuration> hadoopConf) {
   }

   public Configuration conf() {
-    return hadoopConf.get();
+    return getConf();
   }
nastra (Contributor) commented on the diff:

I don't think it's enough to just update this call. You'd basically need to update every place in this class that uses hadoopConf.get() to call conf() or getConf().

stevenzwu (Contributor, Author) replied:

You are absolutely correct; sorry that I missed this earlier. Updated the PR.

nastra (Contributor) left a review:

LGTM with one suggestion

Commit: …java

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
ashvina left a review:

I have a minor comment; otherwise, LGTM.

// Create a default hadoopConf as it is required for the object to be valid.
// E.g. newInputFile would throw NPE with getConf() otherwise.
if (hadoopConf == null) {
  this.hadoopConf = new SerializableConfiguration(new Configuration())::get;
}
ashvina commented on the diff:

Is there a potential for a race condition here? Do you see a need to add synchronization to protect this instantiation?

stevenzwu (Contributor, Author) replied:

Yeah, FileIO can potentially be used by multiple threads. Let me add synchronization.
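For reference, a sketch of the resulting lazy initialization with double-checked locking (details may differ slightly from the merged commit):

    @Override
    public Configuration getConf() {
      // Lazily create a default conf so a deserialized instance stays valid;
      // e.g. newInputFile would otherwise throw an NPE.
      if (hadoopConf == null) {
        synchronized (this) {
          if (hadoopConf == null) {
            this.hadoopConf = new SerializableConfiguration(new Configuration())::get;
          }
        }
      }

      return hadoopConf.get();
    }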

@stevenzwu stevenzwu merged commit f4ffe13 into apache:main Oct 17, 2024
49 checks passed
@stevenzwu stevenzwu deleted the fix-hadoop-fileio-initialize branch October 17, 2024 21:31
stevenzwu (Contributor, Author) commented:

Thanks @nastra, @pvary, @Fokko, @rdblue, and @ashvina for the review!

zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
…opFileIO because FileIOParser doesn't serialize Hadoop configuration (apache#10926)

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>