[#3379] feat(catalog-hadoop): Add S3 support for Fileset Hadoop catalog #4232

Merged: 33 commits into apache:main on Oct 21, 2024

Conversation

@xiaozcy (Contributor, author) commented on Jul 22, 2024:

What changes were proposed in this pull request?

Add S3 support for the Fileset Hadoop catalog. Actually we only add the hadoop-aws dependency; most of the work is in the tests.

Why are the changes needed?

Fix: #3379

Does this PR introduce any user-facing change?

No.

How was this patch tested?

IT.

@yuqi1129 (Contributor) commented:

@xiaozcy Please resolve the conflicts if you are free.

gradle/libs.versions.toml: review thread (outdated, resolved)
zhanghan18 added 2 commits on July 22, 2024 (merge: resolved conflicts in catalogs/catalog-hadoop/build.gradle.kts)
@jerryshao (Contributor) commented:

@FANNG1 can you please also take a look at this?

Besides, I think it would be better for the S3-related configurations to be catalog/schema/fileset properties. Such properties are important to make filesets on S3 work, so we'd better not hide them in the Hadoop site XML.

@FANNG1 (Contributor) commented on Jul 24, 2024:

> @FANNG1 can you please also take a look at this?
>
> Besides, I think it would be better for the S3-related configurations to be catalog/schema/fileset properties. Such properties are important to make filesets on S3 work, so we'd better not hide them in the Hadoop site XML.

@xiaozcy could you provide a document about how to make S3 work for the Fileset Hadoop catalog?

Review comment on the new HadoopCatalogS3IT test class:

```java
import org.slf4j.LoggerFactory;

@Tag("gravitino-docker-test")
public class HadoopCatalogS3IT extends AbstractIT {
```

Ideally, we could do some abstraction and extract the shared test logic into HadoopCatalogCommonIT, with HadoopCatalogHDFSIT and HadoopCatalogS3IT only setting up their own environments, like SparkCommonIT and SparkHiveCatalogIT.
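A minimal sketch of the kind of split being suggested; the class names and property keys below are illustrative, not the actual Gravitino test classes:

```java
import java.util.Map;
import org.junit.jupiter.api.Test;

// Illustrative only: shared fileset tests live in a common base class,
// while each storage flavour contributes only its own environment setup.
abstract class HadoopCatalogCommonITSketch {

  // Each subclass supplies the storage-specific location and catalog properties.
  protected abstract String storageLocation();

  protected abstract Map<String, String> catalogProperties();

  @Test
  void testCreateAndLoadFileset() {
    // common test logic that runs against storageLocation() ...
  }
}

class HadoopCatalogS3ITSketch extends HadoopCatalogCommonITSketch {

  @Override
  protected String storageLocation() {
    return "s3a://my-test-bucket/fileset"; // bucket name is a placeholder
  }

  @Override
  protected Map<String, String> catalogProperties() {
    // Placeholder keys; the real names depend on the final design in this PR.
    return Map.of("fs.s3a.access.key", "<access-key>", "fs.s3a.secret.key", "<secret-key>");
  }
}
```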

@jerryshao (Contributor) commented:

@xiaozcy can you please address the comments here?

@xiaozcy (Contributor, author) commented on Aug 1, 2024:

@jerryshao, sorry for the late reply. I have already upgraded the Hadoop version and done some abstraction of the IT, and I'm still working on managing some S3-related configurations in Gravitino.

@xiaozcy (Contributor, author) commented on Aug 5, 2024:

To make filesets work on S3, we may have to add configurations like fs.s3a.access.key and fs.s3a.secret.key to the Hadoop conf. I'm not sure whether we should add another authentication type or simply add them to catalog/schema/fileset properties. What's your opinion on this? @jerryshao @yuqi1129
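For context, a minimal sketch of what wiring these S3A keys into a Hadoop configuration looks like; the bucket name, endpoint, and credential values are placeholders:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AConfSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials consumed by the S3A connector that ships with hadoop-aws.
    conf.set("fs.s3a.access.key", "<access-key>");
    conf.set("fs.s3a.secret.key", "<secret-key>");
    // Optional: a non-AWS endpoint, e.g. a local MinIO used in tests.
    conf.set("fs.s3a.endpoint", "http://localhost:9000");

    // The catalog would resolve a fileset's s3a:// location with this configuration.
    FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/warehouse"), conf);
    System.out.println(fs.exists(new Path("s3a://my-bucket/warehouse/fileset1")));
  }
}
```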

@jerryshao (Contributor) commented:

> To make filesets work on S3, we may have to add configurations like fs.s3a.access.key and fs.s3a.secret.key to the Hadoop conf. I'm not sure whether we should add another authentication type or simply add them to catalog/schema/fileset properties. What's your opinion on this? @jerryshao @yuqi1129

@yuqi1129 what do you think?

@yuqi1129 (Contributor) commented on Aug 5, 2024:

> To make filesets work on S3, we may have to add configurations like fs.s3a.access.key and fs.s3a.secret.key to the Hadoop conf. I'm not sure whether we should add another authentication type or simply add them to catalog/schema/fileset properties. What's your opinion on this? @jerryshao @yuqi1129
>
> @yuqi1129 what do you think?

I think it would be best to add a flag to clearly indicate the type of authentication we will be using. The Gravitino fileset currently supports simple and Kerberos authentication. Once we set the type, we can verify that the required properties have been provided before initializing the corresponding SDK client.
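A rough sketch of the kind of validation being suggested; the `s3` type and the S3 property keys below are placeholders, not the final Gravitino names (the Kerberos keys are taken from the property table below):

```java
import java.util.Map;

public class AuthPropertyCheckSketch {

  static void validate(Map<String, String> properties) {
    String type = properties.getOrDefault("authentication.type", "simple");
    switch (type) {
      case "kerberos":
        require(properties, "authentication.kerberos.principal");
        require(properties, "authentication.kerberos.keytab-uri");
        break;
      case "s3": // a possible new type for S3 credentials (hypothetical)
        require(properties, "s3-access-key-id");     // placeholder key name
        require(properties, "s3-secret-access-key"); // placeholder key name
        break;
      case "simple":
      default:
        break; // no extra properties required
    }
  }

  static void require(Map<String, String> properties, String key) {
    String value = properties.get(key);
    if (value == null || value.isEmpty()) {
      throw new IllegalArgumentException("Missing required property: " + key);
    }
  }
}
```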

Besides the [common catalog properties](./gravitino-server-config.md#gravitino-catalog-properties-configuration), the Hadoop catalog has the following properties:
| Property Name | Description | Default Value | Required | Since Version |
|----------------------------------------------------|------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------------------|---------------|
| `location` | The storage location managed by Hadoop catalog. | (none) | No | 0.5.0 |
| `authentication.impersonation-enable` | Whether to enable impersonation for the Hadoop catalog. | `false` | No | 0.5.1 |
| `authentication.type` | The type of authentication for Hadoop catalog, currently we only support `kerberos`, `simple`. | `simple` | No | 0.5.1 |
| `authentication.kerberos.principal`                | The principal of the Kerberos authentication.                                                    | (none)        | Required if the value of `authentication.type` is `kerberos`. | 0.5.1         |
| `authentication.kerberos.keytab-uri`               | The URI of the keytab for the Kerberos authentication.                                           | (none)        | Required if the value of `authentication.type` is `kerberos`. | 0.5.1         |
| `authentication.kerberos.check-interval-sec` | The check interval of Kerberos credential for Hadoop catalog. | 60 | No | 0.5.1 |
| `authentication.kerberos.keytab-fetch-timeout-sec` | The fetch timeout of retrieving Kerberos keytab from `authentication.kerberos.keytab-uri`. | 60 | No | 0.5.1 |
### Authentication for Hadoop Catalog
The Hadoop catalog supports multi-level authentication to control access, allowing different authentication settings for the catalog, schema, and fileset. The priority of authentication settings is as follows: catalog < schema < fileset. Specifically:
- **Catalog**: The default authentication is `simple`.
- **Schema**: Inherits the authentication setting from the catalog if not explicitly set. For more information about schema settings, please refer to [Schema properties](#schema-properties).
- **Fileset**: Inherits the authentication setting from the schema if not explicitly set. For more information about fileset settings, please refer to [Fileset properties](#fileset-properties).
The default value of `authentication.impersonation-enable` is `false` for catalogs; for schemas and filesets, the default value is inherited from the parent. A value set by the user will override the parent value, and the priority mechanism is the same as for authentication.
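A small sketch of the inheritance rule described above (fileset settings win over schema settings, which win over catalog settings); this is an illustration, not the actual Gravitino implementation:

```java
import java.util.Map;
import java.util.Optional;

public class AuthSettingResolutionSketch {

  // Resolve a property with fileset > schema > catalog priority,
  // falling back to a default when none of them sets it.
  static String resolve(
      String key,
      Map<String, String> catalogProps,
      Map<String, String> schemaProps,
      Map<String, String> filesetProps,
      String defaultValue) {
    return Optional.ofNullable(filesetProps.get(key))
        .or(() -> Optional.ofNullable(schemaProps.get(key)))
        .or(() -> Optional.ofNullable(catalogProps.get(key)))
        .orElse(defaultValue);
  }

  public static void main(String[] args) {
    Map<String, String> catalog = Map.of("authentication.type", "simple");
    Map<String, String> schema = Map.of("authentication.type", "kerberos");
    Map<String, String> fileset = Map.of();

    // The fileset inherits "kerberos" from its schema, which overrode the catalog default.
    System.out.println(resolve("authentication.type", catalog, schema, fileset, "simple"));
  }
}
```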

@xiaozcy (Contributor, author) commented on Aug 7, 2024:

@yuqi1129, could you help review this again?

@yuqi1129 (Contributor) commented on Aug 7, 2024:

> @yuqi1129, could you help review this again?

Sure.

@xiaozcy (Contributor, author) commented on Aug 7, 2024:

```java
UserContext userContext =
    UserContext.getUserContext(
        ident, properties, null, hadoopCatalogOperations.getCatalogInfo());
```

I wonder why the value passed for the Hadoop conf is null in places like this. This way, we cannot pass some configurations at the schema/fileset level. Should we use hadoopCatalogOperations.getHadoopConf() instead?
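The change being suggested would look roughly like this, assuming the third argument of getUserContext is the Hadoop configuration to apply:

```java
// Sketch: pass the catalog's Hadoop configuration instead of null so that
// schema/fileset-level settings can be merged into it.
UserContext userContext =
    UserContext.getUserContext(
        ident,
        properties,
        hadoopCatalogOperations.getHadoopConf(),
        hadoopCatalogOperations.getCatalogInfo());
```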

@jerryshao (Contributor) commented:

No, I would rather have a thorough solution beforehand. Supporting one cloud storage is easy, but as we add more, maintaining them becomes a heavy burden; that burden should be carried by the community, not the user. The solution should not only cover server-side Fileset catalog support; the client-side GVFS and a unified configuration approach should also be part of it.

So I would suggest we write a complete design doc about how to support multiple cloud storages, covering both the server- and client-side solutions, and discuss based on that doc. A design doc will make us clearer and more thorough about how to support different storages well. Currently, all our discussion is around S3, but what if the current solution cannot fit ADLS or GCS? How do we handle that? Besides, whether a pluggable framework should be introduced should be decided based on thorough investigation and discussion, not just on whether something is urgent or blocking.

@xloya (Contributor) commented on Aug 15, 2024:

Hi all, let me share some actual production cases. Since I'm mainly responsible for the integration between Gravitino Fileset and internal engines and platforms, I did encounter some actual dependency-conflict problems during the rollout.

1. On the server side: our current setup supports both HDFS and MiFS (an internal storage that supports various object storages) in a single Fileset catalog. When introducing MiFS, the biggest problem we hit during integration testing was that the MiFS dependencies conflict with some Gravitino dependencies. We had to take two approaches:
   a. Contact the MiFS R&D team to shade these dependencies (for public cloud storage, I think this is not possible to do).
   b. Exclude the MiFS dependencies during integration testing, which means we cannot run MiFS integration tests.
2. In the GVFS client: since it needs to be used in Spark, we tried adding the MiFS dependencies directly to GVFS, but this can still conflict with some Spark dependencies. We had to shade these dependencies in GVFS, and some of them simply cannot be shaded. Therefore, the solution we finally adopted is to keep the MiFS dependencies independent of the GVFS client and ask the MiFS developers to shade them before introducing them separately in Spark.

What I want to say is that as the number of supported storages grows, dependency conflicts are inevitable, so on-demand loading may be a more reasonable approach. I still hope that one catalog can support multiple storage types, with the supported types determined by the maintainer.
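One way to read "on-demand loading" is to resolve a storage's FileSystem implementation only when a fileset actually uses that scheme, so an unused provider jar never needs to be on the classpath. A hedged sketch of that idea (the class name is illustrative, not Gravitino code):

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Illustrative only: lazily create FileSystem instances per URI scheme so that
// a provider jar (e.g. hadoop-aws for s3a) is required only when that scheme is used.
public class LazyFileSystemRegistrySketch {

  private final Map<String, FileSystem> cacheByScheme = new ConcurrentHashMap<>();
  private final Configuration conf;

  public LazyFileSystemRegistrySketch(Configuration conf) {
    this.conf = conf;
  }

  public FileSystem forLocation(String location) {
    URI uri = URI.create(location);
    return cacheByScheme.computeIfAbsent(uri.getScheme(), scheme -> {
      try {
        // FileSystem.get resolves the implementation class (e.g. S3AFileSystem)
        // only here; if the jar is missing, only this scheme fails to load.
        return FileSystem.get(uri, conf);
      } catch (Exception e) {
        throw new IllegalStateException("No FileSystem available for scheme " + scheme, e);
      }
    });
  }
}
```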

zhanghan18 added 4 commits on September 9, 2024 (merge: resolved conflicts in catalogs/catalog-hadoop/build.gradle.kts and gradle/libs.versions.toml)
@FANNG1 (Contributor) commented on Sep 13, 2024
yuqi1129 requested a review from jerryshao on October 19, 2024
Review thread on build.gradle.kts (outdated), lines 793 to 795:

```kotlin
":bundles:aliyun-bundle:copyLibAndConfig",
":bundles:aws-bundle:copyLibAndConfig",
":bundles:gcp-bundle:copyLibAndConfig"
```

Why do we need this? We will not ship these jars with the Gravitino binary; if you want to use them for tests, you'd better figure out a different way.

OK, let me try another way to automatically copy the bundle jars when testing S3 in deploy mode.

@yuqi1129 commented on Oct 21, 2024:

Removed; I replaced it by adding a test dependency in hadoop-catalog and hadoop3-filesystem.

yuqi1129 requested reviews from jerryshao and FANNG1 on October 21, 2024
jerryshao merged commit f69bdaf into apache:main on Oct 21, 2024. 26 checks passed.
mplmoknijb pushed a commit to mplmoknijb/gravitino referencing this pull request on Nov 6, 2024: "… catalog (apache#4232)"


Co-authored-by: zhanghan18 <[email protected]>
Co-authored-by: yuqi <[email protected]>

Successfully merging this pull request may close these issues.

[FEATURE] Add S3 support for Fileset Hadoop catalog
6 participants