[#3379] feat(catalog-hadoop): Add S3 support for Fileset Hadoop catalog #4232

Merged: 33 commits into apache:main on Oct 21, 2024

Conversation

@xiaozcy (Contributor, author) commented on Jul 22, 2024:

What changes were proposed in this pull request?

Add S3 support for the Fileset Hadoop catalog. Actually we only add the hadoop-aws dependency; most of the work is in the tests.

Why are the changes needed?

Fix: #3379

Does this PR introduce any user-facing change?

No.

How was this patch tested?

IT.

@yuqi1129 (Contributor) commented:

@xiaozcy Please resolve the conflicts if you are free.

gradle/libs.versions.toml: review thread (outdated, resolved)
zhanghan18 added 2 commits on July 22, 2024 (merge: resolved conflicts in catalogs/catalog-hadoop/build.gradle.kts)
@jerryshao (Contributor) commented:

@FANNG1 can you please also take a look at this?

Besides, I think it would be better for the S3-related configurations to be catalog/schema/fileset properties. Such properties are important to make filesets on S3 work, so we'd better not hide them in the Hadoop site XML.

@FANNG1 (Contributor) commented on Jul 24, 2024:

> @FANNG1 can you please also take a look at this?
>
> Besides, I think it would be better for the S3-related configurations to be catalog/schema/fileset properties. Such properties are important to make filesets on S3 work, so we'd better not hide them in the Hadoop site XML.

@xiaozcy could you provide a document about how to make S3 work for the Fileset Hadoop catalog?

Review comment on the new HadoopCatalogS3IT test class:

```java
import org.slf4j.LoggerFactory;

@Tag("gravitino-docker-test")
public class HadoopCatalogS3IT extends AbstractIT {
```

Ideally, we could do some abstraction and extract the shared test logic into HadoopCatalogCommonIT, with HadoopCatalogHDFSIT and HadoopCatalogS3IT only setting up their own environments, like SparkCommonIT and SparkHiveCatalogIT.
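A minimal sketch of the kind of split being suggested; the class names and property keys below are illustrative, not the actual Gravitino test classes:

```java
import java.util.Map;
import org.junit.jupiter.api.Test;

// Illustrative only: shared fileset tests live in a common base class,
// while each storage flavour contributes only its own environment setup.
abstract class HadoopCatalogCommonITSketch {

  // Each subclass supplies the storage-specific location and catalog properties.
  protected abstract String storageLocation();

  protected abstract Map<String, String> catalogProperties();

  @Test
  void testCreateAndLoadFileset() {
    // common test logic that runs against storageLocation() ...
  }
}

class HadoopCatalogS3ITSketch extends HadoopCatalogCommonITSketch {

  @Override
  protected String storageLocation() {
    return "s3a://my-test-bucket/fileset"; // bucket name is a placeholder
  }

  @Override
  protected Map<String, String> catalogProperties() {
    // Placeholder keys; the real names depend on the final design in this PR.
    return Map.of("fs.s3a.access.key", "<access-key>", "fs.s3a.secret.key", "<secret-key>");
  }
}
```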

@jerryshao (Contributor) commented:

@xiaozcy can you please address the comments here?

@xiaozcy (Contributor, author) commented on Aug 1, 2024:

@jerryshao, sorry for the late reply. I have already upgraded the Hadoop version and done some abstraction of the IT, and I'm still working on managing some S3-related configurations in Gravitino.

@xiaozcy (Contributor, author) commented on Aug 5, 2024:

To make filesets work on S3, we may have to add configurations like fs.s3a.access.key and fs.s3a.secret.key to the Hadoop conf. I'm not sure whether we should add another authentication type or simply add them to catalog/schema/fileset properties. What's your opinion on this? @jerryshao @yuqi1129
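For context, a minimal sketch of what wiring these S3A keys into a Hadoop configuration looks like; the bucket name, endpoint, and credential values are placeholders:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AConfSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials consumed by the S3A connector that ships with hadoop-aws.
    conf.set("fs.s3a.access.key", "<access-key>");
    conf.set("fs.s3a.secret.key", "<secret-key>");
    // Optional: a non-AWS endpoint, e.g. a local MinIO used in tests.
    conf.set("fs.s3a.endpoint", "http://localhost:9000");

    // The catalog would resolve a fileset's s3a:// location with this configuration.
    FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/warehouse"), conf);
    System.out.println(fs.exists(new Path("s3a://my-bucket/warehouse/fileset1")));
  }
}
```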

@jerryshao (Contributor) commented:

> To make filesets work on S3, we may have to add configurations like fs.s3a.access.key and fs.s3a.secret.key to the Hadoop conf. I'm not sure whether we should add another authentication type or simply add them to catalog/schema/fileset properties. What's your opinion on this? @jerryshao @yuqi1129

@yuqi1129 what do you think?

@yuqi1129 (Contributor) commented on Aug 5, 2024:

> To make filesets work on S3, we may have to add configurations like fs.s3a.access.key and fs.s3a.secret.key to the Hadoop conf. I'm not sure whether we should add another authentication type or simply add them to catalog/schema/fileset properties. What's your opinion on this? @jerryshao @yuqi1129
>
> @yuqi1129 what do you think?

I think it would be best to add a flag to clearly indicate the type of authentication we will be using. The Gravitino fileset currently supports simple and Kerberos authentication. Once we set the type, we can verify that the required properties have been provided before initializing the corresponding SDK client.
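A rough sketch of the kind of validation being suggested; the `s3` type and the S3 property keys below are placeholders, not the final Gravitino names (the Kerberos keys are taken from the property table below):

```java
import java.util.Map;

public class AuthPropertyCheckSketch {

  static void validate(Map<String, String> properties) {
    String type = properties.getOrDefault("authentication.type", "simple");
    switch (type) {
      case "kerberos":
        require(properties, "authentication.kerberos.principal");
        require(properties, "authentication.kerberos.keytab-uri");
        break;
      case "s3": // a possible new type for S3 credentials (hypothetical)
        require(properties, "s3-access-key-id");     // placeholder key name
        require(properties, "s3-secret-access-key"); // placeholder key name
        break;
      case "simple":
      default:
        break; // no extra properties required
    }
  }

  static void require(Map<String, String> properties, String key) {
    String value = properties.get(key);
    if (value == null || value.isEmpty()) {
      throw new IllegalArgumentException("Missing required property: " + key);
    }
  }
}
```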

Besides the [common catalog properties](./gravitino-server-config.md#gravitino-catalog-properties-configuration), the Hadoop catalog has the following properties:
| Property Name | Description | Default Value | Required | Since Version |
|----------------------------------------------------|------------------------------------------------------------------------------------------------|---------------|-------------------------------------------------------------|---------------|
| `location` | The storage location managed by Hadoop catalog. | (none) | No | 0.5.0 |
| `authentication.impersonation-enable` | Whether to enable impersonation for the Hadoop catalog. | `false` | No | 0.5.1 |
| `authentication.type` | The type of authentication for Hadoop catalog, currently we only support `kerberos`, `simple`. | `simple` | No | 0.5.1 |
| `authentication.kerberos.principal`                | The principal of the Kerberos authentication.                                                    | (none)        | Required if the value of `authentication.type` is `kerberos`. | 0.5.1         |
| `authentication.kerberos.keytab-uri`               | The URI of the keytab for the Kerberos authentication.                                           | (none)        | Required if the value of `authentication.type` is `kerberos`. | 0.5.1         |
| `authentication.kerberos.check-interval-sec` | The check interval of Kerberos credential for Hadoop catalog. | 60 | No | 0.5.1 |
| `authentication.kerberos.keytab-fetch-timeout-sec` | The fetch timeout of retrieving Kerberos keytab from `authentication.kerberos.keytab-uri`. | 60 | No | 0.5.1 |
### Authentication for Hadoop Catalog
The Hadoop catalog supports multi-level authentication to control access, allowing different authentication settings for the catalog, schema, and fileset. The priority of authentication settings is as follows: catalog < schema < fileset. Specifically:
- **Catalog**: The default authentication is `simple`.
- **Schema**: Inherits the authentication setting from the catalog if not explicitly set. For more information about schema settings, please refer to [Schema properties](#schema-properties).
- **Fileset**: Inherits the authentication setting from the schema if not explicitly set. For more information about fileset settings, please refer to [Fileset properties](#fileset-properties).
The default value of `authentication.impersonation-enable` is `false` for catalogs; for schemas and filesets, the default value is inherited from the parent. A value set by the user will override the parent value, and the priority mechanism is the same as for authentication.
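A small sketch of the inheritance rule described above (fileset settings win over schema settings, which win over catalog settings); this is an illustration, not the actual Gravitino implementation:

```java
import java.util.Map;
import java.util.Optional;

public class AuthSettingResolutionSketch {

  // Resolve a property with fileset > schema > catalog priority,
  // falling back to a default when none of them sets it.
  static String resolve(
      String key,
      Map<String, String> catalogProps,
      Map<String, String> schemaProps,
      Map<String, String> filesetProps,
      String defaultValue) {
    return Optional.ofNullable(filesetProps.get(key))
        .or(() -> Optional.ofNullable(schemaProps.get(key)))
        .or(() -> Optional.ofNullable(catalogProps.get(key)))
        .orElse(defaultValue);
  }

  public static void main(String[] args) {
    Map<String, String> catalog = Map.of("authentication.type", "simple");
    Map<String, String> schema = Map.of("authentication.type", "kerberos");
    Map<String, String> fileset = Map.of();

    // The fileset inherits "kerberos" from its schema, which overrode the catalog default.
    System.out.println(resolve("authentication.type", catalog, schema, fileset, "simple"));
  }
}
```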

@xiaozcy (Contributor, author) commented on Aug 7, 2024:

@yuqi1129, could you help review this again?

@yuqi1129 (Contributor) commented on Aug 7, 2024:

> @yuqi1129, could you help review this again?

Sure.

@xiaozcy (Contributor, author) commented on Aug 7, 2024:

```java
UserContext userContext =
    UserContext.getUserContext(
        ident, properties, null, hadoopCatalogOperations.getCatalogInfo());
```

I wonder why the value passed for the Hadoop conf is null in places like this. This way, we cannot pass some configurations at the schema/fileset level. Should we use hadoopCatalogOperations.getHadoopConf() instead?
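The change being suggested would look roughly like this, assuming the third argument of getUserContext is the Hadoop configuration to apply:

```java
// Sketch: pass the catalog's Hadoop configuration instead of null so that
// schema/fileset-level settings can be merged into it.
UserContext userContext =
    UserContext.getUserContext(
        ident,
        properties,
        hadoopCatalogOperations.getHadoopConf(),
        hadoopCatalogOperations.getCatalogInfo());
```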

@jerryshao (Contributor) commented:

No, I would rather have a thorough solution beforehand. Supporting one cloud storage is easy, but as we add more, maintaining them becomes a heavy burden; that burden should be carried by the community, not the user. The solution should not only cover server-side Fileset catalog support; the client-side GVFS and a unified configuration approach should also be part of it.

So I would suggest we write a complete design doc about how to support multiple cloud storages, covering both the server- and client-side solutions, and discuss based on that doc. A design doc will make us clearer and more thorough about how to support different storages well. Currently, all our discussion is around S3, but what if the current solution cannot fit ADLS or GCS? How do we handle that? Besides, whether a pluggable framework should be introduced should be decided based on thorough investigation and discussion, not just on whether something is urgent or blocking.

@xloya (Contributor) commented on Aug 15, 2024:

Hi all, let me share some actual production cases. Since I'm mainly responsible for the integration between Gravitino Fileset and internal engines and platforms, I did encounter some actual dependency-conflict problems during the rollout.

1. On the server side: our current setup supports both HDFS and MiFS (an internal storage that supports various object storages) in a single Fileset catalog. When introducing MiFS, the biggest problem we hit during integration testing was that the MiFS dependencies conflict with some Gravitino dependencies. We had to take two approaches:
   a. Contact the MiFS R&D team to shade these dependencies (for public cloud storage, I think this is not possible to do).
   b. Exclude the MiFS dependencies during integration testing, which means we cannot run MiFS integration tests.
2. In the GVFS client: since it needs to be used in Spark, we tried adding the MiFS dependencies directly to GVFS, but this can still conflict with some Spark dependencies. We had to shade these dependencies in GVFS, and some of them simply cannot be shaded. Therefore, the solution we finally adopted is to keep the MiFS dependencies independent of the GVFS client and ask the MiFS developers to shade them before introducing them separately in Spark.

What I want to say is that as the number of supported storages grows, dependency conflicts are inevitable, so on-demand loading may be a more reasonable approach. I still hope that one catalog can support multiple storage types, with the supported types determined by the maintainer.
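One way to read "on-demand loading" is to resolve a storage's FileSystem implementation only when a fileset actually uses that scheme, so an unused provider jar never needs to be on the classpath. A hedged sketch of that idea (the class name is illustrative, not Gravitino code):

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Illustrative only: lazily create FileSystem instances per URI scheme so that
// a provider jar (e.g. hadoop-aws for s3a) is required only when that scheme is used.
public class LazyFileSystemRegistrySketch {

  private final Map<String, FileSystem> cacheByScheme = new ConcurrentHashMap<>();
  private final Configuration conf;

  public LazyFileSystemRegistrySketch(Configuration conf) {
    this.conf = conf;
  }

  public FileSystem forLocation(String location) {
    URI uri = URI.create(location);
    return cacheByScheme.computeIfAbsent(uri.getScheme(), scheme -> {
      try {
        // FileSystem.get resolves the implementation class (e.g. S3AFileSystem)
        // only here; if the jar is missing, only this scheme fails to load.
        return FileSystem.get(uri, conf);
      } catch (Exception e) {
        throw new IllegalStateException("No FileSystem available for scheme " + scheme, e);
      }
    });
  }
}
```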

zhanghan18 added 4 commits on September 9, 2024 (merge: resolved conflicts in catalogs/catalog-hadoop/build.gradle.kts and gradle/libs.versions.toml)
@FANNG1 (Contributor) commented on Sep 13, 2024
yuqi1129 requested a review from jerryshao on October 19, 2024
Review thread on build.gradle.kts (outdated), lines 793 to 795:

```kotlin
":bundles:aliyun-bundle:copyLibAndConfig",
":bundles:aws-bundle:copyLibAndConfig",
":bundles:gcp-bundle:copyLibAndConfig"
```

Why do we need this? We will not ship these jars with the Gravitino binary; if you want to use them for tests, you'd better figure out a different way.

OK, let me try another way to automatically copy the bundle jars when testing S3 in deploy mode.

@yuqi1129 commented on Oct 21, 2024:

Removed; I replaced it by adding a test dependency in hadoop-catalog and hadoop3-filesystem.

yuqi1129 requested reviews from jerryshao and FANNG1 on October 21, 2024
jerryshao merged commit f69bdaf into apache:main on Oct 21, 2024. 26 checks passed.
mplmoknijb pushed a commit to mplmoknijb/gravitino referencing this pull request on Nov 6, 2024: "… catalog (apache#4232)"


Co-authored-by: zhanghan18 <[email protected]>
Co-authored-by: yuqi <[email protected]>

Successfully merging this pull request may close these issues.

[FEATURE] Add S3 support for Fileset Hadoop catalog
6 participants