[#5074] feat(hadoop-catalog): Support GCS fileset. #5079

Merged
merged 65 commits on Oct 17, 2024

Conversation

yuqi1129
Contributor

@yuqi1129 yuqi1129 commented Oct 9, 2024

What changes were proposed in this pull request?

  1. Add a bundled jar for Hadoop GCS.
  2. Support GCS in Hadoop catalog.

Why are the changes needed?

Fileset support for GCS storage is in high demand among users.

Fix: #5074

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Manually, please see: HadoopGCPCatalogIT

@yuqi1129
Contributor Author

yuqi1129 commented Oct 9, 2024

This PR is a follow-up of #5020 and is not ready for review until #5020 is merged.

@yuqi1129
Contributor Author

The related documentation will be added in a separate PR.

@yuqi1129 yuqi1129 requested review from xloya and jerryshao October 16, 2024 11:07

@Tag("gravitino-docker-test")
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
@Disabled(
Contributor

I think that if ITs are not possible without a valid account, we can add some unit tests for GCS. As far as I know, Iceberg has already implemented this; you can refer to https://github.com/apache/iceberg/blob/main/gcp/src/test/java/org/apache/iceberg/gcp/gcs/GCSFileIOTest.java. I haven't verified the feasibility, though, so please help confirm it.

Contributor Author

Let me check it.

build.gradle.kts Outdated
@@ -764,7 +764,7 @@ tasks {
 !it.name.startsWith("integration-test") &&
 !it.name.startsWith("flink") &&
 !it.name.startsWith("trino-connector") &&
-it.name != "hive-metastore-common"
+it.name != "hive-metastore-common" && it.name != "gcs-bundle"
Contributor

gcp-bundle, not gcs

@Disabled(
"Disabled due to as we don't have a real GCP account to test. If you have a GCP account,"
+ "please change the configuration(YOUR_KEY_FILE, YOUR_BUCKET) and enable this test.")
public class HadoopGCPCatalogIT extends HadoopCatalogIT {
Contributor

You need to clarify the difference between gcp and gcs.

}

dependencies {
compileOnly(project(":catalogs:catalog-hadoop"))
Contributor

I guess compileOnly is not enough for gvfs.

Contributor Author

I have added this dependency as implementation in the filesystem-hadoop3 module, so I believe it's unnecessary for the gcs-bundle jar.

Contributor

OK.

public static final String SERVICE_ACCOUNT_FILE = "YOUR_KEY_FILE";

@BeforeAll
public void setup() throws IOException {
Contributor

It would be better to also have a gvfs test for GCS.

Contributor Author

added.

@jerryshao
Contributor

How does a user use gcp-bundle with gvfs? Can you please give me an example?

@yuqi1129
Contributor Author

yuqi1129 commented Oct 17, 2024

@jerryshao

I have verified the code. On the client side, users should include the following dependencies if they want to use a GCS fileset:

         <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.1.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs-client</artifactId>
            <version>3.1.0</version>
        </dependency>
       
        <dependency>
            <groupId>org.apache.gravitino</groupId>
            <artifactId>gcp-bundle</artifactId>
            <version>0.7.0-incubating-SNAPSHOT</version>
        </dependency>

        <dependency>
            <groupId>org.apache.gravitino</groupId>
            <artifactId>filesystem-hadoop3-runtime</artifactId>
            <version>0.7.0-incubating-SNAPSHOT</version>
        </dependency>

As suggested by @xloya, there may be conflicts if we include hadoop-common and hadoop-hdfs-client in filesystem-hadoop3-runtime, because some query engines such as Spark and Trino may already have HDFS-related jars in their runtime environment.

The reason we need to include hadoop-hdfs-client is that filesystem-hadoop3-runtime shades hadoop-catalog, and hadoop-catalog contains DistributedFileSystem initialization logic (HDFSFileSystemProvider), even though I only want to use GCS.

@yuqi1129
Contributor Author

yuqi1129 commented Oct 17, 2024

How does a user use gcp-bundle with gvfs? Can you please give me an example?

You can first run GravitinoVirtualFileSystemIT.testCreate in deploy mode and block it, so that the Gravitino server is started and stays running.

  1. Add the following dependency:
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>3.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.gravitino</groupId>
            <artifactId>gcp-bundle</artifactId>
            <version>0.7.0-incubating-SNAPSHOT</version>
        </dependency>

        <dependency>
            <groupId>org.apache.gravitino</groupId>
            <artifactId>filesystem-hadoop3-runtime</artifactId>
            <version>0.7.0-incubating-SNAPSHOT</version>
        </dependency>
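  2. The following is example code:
  // Imports needed by this snippet.
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public static void main(String[] args) throws IOException {

    // Register gvfs as a Hadoop file system and point it at the Gravitino server.
    Configuration conf = new Configuration();
    conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
    conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
    conf.set("fs.gvfs.impl.disable.cache", "true");
    conf.set("fs.gravitino.server.uri", "http://127.0.0.1:8090");
    conf.set("fs.gravitino.client.metalake", "gvfs_it_metalake_1fd37007");

    // Pass this configuration to the real file system.
    // SERVICE_ACCOUNT_FILE is the path to the service account key file
    // (the IT declares it with the placeholder value "YOUR_KEY_FILE").
    conf.set("gravitino.bypass.fs.gs.auth.service.account.enable", "true");
    conf.set("gravitino.bypass.fs.gs.auth.service.account.json.keyfile", SERVICE_ACCOUNT_FILE);
    // FS_FILESYSTEM_PROVIDERS is a gvfs configuration-key constant (its definition is not shown here).
    conf.set(FS_FILESYSTEM_PROVIDERS, "gcs");

    String gvfsPath = "gvfs://fileset/catalog_704cac87/schema_858f9d26/test_fileset_create_88eaed4e";
    Path path = new Path(gvfsPath);

    // Resolve the gvfs file system for the fileset path.
    FileSystem f = path.getFileSystem(conf);

    System.out.println("fileSystem: " + f);

    String filePath = gvfsPath + "/test.txt";

    // Check whether the file exists in the fileset (backed by GCS).
    System.out.println(f.exists(new Path(filePath)));
  }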
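
For readers unfamiliar with the path used in the example: judging only from that path, it appears to follow the layout gvfs://fileset/<catalog>/<schema>/<fileset>/<sub path>. The sketch below splits such a path with plain java.net.URI. It is an illustration under that assumption; the class GvfsPathSketch, the type FilesetPath, and the method parse are hypothetical names and are not part of Gravitino or gvfs.

```java
import java.net.URI;

public class GvfsPathSketch {

  /** Parts of a gvfs fileset path (hypothetical helper type, not a Gravitino class). */
  public static final class FilesetPath {
    public final String catalog;
    public final String schema;
    public final String fileset;
    public final String subPath; // "" when the path points at the fileset root

    FilesetPath(String catalog, String schema, String fileset, String subPath) {
      this.catalog = catalog;
      this.schema = schema;
      this.fileset = fileset;
      this.subPath = subPath;
    }
  }

  /** Splits a path of the assumed form gvfs://fileset/catalog/schema/fileset[/sub/path]. */
  public static FilesetPath parse(String gvfsPath) {
    URI uri = URI.create(gvfsPath);
    if (!"gvfs".equals(uri.getScheme()) || !"fileset".equals(uri.getAuthority())) {
      throw new IllegalArgumentException("Not a gvfs fileset path: " + gvfsPath);
    }
    // The URI path looks like "/catalog/schema/fileset[/sub/path]".
    // A split limit of 5 keeps any sub path in one piece.
    String[] parts = uri.getPath().split("/", 5);
    if (parts.length < 4 || parts[1].isEmpty() || parts[2].isEmpty() || parts[3].isEmpty()) {
      throw new IllegalArgumentException("Expected catalog/schema/fileset in: " + gvfsPath);
    }
    String subPath = parts.length == 5 ? "/" + parts[4] : "";
    return new FilesetPath(parts[1], parts[2], parts[3], subPath);
  }

  public static void main(String[] args) {
    FilesetPath p = parse(
        "gvfs://fileset/catalog_704cac87/schema_858f9d26/test_fileset_create_88eaed4e/test.txt");
    System.out.println(p.catalog + " | " + p.schema + " | " + p.fileset + " | " + p.subPath);
  }
}
```

The sketch only does string handling; resolving the fileset to its GCS storage location is done by gvfs and the Gravitino server and is not modelled here.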

@jerryshao
I have confirmed locally and all are as expected.
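
A note on the gravitino.bypass. keys in the example: the comment there says they are passed to the real file system. One plausible reading is that the prefix is stripped and the remaining key (for example fs.gs.auth.service.account.enable) is what the GCS file system sees. The sketch below shows that idea with plain maps. It is an assumption about the convention, not code taken from gvfs; BypassPrefixSketch and stripBypassPrefix are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BypassPrefixSketch {

  // Prefix used by the example configuration; the "strip and forward" behaviour is assumed.
  static final String BYPASS_PREFIX = "gravitino.bypass.";

  /** Returns only the prefixed entries, with the prefix removed from each key. */
  public static Map<String, String> stripBypassPrefix(Map<String, String> conf) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : conf.entrySet()) {
      String key = entry.getKey();
      if (key.startsWith(BYPASS_PREFIX)) {
        result.put(key.substring(BYPASS_PREFIX.length()), entry.getValue());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("fs.gravitino.server.uri", "http://127.0.0.1:8090");
    conf.put("gravitino.bypass.fs.gs.auth.service.account.enable", "true");
    conf.put("gravitino.bypass.fs.gs.auth.service.account.json.keyfile", "YOUR_KEY_FILE");
    // Only the two bypass entries are kept, now keyed by their fs.gs.* names.
    System.out.println(stripBypassPrefix(conf));
  }
}
```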

@jerryshao jerryshao merged commit 93cdbc2 into apache:main Oct 17, 2024
26 checks passed
mplmoknijb pushed a commit to mplmoknijb/gravitino that referenced this pull request Nov 6, 2024

Successfully merging this pull request may close these issues.

[FEATURE] Support GCS for fileset catalog
3 participants