
[#5188] feat(python-client): Support s3 fileset in python client #5209

Merged
merged 135 commits into apache:main on Oct 24, 2024

Conversation

yuqi1129 (Contributor):

What changes were proposed in this pull request?

Add support for S3 fileset in the Python client.

Why are the changes needed?

It addresses a user need.

Fix: #5188

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Replace the placeholders with a real S3 account and execute the following test:


./gradlew :clients:client-python:test -PskipDockerTests=false

comment="",
properties={
"filesystem-providers": "s3",
"gravitino.bypass.fs.s3a.access.key": cls.s3_access_key,
Contributor:

Also for the server side, maybe we should clearly define some configurations instead of using "gravitino.bypass." for everything. I have to think a bit on this; can you please also think about it from the user side?

yuqi1129 (Contributor Author), Oct 22, 2024:

@jerryshao
I will use #5220 to optimize it and won't change it in this PR.

jerryshao (Contributor):

@xloya can you please help to review?

Comment on lines 52 to 53
S3A = "s3a"
S3 = "s3"
Contributor:

I have the same question: since we only use the s3a scheme in S3FileSystemProvider (https://github.com/apache/gravitino/blob/main/bundles/aws-bundle/src/main/java/org/apache/gravitino/s3/fs/S3FileSystemProvider.java#L44), is there any case that will use the s3 scheme?

"AWS endpoint url is not found in the options."
)

return importlib.import_module("pyarrow.fs").S3FileSystem(
Contributor:

Sorry I didn't notice this before. GCS and S3 also have fsspec implementations (https://github.com/fsspec/gcsfs, https://github.com/fsspec/s3fs); how did you decide on PyArrow's implementation here?

Contributor Author:

PyArrow's implementation provides a uniform API to users; for example, combined with ArrowFSWrapper, we can support all kinds of storage through the API exposed by ArrowFSWrapper.

I have reviewed the fsspec implementations, and there seems to be no big difference compared to what PyArrow provides.

Considering the efficiency brought by Arrow, and since Arrow is already used for HDFS, I continued to use PyArrow.
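For illustration, a minimal sketch of this approach (the credentials, endpoint, and bucket names are placeholders, not values from this PR): a PyArrow S3FileSystem wrapped in fsspec's ArrowFSWrapper so callers get the familiar fsspec-style API.

from fsspec.implementations.arrow import ArrowFSWrapper
from pyarrow.fs import S3FileSystem

# Build the native Arrow filesystem; all credential values are placeholders.
arrow_s3_fs = S3FileSystem(
    access_key="<access-key>",
    secret_key="<secret-key>",
    endpoint_override="<endpoint-url>",
)

# Wrap it so fsspec-style calls (ls, open, rm, ...) delegate to Arrow.
fs = ArrowFSWrapper(arrow_s3_fs)
print(fs.ls("my-bucket/some/path"))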

Contributor:

In fact, PyArrow officially supports a limited number of storage systems; if you need to add a storage system, you have to modify the Arrow source code. HDFS uses PyArrow because fsspec itself also calls PyArrow there, so it is almost the only choice. For other storage, PyArrow may not be the only choice. My advice is not to be restricted by the current selection; we should make the best choice in terms of performance and interface adaptability.

Contributor Author:

> My advice is not to be restricted by the current selection. We should make the best choice in terms of performance and interface adaptability.

I agree with this point, and I also noticed that the set of filesystems PyArrow supports is very limited. Due to time constraints, I have not completed a comprehensive survey of it. Thanks for your suggestion; I will modify the code accordingly.

Contributor Author:

@xloya
I have replaced s3fs and gcsfs with arrowfs; please help take a look again.

cls.fs = ArrowFSWrapper(arrow_gcs_fs)
cls.fs = GCSFileSystem(token=cls.key_file)

# Object storage like GCS does not support making directory and can only create
Contributor:

I saw that gcsfs / s3fs support mkdir according to their docs; maybe you can test them in their expected situations: https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem.mkdir, https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.mkdir.

Contributor Author:

The API does support mkdir, but it has no effect: the directory will not actually be created for S3 and GCS.
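A small sketch of the behavior described above, using s3fs directly (the bucket and directory names are placeholders): mkdir on a path inside a bucket returns without error, but no directory actually appears.

import s3fs

# Placeholder credentials; mkdir on a nested path is effectively a no-op.
fs = s3fs.S3FileSystem(key="<access-key>", secret="<secret-key>")
fs.mkdir("my-bucket/new_dir")          # returns without error
print(fs.exists("my-bucket/new_dir"))  # typically False: nothing was created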

Contributor:

It seems the s3fs code will create the bucket; could we have tests for this behavior: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L904.

Contributor Author:

Ok, let me check that the directory will not be created.

Contributor Author:

@xloya added.

fs = context_pair.filesystem()

# S3FileSystem doesn't support maxdepth
if isinstance(fs, self.lazy_load_class("s3fs", "S3FileSystem")):
Contributor:

In the latest doc, s3fs seems to support the maxdepth param (https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.rm); is there something wrong?

Contributor Author:

It actually does not, and the error will be:

======================================================================
ERROR: test_rm (test_gvfs_with_s3.TestGvfsWithS3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/gravitino/clients/client-python/tests/integration/test_gvfs_with_s3.py", line 195, in test_rm
    fs.rm(rm_file)
  File "/home/ec2-user/gravitino/clients/client-python/gravitino/filesystem/gvfs.py", line 355, in rm
    context_pair.filesystem().rm(
  File "/home/ec2-user/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ec2-user/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 85, in sync
    coro = func(*args, **kwargs)
TypeError: _rm() takes from 2 to 3 positional arguments but 4 were given

Contributor:

I see. I checked the s3fs code; it indeed does not support maxdepth: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L2001.
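For illustration, a hedged sketch of the guard this discussion points to; the helper name remove_path is hypothetical, not the PR's actual code.

import importlib

def remove_path(fs, path, recursive=False, maxdepth=None):
    # s3fs's _rm() does not accept maxdepth, so omit it for S3.
    s3_filesystem_cls = importlib.import_module("s3fs").S3FileSystem
    if isinstance(fs, s3_filesystem_cls):
        fs.rm(path, recursive=recursive)
    else:
        fs.rm(path, recursive=recursive, maxdepth=maxdepth)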

@@ -590,7 +606,8 @@ def _convert_actual_info(
"name": path,
"size": entry["size"],
"type": entry["type"],
"mtime": entry["mtime"],
# Some file systems may not support the `mtime` field.
Contributor:

It's better to specify which file systems do not support it.

Contributor Author:

Okay.
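A minimal sketch of the defensive lookup implied here (the function name is illustrative, not from the PR): fall back to None when a backend omits the mtime field.

def convert_info(path: str, entry: dict) -> dict:
    return {
        "name": path,
        "size": entry["size"],
        "type": entry["type"],
        # Object stores such as S3 and GCS may omit mtime.
        "mtime": entry.get("mtime"),
    }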

yuqi1129 (Contributor Author):

@xloya Everything has been resolved.


self.assertFalse(self.fs.exists(mkdir_actual_dir))
self.assertFalse(fs.exists(mkdir_dir))
self.assertFalse(self.fs.exists("gs://" + new_bucket))
Contributor:

Why is it false here? Does that mean creating the new bucket failed?

Contributor Author:

I mean GCS will fail to create the new bucket and silently ignore the error; GCP does not allow creating a bucket through the FileSystem API.

Contributor:

I see.

xloya (Contributor) commented Oct 24, 2024:

LGTM.


GVFS_FILESYSTEM_S3_ACCESS_KEY = "s3_access_key"
GVFS_FILESYSTEM_S3_SECRET_KEY = "s3_secret_key"
GVFS_FILESYSTEM_S3_ENDPOINT = "s3_endpoint"
Contributor:

I suggest that we also redefine the Java-side GVFS and Hadoop catalog s3/oss/gcs related configurations.

Contributor:

We can do this in a separate PR.
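For context, a hedged sketch of how a client might pass these options when constructing a gvfs filesystem; the server URI, metalake name, credential values, and path are placeholders, and the exact constructor arguments are an assumption based on the client's gvfs module, not confirmed by this thread.

from gravitino.filesystem import gvfs

# All values below are placeholders.
options = {
    "s3_access_key": "<access-key>",
    "s3_secret_key": "<secret-key>",
    "s3_endpoint": "<endpoint-url>",
}
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="my_metalake",
    options=options,
)
fs.ls("fileset/my_catalog/my_schema/my_fileset")  # placeholder virtual path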

jerryshao merged commit cefe316 into apache:main on Oct 24, 2024
21 checks passed
mplmoknijb pushed a commit to mplmoknijb/gravitino that referenced this pull request on Nov 6, 2024 (apache#5209).
Successfully merging this pull request may close these issues.

[FEATURE] Support S3 fileset for python GVFS client