[#5188] feat(python-client): Support s3 fileset in python client #5209
Conversation
comment="", | ||
properties={ | ||
"filesystem-providers": "s3", | ||
"gravitino.bypass.fs.s3a.access.key": cls.s3_access_key, |
Also, for the server side, maybe we should clearly define some configurations rather than using "gravitino.bypass." for everything. I have to think a bit about this; can you please also think about it from the user's side?
@jerryshao I will use #5220 to optimize it and won't change it in this PR.
@xloya can you please help review?
S3A = "s3a" | ||
S3 = "s3" |
I have the same question: since we only use the s3a scheme in the S3FileSystemProvider (https://github.com/apache/gravitino/blob/main/bundles/aws-bundle/src/main/java/org/apache/gravitino/s3/fs/S3FileSystemProvider.java#L44), is there any case that will use the s3 scheme?
"AWS endpoint url is not found in the options." | ||
) | ||
|
||
return importlib.import_module("pyarrow.fs").S3FileSystem( |
Sorry, I didn't notice this before. GCS and S3 also have fsspec implementations (https://github.com/fsspec/gcsfs, https://github.com/fsspec/s3fs); how did you weigh these options when choosing PyArrow's implementation here?
PyArrow's implementation provides a uniform API to users. For example, combined with ArrowFSWrapper, we can support all kinds of storage through the API exposed by ArrowFSWrapper. I have looked at the fsspec implementation, and it seems there is no big difference compared to the one provided by pyarrow. Considering the efficiency brought by arrow, and that arrow is already used for HDFS, I continue to use pyarrow.
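To make this concrete, a minimal sketch of that approach (credential values are placeholders, not taken from this PR): a pyarrow S3FileSystem wrapped in fsspec's ArrowFSWrapper exposes the same fsspec-style API as any other storage backend.

```python
# Sketch: expose pyarrow's S3 filesystem through the fsspec API via ArrowFSWrapper.
from pyarrow.fs import S3FileSystem
from fsspec.implementations.arrow import ArrowFSWrapper

arrow_s3_fs = S3FileSystem(
    access_key="<access-key>",         # placeholder credentials
    secret_key="<secret-key>",
    endpoint_override="<s3-endpoint>",
)
fs = ArrowFSWrapper(arrow_s3_fs)       # fsspec-compatible wrapper
fs.ls("my-bucket/path")                # same call shape as any fsspec filesystem
```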
In fact, PyArrow officially supports a limited number of storage systems; if you need to support additional storage systems, you have to modify the Arrow source code. HDFS uses PyArrow because fsspec's HDFS implementation also calls PyArrow underneath, so it is almost the only choice there. For other storage, PyArrow may not be the only choice. My advice is not to be restricted by the current selection; we should make the best choice in terms of performance and interface adaptability.
My advice is not to be restricted by the current selection. We should make the best choice in terms of performance and interface adaptability.
I agree with this point, and I also noticed that the file systems PyArrow supports are very limited. Due to time constraints, I have not done a comprehensive survey of this. Thanks for your suggestion; I will modify the code accordingly.
@xloya I have replaced the arrow-based file systems with s3fs and gcsfs; please help take a look again.
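For reference, a minimal sketch of the fsspec-native construction the code moves to (credential values are placeholders; only the GCSFileSystem(token=...) form is visible in the diff below):

```python
# Sketch: build the S3 and GCS filesystems directly with s3fs / gcsfs.
import s3fs
import gcsfs

s3 = s3fs.S3FileSystem(
    key="<access-key>",
    secret="<secret-key>",
    endpoint_url="<s3-endpoint>",
)
gcs = gcsfs.GCSFileSystem(token="<path-to-service-account-json>")
```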
cls.fs = ArrowFSWrapper(arrow_gcs_fs)
cls.fs = GCSFileSystem(token=cls.key_file)

# Object storage like GCS does not support making directory and can only create
I saw that gcsfs / s3fs support mkdir according to the docs; maybe you can test them in their expected situations: https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem.mkdir, https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.mkdir.
The API does support mkdir, but it has no effect: the directory will not be created for S3 and GCS.
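A minimal sketch of the kind of check this implies (bucket and path names are illustrative, not the actual test code):

```python
import gcsfs

# Sketch: mkdir against object storage returns without error, but no directory
# object is actually created; the same observation holds for s3fs.
fs = gcsfs.GCSFileSystem(token="<path-to-service-account-json>")
new_dir = "my-bucket/tmp/some_dir"
fs.mkdir(new_dir)              # no exception is raised
assert not fs.exists(new_dir)  # ...yet the directory does not exist afterwards
```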
It seems it will create the bucket in the code; could we have tests for this behavior? https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L904
OK, let me verify that the directory will not be created.
@xloya added.
fs = context_pair.filesystem()

# S3FileSystem doesn't support maxdepth
if isinstance(fs, self.lazy_load_class("s3fs", "S3FileSystem")):
In the latest docs, s3fs seems to support the maxdepth param (https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.rm); is there something wrong?
It actually does not, and the error is:
======================================================================
ERROR: test_rm (test_gvfs_with_s3.TestGvfsWithS3)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/ec2-user/gravitino/clients/client-python/tests/integration/test_gvfs_with_s3.py", line 195, in test_rm
fs.rm(rm_file)
File "/home/ec2-user/gravitino/clients/client-python/gravitino/filesystem/gvfs.py", line 355, in rm
context_pair.filesystem().rm(
File "/home/ec2-user/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/ec2-user/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 85, in sync
coro = func(*args, **kwargs)
TypeError: _rm() takes from 2 to 3 positional arguments but 4 were given
I see. I checked the s3fs code; it indeed does not support maxdepth: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L2001.
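A sketch of the resulting guard in gvfs.py (the helper name and surrounding variables are illustrative, not the exact code in this PR):

```python
def _remove_path(self, fs, actual_path, recursive=False, maxdepth=None):
    # Sketch: s3fs's rm() does not accept maxdepth, so only forward the
    # parameter to filesystems that support it.
    if isinstance(fs, self.lazy_load_class("s3fs", "S3FileSystem")):
        fs.rm(actual_path, recursive=recursive)
    else:
        fs.rm(actual_path, recursive=recursive, maxdepth=maxdepth)
```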
@@ -590,7 +606,8 @@ def _convert_actual_info(
    "name": path,
    "size": entry["size"],
    "type": entry["type"],
    "mtime": entry["mtime"],
    # Some file systems may not support the `mtime` field.
It's better to specify which file systems are not supported.
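For example, the conversion could read the field defensively (a sketch, not the exact code in _convert_actual_info):

```python
# Sketch: object stores such as S3/GCS (via fsspec) may omit "mtime" from the
# info dict, so read it defensively instead of indexing it directly.
file_info = {
    "name": path,
    "size": entry["size"],
    "type": entry["type"],
    "mtime": entry.get("mtime"),  # None when the backend does not report it
}
```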
Okay
@xloya Everything has been resolved.
self.assertFalse(self.fs.exists(mkdir_actual_dir))
self.assertFalse(fs.exists(mkdir_dir))
self.assertFalse(self.fs.exists("gs://" + new_bucket))
Why is it false here? Does that mean creating the new bucket failed?
I mean gcs will fail to create the new bucket and ignore the error; GCP does not allow creating a bucket through the FileSystem API.
I see.
LGTM.
GVFS_FILESYSTEM_S3_ACCESS_KEY = "s3_access_key"
GVFS_FILESYSTEM_S3_SECRET_KEY = "s3_secret_key"
GVFS_FILESYSTEM_S3_ENDPOINT = "s3_endpoint"
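For illustration, a sketch of how a client might pass these new options when constructing the virtual file system; the class and constructor argument names are assumptions, only the option keys come from the diff above.

```python
# Sketch: pass the S3 options to the Python gvfs client.
# Constructor argument names are assumed; the option keys are from this PR.
from gravitino.filesystem import gvfs

fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="my_metalake",
    options={
        "s3_access_key": "<access-key>",
        "s3_secret_key": "<secret-key>",
        "s3_endpoint": "<s3-endpoint>",
    },
)
```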
I suggest that we also redefine the Java-side gvfs and Hadoop catalog s3/oss/gcs related configurations.
We can do this in a separate PR.
What changes were proposed in this pull request?
Add support for S3 fileset in the Python client.
Why are the changes needed?
It is needed by users.
Fix: #5188
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
Replaced with a real S3 account and executed the following test.
./gradlew :clients:client-python:test -PskipDockerTests=false