
[#5188] feat(python-client): Support s3 fileset in python client #5209

Merged
merged 135 commits into apache:main on Oct 24, 2024

Conversation

yuqi1129 (Contributor):

What changes were proposed in this pull request?

Add support for S3 fileset in the Python client.

Why are the changes needed?

It addresses a user need.

Fix: #5188

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Replace the placeholders with a real S3 account and execute the following test:


./gradlew :clients:client-python:test -PskipDockerTests=false

comment="",
properties={
"filesystem-providers": "s3",
"gravitino.bypass.fs.s3a.access.key": cls.s3_access_key,
Contributor:

Also for the server side, maybe we should clearly define some configurations instead of using "gravitino.bypass." for everything. I have to think a bit on this; can you please also think about it from the user side?

yuqi1129 (Contributor Author), Oct 22, 2024:

@jerryshao
I will use #5220 to optimize it and won't change it in this PR.

jerryshao (Contributor):

@xloya can you please help to review?

Comment on lines 52 to 53
S3A = "s3a"
S3 = "s3"
Contributor:

I have the same question: since we only use the s3a scheme in S3FileSystemProvider (https://github.com/apache/gravitino/blob/main/bundles/aws-bundle/src/main/java/org/apache/gravitino/s3/fs/S3FileSystemProvider.java#L44), is there any case that will use the s3 scheme?

"AWS endpoint url is not found in the options."
)

return importlib.import_module("pyarrow.fs").S3FileSystem(
Contributor:

Sorry I didn't notice this before. GCS and S3 also have fsspec implementations (https://github.com/fsspec/gcsfs, https://github.com/fsspec/s3fs); how did you decide on PyArrow's implementation here?

Contributor Author:

PyArrow's implementation provides a uniform API to users; for example, combined with ArrowFSWrapper, we can support all kinds of storage through the API exposed by ArrowFSWrapper.

I have reviewed the fsspec implementations, and there seems to be no big difference compared to what PyArrow provides.

Considering the efficiency brought by Arrow, and since Arrow is already used for HDFS, I continued to use PyArrow.
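For illustration, a minimal sketch of this approach (the credentials, endpoint, and bucket names are placeholders, not values from this PR): a PyArrow S3FileSystem wrapped in fsspec's ArrowFSWrapper so callers get the familiar fsspec-style API.

from fsspec.implementations.arrow import ArrowFSWrapper
from pyarrow.fs import S3FileSystem

# Build the native Arrow filesystem; all credential values are placeholders.
arrow_s3_fs = S3FileSystem(
    access_key="<access-key>",
    secret_key="<secret-key>",
    endpoint_override="<endpoint-url>",
)

# Wrap it so fsspec-style calls (ls, open, rm, ...) delegate to Arrow.
fs = ArrowFSWrapper(arrow_s3_fs)
print(fs.ls("my-bucket/some/path"))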

Contributor:

In fact, PyArrow officially supports a limited number of storage systems; if you need to add a storage system, you have to modify the Arrow source code. HDFS uses PyArrow because fsspec itself also calls PyArrow there, so it is almost the only choice. For other storage, PyArrow may not be the only choice. My advice is not to be restricted by the current selection; we should make the best choice in terms of performance and interface adaptability.

Contributor Author:

> My advice is not to be restricted by the current selection. We should make the best choice in terms of performance and interface adaptability.

I agree with this point, and I also noticed that the set of filesystems PyArrow supports is very limited. Due to time constraints, I have not completed a comprehensive survey of it. Thanks for your suggestion; I will modify the code accordingly.

Contributor Author:

@xloya
I have replaced s3fs and gcsfs with arrowfs; please help take a look again.

cls.fs = ArrowFSWrapper(arrow_gcs_fs)
cls.fs = GCSFileSystem(token=cls.key_file)

# Object storage like GCS does not support making directory and can only create
Contributor:

I saw that gcsfs / s3fs support mkdir according to their docs; maybe you can test them in their expected situations: https://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem.mkdir, https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.mkdir.

Contributor Author:

The API does support mkdir, but it has no effect: the directory will not actually be created for S3 and GCS.
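A small sketch of the behavior described above, using s3fs directly (the bucket and directory names are placeholders): mkdir on a path inside a bucket returns without error, but no directory actually appears.

import s3fs

# Placeholder credentials; mkdir on a nested path is effectively a no-op.
fs = s3fs.S3FileSystem(key="<access-key>", secret="<secret-key>")
fs.mkdir("my-bucket/new_dir")          # returns without error
print(fs.exists("my-bucket/new_dir"))  # typically False: nothing was created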

Contributor:

It seems the s3fs code will create the bucket; could we have tests for this behavior: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L904.

Contributor Author:

Ok, let me check that the directory will not be created.

Contributor Author:

@xloya added.

fs = context_pair.filesystem()

# S3FileSystem doesn't support maxdepth
if isinstance(fs, self.lazy_load_class("s3fs", "S3FileSystem")):
Contributor:

In the latest doc, s3fs seems to support the maxdepth param (https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.rm); is there something wrong?

Contributor Author:

It actually does not, and the error will be:

======================================================================
ERROR: test_rm (test_gvfs_with_s3.TestGvfsWithS3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ec2-user/gravitino/clients/client-python/tests/integration/test_gvfs_with_s3.py", line 195, in test_rm
    fs.rm(rm_file)
  File "/home/ec2-user/gravitino/clients/client-python/gravitino/filesystem/gvfs.py", line 355, in rm
    context_pair.filesystem().rm(
  File "/home/ec2-user/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ec2-user/gravitino/.gradle/python/Linux/Miniforge3/envs/python-3.8/lib/python3.8/site-packages/fsspec/asyn.py", line 85, in sync
    coro = func(*args, **kwargs)
TypeError: _rm() takes from 2 to 3 positional arguments but 4 were given

Contributor:

I see. I checked the s3fs code; it indeed does not support maxdepth: https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L2001.
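For illustration, a hedged sketch of the guard this discussion points to; the helper name remove_path is hypothetical, not the PR's actual code.

import importlib

def remove_path(fs, path, recursive=False, maxdepth=None):
    # s3fs's _rm() does not accept maxdepth, so omit it for S3.
    s3_filesystem_cls = importlib.import_module("s3fs").S3FileSystem
    if isinstance(fs, s3_filesystem_cls):
        fs.rm(path, recursive=recursive)
    else:
        fs.rm(path, recursive=recursive, maxdepth=maxdepth)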

@@ -590,7 +606,8 @@ def _convert_actual_info(
"name": path,
"size": entry["size"],
"type": entry["type"],
"mtime": entry["mtime"],
# Some file systems may not support the `mtime` field.
Contributor:

It's better to specify which file systems do not support it.

Contributor Author:

Okay.
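A minimal sketch of the defensive lookup implied here (the function name is illustrative, not from the PR): fall back to None when a backend omits the mtime field.

def convert_info(path: str, entry: dict) -> dict:
    return {
        "name": path,
        "size": entry["size"],
        "type": entry["type"],
        # Object stores such as S3 and GCS may omit mtime.
        "mtime": entry.get("mtime"),
    }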

yuqi1129 (Contributor Author):

@xloya Everything has been resolved.


self.assertFalse(self.fs.exists(mkdir_actual_dir))
self.assertFalse(fs.exists(mkdir_dir))
self.assertFalse(self.fs.exists("gs://" + new_bucket))
Contributor:

Why is it false here? Does that mean creating the new bucket failed?

Contributor Author:

I mean GCS will fail to create the new bucket and silently ignore the error; GCP does not allow creating a bucket through the FileSystem API.

Contributor:

I see.

xloya (Contributor) commented Oct 24, 2024:

LGTM.


GVFS_FILESYSTEM_S3_ACCESS_KEY = "s3_access_key"
GVFS_FILESYSTEM_S3_SECRET_KEY = "s3_secret_key"
GVFS_FILESYSTEM_S3_ENDPOINT = "s3_endpoint"
Contributor:

I suggest that we also redefine the Java-side GVFS and Hadoop catalog s3/oss/gcs related configurations.

Contributor:

We can do this in a separate PR.
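For context, a hedged sketch of how a client might pass these options when constructing a gvfs filesystem; the server URI, metalake name, credential values, and path are placeholders, and the exact constructor arguments are an assumption based on the client's gvfs module, not confirmed by this thread.

from gravitino.filesystem import gvfs

# All values below are placeholders.
options = {
    "s3_access_key": "<access-key>",
    "s3_secret_key": "<secret-key>",
    "s3_endpoint": "<endpoint-url>",
}
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="my_metalake",
    options=options,
)
fs.ls("fileset/my_catalog/my_schema/my_fileset")  # placeholder virtual path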

jerryshao merged commit cefe316 into apache:main on Oct 24, 2024
21 checks passed
mplmoknijb pushed a commit to mplmoknijb/gravitino that referenced this pull request on Nov 6, 2024 (apache#5209).
Successfully merging this pull request may close these issues.

[FEATURE] Support S3 fileset for python GVFS client