-
Notifications
You must be signed in to change notification settings - Fork 381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#2059] feat(client-python): Support Gravitino Virtual FileSystem in Python #3528
[#2059] feat(client-python): Support Gravitino Virtual FileSystem in Python #3528
Conversation
2420ab8
to
18dcf58
Compare
52b038a
to
230b4d4
Compare
08f8477
to
c25726c
Compare
Hi @xloya I just found that fsspec have their wrapper for PyArrow and HadoopFileSystem, maybe we can develop our HDFS solution based on theirs? |
Yeah, actually current implementation is customized based on the |
Yeah, but why not just use their implementation and add our features (ex. lock, custom handlers) on them? |
|
Btw, the code is not complete yet, I need to do more tests about the interface compatibility. |
Sure, would love to help if needed. |
Does it support read/write now? I would like to do a local test to see the current status. |
Currently I have tested the following interfaces and they can all be executed normally with the fileset mounts on HDFS:
The python script:
|
a9abe92
to
8156fe6
Compare
Currently, Python integration testings lacks the support for some Docker environments, so I am considering not adding HDFS integration testings to the current PR, but supporting it separately in another PR (#3760) to reduce the difficulty of code review. |
@jerryshao @xunliu @shaofengshi @noidname01 This PR is ready for review, could you take a look when you have time? Thanks. |
I will finish the review before this weekend, thanks a lot @xloya for your work. |
@noidname01 can you please also help to review? |
Sure, I will review on this weekend. |
3881a31
to
7c87f33
Compare
"test_pandas", f"{_fileset_dir}/test_pandas" | ||
), | ||
) | ||
def test_pandas(self, mock_method1, mock_method2, mock_method3, mock_method4): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some unit tests for engines' apis like pandas, pyarrow, llama_index.
@jerryshao @noidname01 Have addressed the comments, could you take a look again? Thanks. |
LGTM, @jerryshao WDYT? |
Sorry for late response, I need to take another round of review, will finish this before Tuesday. |
readerwriterlock==1.0.9 | ||
fsspec==2024.3.1 | ||
pyarrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to specify the version of pyarrow
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pyarrow version is already specified in requirement-dev.txt
, which I think is no longer needed here.
@@ -6,3 +6,7 @@ pylint==3.2.2 | |||
black==24.4.2 | |||
twine==5.1.0 | |||
coverage==7.5.1 | |||
pandas==2.0.3 | |||
pyarrow==15.0.2 | |||
llama-index==0.10.40 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need package like pandas
, llama-index
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because unit tests and integration tests will include tests with these third-party libs' APIs. In order to ensure the stability of unit tests and integration tests, the corresponding feasible versions are specified here. I'm not sure if we need to add tests for integration with third-party libs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @xloya for your contribution.
…em in Python (apache#3528) ### What changes were proposed in this pull request? Support Gravitino Virtual File System in Python so that we can read and write Fileset storage data. The first PR only supports HDFS. After research, the following popular cloud storages or companies have implemented their own FileSystem based on fsspec(https://filesystem-spec.readthedocs.io/en/latest/index.html): 1. S3(https://github.com/fsspec/s3fs) 2. Azure(https://github.com/fsspec/adlfs) 3. Gcs(https://github.com/fsspec/gcsfs) 4. OSS(https://github.com/fsspec/ossfs) 5. Databricks(https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/dbfs.py) 6. Snowflake(https://github.com/snowflakedb/snowflake-ml-python), So this PR will implement GVFS based on the fsspec interface. ### Why are the changes needed? Fix: apache#2059 ### How was this patch tested? Add some UTs and ITs. --------- Co-authored-by: xiaojiebao <[email protected]>
…tion tests (#3876) ### What changes were proposed in this pull request? Add Hive Docker env for client-python, and add integration tests for PyGVFS + HDFS. Depends on #3528. ### Why are the changes needed? Fix: #3760 ### How was this patch tested? Add some ITs. --------- Co-authored-by: xiaojiebao <[email protected]>
…Python (#3528) ### What changes were proposed in this pull request? Support Gravitino Virtual File System in Python so that we can read and write Fileset storage data. The first PR only supports HDFS. After research, the following popular cloud storages or companies have implemented their own FileSystem based on fsspec(https://filesystem-spec.readthedocs.io/en/latest/index.html): 1. S3(https://github.com/fsspec/s3fs) 2. Azure(https://github.com/fsspec/adlfs) 3. Gcs(https://github.com/fsspec/gcsfs) 4. OSS(https://github.com/fsspec/ossfs) 5. Databricks(https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/dbfs.py) 6. Snowflake(https://github.com/snowflakedb/snowflake-ml-python), So this PR will implement GVFS based on the fsspec interface. ### Why are the changes needed? Fix: #2059 ### How was this patch tested? Add some UTs and ITs. --------- Co-authored-by: xiaojiebao <[email protected]> (cherry picked from commit cddef88)
What changes were proposed in this pull request?
Support Gravitino Virtual File System in Python so that we can read and write Fileset storage data. The first PR only supports HDFS.
After research, the following popular cloud storages or companies have implemented their own FileSystem based on fsspec(https://filesystem-spec.readthedocs.io/en/latest/index.html):
So this PR will implement GVFS based on the fsspec interface.
Why are the changes needed?
Fix: #2059
How was this patch tested?
Add some UTs and ITs.