-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid pyarrow.fs
import for local storage
#14321
Avoid pyarrow.fs
import for local storage
#14321
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm +1000 on this change. Thanks @rjzamora !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestions, but LGTM
Is this sufficient? The import of pyarrow.dataset (here https://github.com/rapidsai/cudf/blob/branch-23.12/python/cudf/cudf/io/parquet.py#L18) also seems to provoke importing Can we add a test (maybe run in a subprocess) that checks that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support @wence-'s request for a test but otherwise LGTM.
This is a good point. However, do we care about import sys
import pyarrow
pa_mods = set(sys.modules)
_fs = "pyarrow._fs"
_s3fs = "pyarrow._s3fs"
print("import pyarrow.dataset")
import pyarrow.dataset
ds_mods = set(sys.modules)
print(f"{_fs} imported: {_fs in (ds_mods - pa_mods)}")
print(f"{_s3fs} imported: {_s3fs in (ds_mods - pa_mods)}")
print("import pyarrow.fs")
import pyarrow.fs
fs_mods = set(sys.modules)
print(f"{_s3fs} imported: {_s3fs in (fs_mods - ds_mods)}")
A test is a good idea. Let me know if you think that we can just check that |
I have some bad news. I started adding the test suggested by @wence- (thanks Lawrence!). However, I then discovered that pyarrow indeed imports In [1]: import sys
In [2]: mods = set(sys.modules)
In [3]: import pyarrow.parquet as pq
In [4]: after_mods = set(sys.modules)
In [5]: "pyarrow._s3fs" in after_mods
Out[5]: True |
@@ -533,3 +533,17 @@ def test_write_chunked_parquet(s3_base, s3so): | |||
actual.sort_values(["b"]).reset_index(drop=True), | |||
cudf.concat([df1, df2]).sort_values(["b"]).reset_index(drop=True), | |||
) | |||
|
|||
|
|||
def test_no_s3fs_on_cudf_import(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is useful, but adding it has rained a bit on our mini parade. We now see that the pyarrow._s3fs
problem pops up whenever we import pyarrow.parquet
or pyarrow.orc
. (So, much less constrained than the "remote storage only" case we originally expected).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing that investigation Rick. I see you've already adjusted the remainder of the code to take that into consideration and this test is still passing, so I'm good with things as they are now.
@@ -533,3 +533,17 @@ def test_write_chunked_parquet(s3_base, s3so): | |||
actual.sort_values(["b"]).reset_index(drop=True), | |||
cudf.concat([df1, df2]).sort_values(["b"]).reset_index(drop=True), | |||
) | |||
|
|||
|
|||
def test_no_s3fs_on_cudf_import(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing that investigation Rick. I see you've already adjusted the remainder of the code to take that into consideration and this test is still passing, so I'm good with things as they are now.
Given sentiment was positive regarding this change and to reduce probability of segfaults in Dask et al., I'm merging this. Thanks a lot @rjzamora for working on this change! |
/merge |
Description
This is not a resolution, but may help mitigate problems from aws/aws-sdk-cpp#2681
Checklist