[DataPipe] Adding kwargs for `fs.open()` in fsspec DataPipes #804
ghstack-source-id: c1b6d39078d855e3f04cbe0d617353b5901fa54a Pull Request resolved: #804
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
LGTM, thank you. And, could you please add corresponding tests, if possible?
Fixes #803 I left `FSSpecFileLister` untouched since I don't think it will be useful for `fs.ls()` to accept kwargs. Differential Revision: [D40038331](https://our.internmc.facebook.com/intern/diff/D40038331) [ghstack-poisoned]
ghstack-source-id: cafc1d14ee42d1d753987cb301617331d24f218c Pull Request resolved: #804
@dorbittonn This PR should allow you to pass Note that the usage of these We are looking into extending the same options for S3 DataPipes but it will be low priority for us if the
Thanks for the tag @NivekT, but using the s3fs library is not a good alternative for me because it is too slow.
@dorbittonn We recently ran our own benchmark on AWS EC2 to load data from S3, and we actually found that
Do you have an example of streaming data from S3 with fsspec and loading from tar?
This is what I have used, let me know if you need more info.
from torchdata.datapipes.iter import IterableWrapper
dp = IterableWrapper(s3_paths)  # Replace this with a DataPipe that yields paths to S3 archives
# Note that the tar mode is r| instead of r: because | enables streaming of uncompressed archives
dp = dp.open_files_by_fsspec(mode="rb", anon=True).load_from_tar(mode="r|")
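The `r|` versus `r:` distinction in the snippet above comes from Python's standard `tarfile` module, and it can be tried without S3 or TorchData. The following is a minimal sketch, assuming an in-memory archive; `ForwardOnlyStream`, `build_tar`, and `stream_members` are hypothetical helper names used only for illustration and are not part of TorchData or fsspec. The stream object deliberately exposes only `read()`, roughly like a remote file that cannot seek.

```python
import io
import tarfile


class ForwardOnlyStream:
    """Hypothetical stand-in for a remote file: it can be read, but not seeked."""

    def __init__(self, data: bytes) -> None:
        self._buf = io.BytesIO(data)

    def read(self, size: int = -1) -> bytes:
        return self._buf.read(size)


def build_tar(files: dict) -> bytes:
    """Build an uncompressed tar archive in memory from {name: payload}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def stream_members(stream):
    """Yield (name, bytes) pairs by reading the archive strictly front to back."""
    # "r|" opens the archive as a stream of tar blocks, so only read() is needed.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            extracted = tar.extractfile(member)
            if extracted is not None:
                # Read each member before advancing; a stream cannot go back.
                yield member.name, extracted.read()


archive = build_tar({"a.txt": b"hello", "b.txt": b"world"})
for name, payload in stream_members(ForwardOnlyStream(archive)):
    print(name, payload)
```

In stream mode the members have to be consumed in order, which is why the sketch reads each payload before moving to the next member.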
I get when trying to import torchdata after installing this PR
In order to install TorchData from source, you will need to install PyTorch from source as well.
Maybe try the nightly release of torchdata? It should include nightly PyTorch as well.
BTW, I will cherry-pick this PR to the release branch by the end of today.
Which version of PyTorch should I install? The latest commit, or is the 1.12.1 release enough?
@dorbittonn Please check this link. There is a
I tried it (with CPU, and with nightly GPU by replacing the URL you gave with https://download.pytorch.org/whl/nightly/torchrec_nightly_3.8_cu11.whl/ ). Thanks for your help!
Emmm. If you need to install Edit: And, this commit should be present in torchdata nightly. Could you provide the log when you execute
Are you missing an underscore? It should be
Summary: Pull Request resolved: #804 Fixes #803 I left `FSSpecFileLister` untouched since I don't think it will be useful for `fs.ls()` to accept kwargs. Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D40038331 Pulled By: NivekT fbshipit-source-id: 45232b938693690bc0906fc6240a104e80ef51f9
Stack from ghstack:
- [DataPipe] Adding kwargs for `fs.open()` in fsspec DataPipes #804

Fixes #803

I left `FSSpecFileLister` untouched since I don't think it will be useful for `fs.ls()` to accept kwargs.

Differential Revision: D40038331
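
The general idea of forwarding extra keyword arguments to an `open()` call can be sketched with the standard library alone. This is only an illustration of the pattern, not the code in this PR: the built-in `open()` stands in for `fs.open()`, and the function name `open_files` and the parameter name `open_kwargs` are assumptions made up for the example.

```python
import os
import tempfile
from typing import IO, Any, Dict, Iterable, Iterator, Optional, Tuple


def open_files(
    paths: Iterable[str],
    mode: str = "r",
    open_kwargs: Optional[Dict[str, Any]] = None,  # hypothetical parameter name
) -> Iterator[Tuple[str, IO]]:
    """Yield (path, file handle) pairs, forwarding open_kwargs to open()."""
    extra = open_kwargs or {}
    for path in paths:
        # The built-in open() stands in here for a filesystem's open() method.
        yield path, open(path, mode, **extra)


# Usage: the caller chooses the encoding without the opener knowing about it.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "wb") as f:
        f.write("café".encode("latin-1"))

    for path, handle in open_files([sample_path], open_kwargs={"encoding": "latin-1"}):
        with handle:
            print(os.path.basename(path), handle.read())
```

Keeping the extra options in a separate dictionary avoids mixing them with arguments meant for other calls, such as those used to construct the filesystem object.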