Add laion5b-example with Dataloader2 #1034
To use the Dataloader2 I have to use
This causes the following problem:

Traceback (most recent call last):
  File "\lib\multiprocessing\queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_io.BufferedReader' object

r.raw doesn't serialize very well. We probably need to change this in HTTPReader in the future. It was originally introduced in #51 to allow using seek in TarArchiveLoader (#42).
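The pickling failure in the traceback can be reproduced without any network access. The following is a minimal sketch, not the fix adopted in torchdata: a plain local file handle stands in for the HTTP response stream (an assumption made for illustration), and the names `stream`, `payload`, and `buffered` are hypothetical. It shows that an open `_io.BufferedReader` cannot be pickled, while the bytes read from it, wrapped in `io.BytesIO`, can be pickled and still support `seek`.

```python
import io
import os
import pickle
import tempfile

# Create a small local file that stands in for a downloaded payload.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"fake image bytes")
    path = tmp.name

# Opening a file in binary mode yields an io.BufferedReader.
stream = open(path, "rb")

# Sending an object between processes pickles it first; this step fails here.
try:
    pickle.dumps(stream)
    pickled_ok = True
    error_message = ""
except TypeError as exc:
    pickled_ok = False
    error_message = str(exc)

# Possible workaround: read the content eagerly and ship a BytesIO instead.
payload = stream.read()
stream.close()
os.remove(path)

# BytesIO objects are picklable and seekable, at the cost of holding
# the whole payload in memory.
buffered = pickle.loads(pickle.dumps(io.BytesIO(payload)))
buffered.seek(5)
tail = buffered.read()

print(pickled_ok, error_message)
print(tail)
```

The trade-off is memory: buffering the full response is only reasonable when individual items (such as single images) are small.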
Thank you. LGTM
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I am not sure why |
Previously So one needs to be careful when using |
I see. IIRC,
Makes sense. TBH, in the majority of use cases, I would expect users to run all decoding operations in worker processes. So, I would assume they don't have to pass
Summary: This is an example that uses Datapipes to download and preprocess the [laion5b](https://laion.ai/blog/laion-5b/)-dataset (to be more precise [this subset](https://huggingface.co/datasets/laion/laion2B-en-joined)). Also uses Dataloader2 for multiprocessing.

### Changes

- Load metadata from Huggingface and filter
- Load images from the urls
- access metadata of image and print out label and copyright information

Unfortunately I made a mistake while rebasing in #1017 so I had to reopen the PR.

Pull Request resolved: #1034

Reviewed By: NivekT

Differential Revision: D43463022

Pulled By: ejguan

fbshipit-source-id: 2f1f2b8bcb3abee15a1935431a497532b95b1c8d
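This is an example that uses DataPipes to download and preprocess the [laion5b](https://laion.ai/blog/laion-5b/) dataset (more precisely, [this subset](https://huggingface.co/datasets/laion/laion2B-en-joined)). It also uses DataLoader2 for multiprocessing.
Changes
- Load metadata from Hugging Face and filter it
- Load images from the URLs
- Access the metadata of each image and print out label and copyright information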
Unfortunately I made a mistake while rebasing in #1017 so I had to reopen the PR.