Add laion5b-example with Dataloader2 #1034

Closed

Conversation


@SvenDS9 SvenDS9 commented Feb 20, 2023

This is an example that uses DataPipes to download and preprocess the laion5b dataset (https://laion.ai/blog/laion-5b/), or more precisely this subset: https://huggingface.co/datasets/laion/laion2B-en-joined. It also uses DataLoader2 for multiprocessing.

Changes

  • Load metadata from Huggingface and filter it
  • Load images from the URLs
  • Access image metadata and print out the label and copyright information

Unfortunately I made a mistake while rebasing in #1017 so I had to reopen the PR.
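
For reference, here is a minimal sketch of the kind of DataPipes + DataLoader2 pipeline described above. It is not the PR's actual example code; the URLs, the `load_image` helper, and the worker count are placeholders standing in for the filtered LAION metadata and the real preprocessing:

```python
import io

import requests
from PIL import Image
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper


def load_image(sample_url):
    # Decode inside the worker so the batch only holds picklable objects
    # (a str and a PIL image), not a live HTTP response stream.
    try:
        r = requests.get(sample_url, timeout=5)
        return {"url": sample_url, "image": Image.open(io.BytesIO(r.content))}
    except Exception:
        return None


# Placeholder URLs standing in for the image links from the filtered metadata.
urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

dp = (
    IterableWrapper(urls)
    .sharding_filter()                        # shard work across worker processes
    .map(load_image)
    .filter(lambda sample: sample is not None)
)

if __name__ == "__main__":
    rs = MultiProcessingReadingService(num_workers=2)
    dl = DataLoader2(dp, reading_service=rs)
    for sample in dl:
        print(sample["url"], sample["image"].size)
    dl.shutdown()
```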


SvenDS9 commented Feb 20, 2023

To use DataLoader2, I have to use r.content instead of:

return url, StreamWrapper(r.raw)

This causes the following problem:
Traceback (most recent call last):
File "\lib\multiprocessing\queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_io.BufferedReader' object
r.raw doesn't serialize very well. We probably need to change this in HTTPReader in the future. It was originally introduced in #51 to allow using seek in TarArchiveLoader (#42).
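
The failure can be reproduced outside of DataLoader2 with plain pickle; the URL below is just a placeholder:

```python
import pickle

import requests

r = requests.get("https://example.com", stream=True)

try:
    # r.raw wraps the socket's buffered reader, which cannot be pickled.
    pickle.dumps(r.raw)
except TypeError as e:
    print(e)  # e.g. "cannot pickle '_io.BufferedReader' object"

# r.content reads the whole body into memory as bytes, which pickles fine.
print(len(pickle.dumps(r.content)))
```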

@SvenDS9 SvenDS9 mentioned this pull request Feb 20, 2023
@ejguan ejguan left a comment


Thank you. LGTM

@facebook-github-bot

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


ejguan commented Feb 21, 2023

> To use DataLoader2, I have to use r.content instead of:
>
> return url, StreamWrapper(r.raw)
>
> This causes the following problem:
> Traceback (most recent call last):
> File "\lib\multiprocessing\queues.py", line 244, in _feed
> obj = _ForkingPickler.dumps(obj)
> File "\lib\multiprocessing\reduction.py", line 51, in dumps
> cls(buf, protocol).dump(obj)
> TypeError: cannot pickle '_io.BufferedReader' object
> r.raw doesn't serialize very well. We probably need to change this in HTTPReader in the future. It was originally introduced in #51 to allow using seek in TarArchiveLoader (#42).

I am not sure why r.raw is sent through the mp.queue. The whole pipeline is supposed to stay in the worker processes, so r.raw should already have been processed by load_image there. The only thing passed from a worker process to the main process is the batch of data, right?


SvenDS9 commented Feb 21, 2023

> I am not sure why r.raw is sent through the mp.queue. The whole pipeline is supposed to stay in the worker processes, so r.raw should already have been processed by load_image there. The only thing passed from a worker process to the main process is the batch of data, right?

Previously, load_image() didn't call PIL.Image.open() but just returned the result of _get_response_from_http() (or None if loading failed). Therefore the batches still contained the r.raw objects. If the batches contain r.content instead, you do not run into any issues.

So one needs to be careful when using HTTPReader with multiprocessing: the response has to be consumed inside the worker before the sample is passed to another process, or else pickling will fail.
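
A sketch of that distinction, assuming torchdata's HttpReader (which yields (url, stream) pairs) and a placeholder URL; the decode helper is illustrative, not the example's actual code:

```python
from io import BytesIO

from PIL import Image
from torchdata.datapipes.iter import HttpReader, IterableWrapper

urls_dp = IterableWrapper(["https://example.com/a.jpg"]).sharding_filter()

# Problematic with multiprocessing: the streams yielded by HttpReader end up
# in the batch and must be pickled when the batch is sent to the main process.
raw_dp = HttpReader(urls_dp)


def decode(item):
    url, stream = item
    # Consume the stream inside the worker; only the URL string and a decoded
    # PIL image (both picklable) cross the process boundary.
    return url, Image.open(BytesIO(stream.read()))


decoded_dp = HttpReader(urls_dp).map(decode)
```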


ejguan commented Feb 21, 2023

> Previously, load_image() didn't call PIL.Image.open() but just returned the result of _get_response_from_http() (or None if loading failed). Therefore the batches still contained the r.raw objects. If the batches contain r.content instead, you do not run into any issues.

I see. IIRC, r.content basically downloads the entire response body into memory, which eliminates this problem.

> So one needs to be careful when using HTTPReader with multiprocessing: the response has to be consumed inside the worker before the sample is passed to another process, or else pickling will fail.

Makes sense. TBH, in the majority of use cases, I would expect users to run all decoding operations in the worker processes, so I would assume they don't have to pass r.raw from a worker to the main process.

@facebook-github-bot

@ejguan merged this pull request in 6ca4402.

@NivekT NivekT mentioned this pull request Feb 21, 2023
NivekT pushed a commit that referenced this pull request Feb 21, 2023
Summary:
This is an example that uses Datapipes to download and preprocess the [laion5b](https://laion.ai/blog/laion-5b/)-dataset (to be more precise [this subset](https://huggingface.co/datasets/laion/laion2B-en-joined)). Also uses Dataloader2 for multiprocessing.

### Changes
- Load metadata from Huggingface and filter
- Load images from the urls
- access metadata of image and print out label and copyright information

Unfortunately I made a mistake while rebasing in #1017 so I had to reopen the PR.

Pull Request resolved: #1034

Reviewed By: NivekT

Differential Revision: D43463022

Pulled By: ejguan

fbshipit-source-id: 2f1f2b8bcb3abee15a1935431a497532b95b1c8d
ejguan pushed a commit that referenced this pull request Feb 22, 2023