Add laion5b-example with Dataloader2 #1034

Closed

Conversation


@SvenDS9 SvenDS9 commented Feb 20, 2023

This is an example that uses DataPipes to download and preprocess the laion5b dataset (https://laion.ai/blog/laion-5b/), or more precisely this subset: https://huggingface.co/datasets/laion/laion2B-en-joined. It also uses DataLoader2 for multiprocessing.

Changes

  • Load metadata from Huggingface and filter it
  • Load images from the URLs
  • Access image metadata and print out the label and copyright information

Unfortunately I made a mistake while rebasing in #1017 so I had to reopen the PR.
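
For reference, here is a minimal sketch of the kind of DataPipes + DataLoader2 pipeline described above. It is not the PR's actual example code; the URLs, the `load_image` helper, and the worker count are placeholders standing in for the filtered LAION metadata and the real preprocessing:

```python
import io

import requests
from PIL import Image
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper


def load_image(sample_url):
    # Decode inside the worker so the batch only holds picklable objects
    # (a str and a PIL image), not a live HTTP response stream.
    try:
        r = requests.get(sample_url, timeout=5)
        return {"url": sample_url, "image": Image.open(io.BytesIO(r.content))}
    except Exception:
        return None


# Placeholder URLs standing in for the image links from the filtered metadata.
urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

dp = (
    IterableWrapper(urls)
    .sharding_filter()                        # shard work across worker processes
    .map(load_image)
    .filter(lambda sample: sample is not None)
)

if __name__ == "__main__":
    rs = MultiProcessingReadingService(num_workers=2)
    dl = DataLoader2(dp, reading_service=rs)
    for sample in dl:
        print(sample["url"], sample["image"].size)
    dl.shutdown()
```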


SvenDS9 commented Feb 20, 2023

To use DataLoader2, I have to use r.content instead of:

return url, StreamWrapper(r.raw)

This causes the following problem:
Traceback (most recent call last):
File "\lib\multiprocessing\queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "\lib\multiprocessing\reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_io.BufferedReader' object
r.raw doesn't serialize very well. We probably need to change this in HTTPReader in the future. It was originally introduced in #51 to allow using seek in TarArchiveLoader (#42).
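
The failure can be reproduced outside of DataLoader2 with plain pickle; the URL below is just a placeholder:

```python
import pickle

import requests

r = requests.get("https://example.com", stream=True)

try:
    # r.raw wraps the socket's buffered reader, which cannot be pickled.
    pickle.dumps(r.raw)
except TypeError as e:
    print(e)  # e.g. "cannot pickle '_io.BufferedReader' object"

# r.content reads the whole body into memory as bytes, which pickles fine.
print(len(pickle.dumps(r.content)))
```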

@SvenDS9 SvenDS9 mentioned this pull request Feb 20, 2023
@ejguan ejguan left a comment


Thank you. LGTM

@facebook-github-bot

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


ejguan commented Feb 21, 2023

> To use DataLoader2, I have to use r.content instead of:
>
> return url, StreamWrapper(r.raw)
>
> This causes the following problem:
> Traceback (most recent call last):
> File "\lib\multiprocessing\queues.py", line 244, in _feed
> obj = _ForkingPickler.dumps(obj)
> File "\lib\multiprocessing\reduction.py", line 51, in dumps
> cls(buf, protocol).dump(obj)
> TypeError: cannot pickle '_io.BufferedReader' object
> r.raw doesn't serialize very well. We probably need to change this in HTTPReader in the future. It was originally introduced in #51 to allow using seek in TarArchiveLoader (#42).

I am not sure why r.raw is sent through the mp.queue. The whole pipeline is supposed to stay in the worker processes, so r.raw should already have been processed by load_image there. The only thing passed from a worker process to the main process is the batch of data, right?


SvenDS9 commented Feb 21, 2023

> I am not sure why r.raw is sent through the mp.queue. The whole pipeline is supposed to stay in the worker processes, so r.raw should already have been processed by load_image there. The only thing passed from a worker process to the main process is the batch of data, right?

Previously, load_image() didn't call PIL.Image.open() but just returned the result of _get_response_from_http() (or None if loading failed). Therefore the batches still contained the r.raw objects. If the batches contain r.content instead, you do not run into any issues.

So one needs to be careful when using HTTPReader with multiprocessing: the response has to be consumed inside the worker before the sample is passed to another process, or else pickling will fail.
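
A sketch of that distinction, assuming torchdata's HttpReader (which yields (url, stream) pairs) and a placeholder URL; the decode helper is illustrative, not the example's actual code:

```python
from io import BytesIO

from PIL import Image
from torchdata.datapipes.iter import HttpReader, IterableWrapper

urls_dp = IterableWrapper(["https://example.com/a.jpg"]).sharding_filter()

# Problematic with multiprocessing: the streams yielded by HttpReader end up
# in the batch and must be pickled when the batch is sent to the main process.
raw_dp = HttpReader(urls_dp)


def decode(item):
    url, stream = item
    # Consume the stream inside the worker; only the URL string and a decoded
    # PIL image (both picklable) cross the process boundary.
    return url, Image.open(BytesIO(stream.read()))


decoded_dp = HttpReader(urls_dp).map(decode)
```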


ejguan commented Feb 21, 2023

> Previously, load_image() didn't call PIL.Image.open() but just returned the result of _get_response_from_http() (or None if loading failed). Therefore the batches still contained the r.raw objects. If the batches contain r.content instead, you do not run into any issues.

I see. IIRC, r.content basically downloads the entire response body into memory, which eliminates this problem.

> So one needs to be careful when using HTTPReader with multiprocessing: the response has to be consumed inside the worker before the sample is passed to another process, or else pickling will fail.

Makes sense. TBH, in the majority of use cases, I would expect users to run all decoding operations in the worker processes, so I would assume they don't have to pass r.raw from a worker to the main process.

@facebook-github-bot

@ejguan merged this pull request in 6ca4402.

@NivekT NivekT mentioned this pull request Feb 21, 2023
NivekT pushed a commit that referenced this pull request Feb 21, 2023
Summary:
This is an example that uses Datapipes to download and preprocess the [laion5b](https://laion.ai/blog/laion-5b/)-dataset (to be more precise [this subset](https://huggingface.co/datasets/laion/laion2B-en-joined)). Also uses Dataloader2 for multiprocessing.

### Changes
- Load metadata from Huggingface and filter
- Load images from the urls
- access metadata of image and print out label and copyright information

Unfortunately I made a mistake while rebasing in #1017 so I had to reopen the PR.

Pull Request resolved: #1034

Reviewed By: NivekT

Differential Revision: D43463022

Pulled By: ejguan

fbshipit-source-id: 2f1f2b8bcb3abee15a1935431a497532b95b1c8d
ejguan pushed a commit that referenced this pull request Feb 22, 2023