Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ThreadMapperIterDatapipe #1045

Closed
SvenDS9 opened this issue Feb 24, 2023 · 1 comment
Closed

Add ThreadMapperIterDatapipe #1045

SvenDS9 opened this issue Feb 24, 2023 · 1 comment

Comments

@SvenDS9
Copy link
Contributor

SvenDS9 commented Feb 24, 2023

🚀 The feature

Similar to #1044 (thanks @ejguan!) I propose to add a new datapipe that uses ThreadPoolExecutor to multithread mapping.

Motivation, pitch

Speed up mapping by using Multithreading

Alternatives

Three possible implementations come to my mind.

Which option do you prefer? We can of course also implement e.g. both option 1 and 3.

Additional context

I am not sure how (if at all) the ThreadPoolExecutor interferes/interacts with multiprocessing used in the Dataloader.

@SvenDS9 SvenDS9 changed the title ThreadMapperDatapipe Add ThreadMapperIterDatapipe Feb 24, 2023
@ejguan
Copy link
Contributor

ejguan commented Feb 24, 2023

One disadvantage of this is that the first item can only be returned once all operations in the batch have finished.

Yeah. But, tbh, if we want to yield whenever the element is ready, users can always do dp.map(map_fn).prefetch(buffer_size).

I think we want to preserver order within batch (option 1) at least for now. Otherwise, the whole pipeline becomes nondeterministic. In the future, we might be able to design a mechanism like global switch to enable/disable deterministic training.

I am not sure how (if at all) the ThreadPoolExecutor interferes/interacts with multiprocessing used in the Dataloader.

I am not aware of any blocker for it. We can add more intensive tests to validate it.

As a follow up if you want to implement this DataPipe, we might want to do a benchmarking for LAION example between asyncio and threading. Then, we will be able to provide better recommendation for users on the choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants