Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataLoader2] Deep copy DataPipe during initialization #786

Closed
wants to merge 3 commits into from

Conversation

NivekT
Copy link
Contributor

@NivekT NivekT commented Sep 22, 2022

Stack from ghstack:

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Differential Revision: D39741743


BC Breaking Note:

Previously, if a DataPipe is being passed to multiple DataLoaders, the DataPipe's state can be altered by any of those DataLoaders. In some cases, that may raise exception due to the single iterator constraint; in other cases, some behaviors can be changed due to the adapters (e.g. shuffling) of another DataLoader.

Consider the code snippet below:

dp = IterableWrapper([0, 1, 2, 3, 4])
dl1 = DataLoader2(dp)
dl2 = DataLoader2(dp)
for x, y in zip(dl1, dl2):
    print(x, y)

Prior to this change, since the DataPipe objects are identical for each DataLoader2, it would've violated the single iterator constraint and raised an exception.

RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...

After this change, since the DataPipe objects are deep copied within the DataLoader2s, they are no longer the same object within each. You can iterate through them simultaneously without issue:

0 0
1 1
2 2
3 3
4 4

NivekT added a commit that referenced this pull request Sep 22, 2022
ghstack-source-id: 43fd0384f875c9c81a994de831951b114f8499c2
Pull Request resolved: #786
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 22, 2022
@NivekT
Copy link
Contributor Author

NivekT commented Sep 22, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@NivekT NivekT changed the title [DataLoader2] Deep copy datapipe during initialization [DataLoader2] Deep copy DataPipe during initialization Sep 22, 2022
@NivekT NivekT added the topic: improvements topic category label Sep 22, 2022
@NivekT
Copy link
Contributor Author

NivekT commented Sep 22, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Sep 30, 2022
ghstack-source-id: 2b8ef5829badd33c82870d4e1753021d4cbb55a6
Pull Request resolved: #786
@NivekT
Copy link
Contributor Author

NivekT commented Sep 30, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 5, 2022
ghstack-source-id: 62bcadc0aae4e352f39839566b6a3a8bf7f9490e
Pull Request resolved: #786
@NivekT
Copy link
Contributor Author

NivekT commented Oct 5, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan ejguan mentioned this pull request Oct 12, 2022
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 14, 2022
Summary:
Pull Request resolved: pytorch#786

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after pytorch#746 (the other PR will have to rebase).

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D39741743

Pulled By: ejguan

fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
ejguan pushed a commit that referenced this pull request Oct 14, 2022
Summary:
Pull Request resolved: #786

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D39741743

Pulled By: ejguan

fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
@facebook-github-bot facebook-github-bot deleted the gh/NivekT/93/head branch October 15, 2022 14:19
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 23, 2022
Summary:
Pull Request resolved: pytorch#786

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after pytorch#746 (the other PR will have to rebase).

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D39741743

Pulled By: ejguan

fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
@ejguan
Copy link
Contributor

ejguan commented Oct 25, 2022

@NivekT Do you think we need to add BC breaking note to this PR?

@NivekT
Copy link
Contributor Author

NivekT commented Oct 25, 2022

@NivekT Do you think we need to add BC breaking note to this PR?

Actually yes, I think we should, even though I don't think users would intentionally rely on the state of DataPipes being shared among DataLoaders.

@ejguan
Copy link
Contributor

ejguan commented Oct 25, 2022

@NivekT Do you think we need to add BC breaking note to this PR?

Actually yes, I think we should, even though I don't think users would intentionally rely on the state of DataPipes being shared among DataLoaders.

Can you add a brief example in the bc breaking note that the same DataPipe used by multiple DLv2?

@NivekT
Copy link
Contributor Author

NivekT commented Oct 25, 2022

Will do by the afternoon.

@ejguan
Copy link
Contributor

ejguan commented Oct 25, 2022

Will do by the afternoon.

Thank you so much!

@NivekT NivekT added the topic: bc_breaking topic category label Oct 25, 2022
@NivekT
Copy link
Contributor Author

NivekT commented Oct 25, 2022

@ejguan Updated this PR's description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. topic: bc_breaking topic category topic: improvements topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants