Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataLoader] Deep copy ReadingService during DL2 initialization #746

Closed
wants to merge 6 commits into from

Conversation

NivekT
Copy link
Contributor

@NivekT NivekT commented Aug 18, 2022

Stack from ghstack:

By creating a deep copy of the input ReadingService during the initialization of DataLoader2, we allow the instance of ReadingService to store state without worrying about the users passing the same ReadingService to initialize additional DataLoaders.

For instance, PrototypeMultiProcessingReadingService stores processes and datapipes as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change.

Relevant test are added in subsequent PR related to lengths and eventually PrototypeMultiprocessingReadingService.

Differential Revision: D38868731


BC Breaking Note:

At the time when this PR landed, MultiProcessingReadingService was known as PrototypeMultiProcessingReadingService.

Previously, if a ReadingService is being passed to multiple DataLoaders, the ReadingService's state can be altered by any of those DataLoaders. This use case is not intended and can lead to silent incorrectness.

For instance, MultiProcessingReadingService stores processes and datapipes as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states.

Consider the code snippet below:

dp = IterableWrapper([0, 1, 2, 3, 4])
rs = MultiProcessingReadingService(num_workers=2)
dl1 = DataLoader2(dp, reading_service=rs)
dl2 = DataLoader2(dp, reading_service=rs)
next(iter(dl1))
print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
print(next(iter(dl2)))
# Note that we are still examining `dl1.read_service` below
print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")

Prior to this change, since the reading service objects are identical within each DataLoader2, initializing dl2 will change the state of the reading service that exists in dl1.

Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
Number of processes that exist in `dl1`'s RS after initializing `dl2`: 4

After this change, the reading service that exist in each DataLoader2 is independent and will not impact each other.

Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
Number of processes that exist in `dl1`'s RS after initializing `dl2`: 2

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 18, 2022
@NivekT NivekT added the topic: improvements topic category label Aug 18, 2022
@NivekT NivekT changed the title [DataLoader] Deep copy ReadingService during initialization [DataLoader] Deep copy ReadingService during DL2 initialization Aug 18, 2022
@@ -97,7 +111,7 @@ def __init__(
self.datapipe_adapter_fns = datapipe_adapter_fn
else:
self.datapipe_adapter_fns = [datapipe_adapter_fn]
self.reading_service = reading_service
self.reading_service = deepcopy(reading_service)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to import your PR to internal to validate if this is working properly with internal RS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on checking existing implementations.

@NivekT
Copy link
Contributor Author

NivekT commented Aug 19, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin
Copy link
Contributor

We should put this information into the ReadingService interface file too. To make sure that RS developers know about potential deep copy.

@VitalyFedyunin
Copy link
Contributor

One more though, we deepcopy RS, but we pickle-unpickle DP. I think we need to pick consistent approach.

@ejguan
Copy link
Contributor

ejguan commented Aug 19, 2022

One more though, we deepcopy RS, but we pickle-unpickle DP. I think we need to pick consistent approach.

Totally agree. And, I support deepcopy because it's more readable and even working with unpickable objects.

@NivekT
Copy link
Contributor Author

NivekT commented Aug 22, 2022

One more though, we deepcopy RS, but we pickle-unpickle DP. I think we need to pick consistent approach.

Can you elaborate on this? What make a consistent approach matter here?

@NivekT
Copy link
Contributor Author

NivekT commented Aug 22, 2022

Also, it seems like some internal ReadingService cannot be deep-copied. One alternative is for users to pass in a constructor instead of the object itself, but that will change the interface quite a bit.

@ejguan
Copy link
Contributor

ejguan commented Aug 23, 2022

Also, it seems like some internal ReadingService cannot be deep-copied. One alternative is for users to pass in a constructor instead of the object itself, but that will change the interface quite a bit.

It seems the Error comes from construction of process group in __init__ function. You might be able to move them into initialize function to make it working.

…ation"


By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we can allow the instance of ReadingService to store state, without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders.

For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. In a single object of that ReadingService is used to initialize multiple DataLoaders, it could've led to conflicts and incorrect states. This should no longer be an issue with this change.

Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`

Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731)

[ghstack-poisoned]
…ation"


By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we can allow the instance of ReadingService to store state, without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders.

For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. In a single object of that ReadingService is used to initialize multiple DataLoaders, it could've led to conflicts and incorrect states. This should no longer be an issue with this change.

Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`

Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731)

[ghstack-poisoned]
@NivekT
Copy link
Contributor Author

NivekT commented Aug 24, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

…ation"


By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders.

For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change.

Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`.

Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731)

[ghstack-poisoned]
@NivekT
Copy link
Contributor Author

NivekT commented Aug 25, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@ejguan ejguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving from OSS side. We might want @Miiira double check on the Diff.

And, could you please add a comment to ReadingServiceInterface that it's required to be deepcopied / picklable before initialize function.

…ation"


By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders.

For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change.

Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`.

Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731)

[ghstack-poisoned]
@NivekT
Copy link
Contributor Author

NivekT commented Aug 25, 2022

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT added a commit that referenced this pull request Sep 30, 2022
…itialization"


Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Sep 30, 2022
Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 5, 2022
…itialization"


Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Oct 5, 2022
Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Oct 11, 2022
Summary:
Pull Request resolved: #786

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D39741743

Pulled By: ejguan

fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
ejguan pushed a commit to ejguan/data that referenced this pull request Oct 14, 2022
Summary:
Pull Request resolved: pytorch#786

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after pytorch#746 (the other PR will have to rebase).

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D39741743

Pulled By: ejguan

fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
ejguan pushed a commit that referenced this pull request Oct 14, 2022
Summary:
Pull Request resolved: #786

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after #746 (the other PR will have to rebase).

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D39741743

Pulled By: ejguan

fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
…ation"


By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders.

For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change.

Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`.

Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731)

[ghstack-poisoned]
@NivekT
Copy link
Contributor Author

NivekT commented Oct 18, 2022

@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ejguan pushed a commit to ejguan/data that referenced this pull request Oct 23, 2022
Summary:
Pull Request resolved: pytorch#786

Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages.

This can be merged before or after pytorch#746 (the other PR will have to rebase).

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D39741743

Pulled By: ejguan

fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
@NivekT NivekT added the topic: bc_breaking topic category label Oct 25, 2022
@facebook-github-bot facebook-github-bot deleted the gh/NivekT/89/head branch November 21, 2022 15:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. topic: bc_breaking topic category topic: improvements topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants