-
Notifications
You must be signed in to change notification settings - Fork 151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DataLoader] Deep copy ReadingService during DL2 initialization #746
Conversation
[ghstack-poisoned]
torchdata/dataloader2/dataloader2.py
Outdated
@@ -97,7 +111,7 @@ def __init__( | |||
self.datapipe_adapter_fns = datapipe_adapter_fn | |||
else: | |||
self.datapipe_adapter_fns = [datapipe_adapter_fn] | |||
self.reading_service = reading_service | |||
self.reading_service = deepcopy(reading_service) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to import your PR to internal to validate if this is working properly with internal RS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on checking existing implementations.
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
We should put this information into the ReadingService interface file too. To make sure that RS developers know about potential deep copy. |
One more though, we deepcopy RS, but we pickle-unpickle DP. I think we need to pick consistent approach. |
Totally agree. And, I support |
Can you elaborate on this? What make a consistent approach matter here? |
Also, it seems like some internal ReadingService cannot be deep-copied. One alternative is for users to pass in a constructor instead of the object itself, but that will change the interface quite a bit. |
It seems the Error comes from construction of process group in |
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we can allow the instance of ReadingService to store state, without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. In a single object of that ReadingService is used to initialize multiple DataLoaders, it could've led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService` Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we can allow the instance of ReadingService to store state, without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. In a single object of that ReadingService is used to initialize multiple DataLoaders, it could've led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService` Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`. Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving from OSS side. We might want @Miiira double check on the Diff.
And, could you please add a comment to ReadingServiceInterface
that it's required to be deepcopied / picklable before initialize
function.
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`. Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
…itialization" Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743) [ghstack-poisoned]
Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743) [ghstack-poisoned]
…itialization" Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743) [ghstack-poisoned]
Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Differential Revision: [D39741743](https://our.internmc.facebook.com/intern/diff/D39741743) [ghstack-poisoned]
Summary: Pull Request resolved: #786 Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D39741743 Pulled By: ejguan fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
Summary: Pull Request resolved: pytorch#786 Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after pytorch#746 (the other PR will have to rebase). Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D39741743 Pulled By: ejguan fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
Summary: Pull Request resolved: #786 Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after #746 (the other PR will have to rebase). Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D39741743 Pulled By: ejguan fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
…ation" By creating a deep copy of the input `ReadingService` during the initialization of `DataLoader2`, we allow the instance of `ReadingService` to store state without worrying about the users passing the same `ReadingService` to initialize additional DataLoaders. For instance, `PrototypeMultiProcessingReadingService` stores `processes` and `datapipes` as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change. Relevant test are added in subsequent PR related to lengths and eventually `PrototypeMultiprocessingReadingService`. Differential Revision: [D38868731](https://our.internmc.facebook.com/intern/diff/D38868731) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Pull Request resolved: pytorch#786 Deep copy DataPipe during initialization to avoid sharing states if the same DataPipe is being passed to multiple usages. This can be merged before or after pytorch#746 (the other PR will have to rebase). Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D39741743 Pulled By: ejguan fbshipit-source-id: f1cdaca054e10bcb3672857b933732178b370535
Stack from ghstack:
By creating a deep copy of the input
ReadingService
during the initialization ofDataLoader2
, we allow the instance ofReadingService
to store state without worrying about the users passing the sameReadingService
to initialize additional DataLoaders.For instance,
PrototypeMultiProcessingReadingService
storesprocesses
anddatapipes
as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states. This should no longer be an issue with this change.Relevant test are added in subsequent PR related to lengths and eventually
PrototypeMultiprocessingReadingService
.Differential Revision: D38868731
BC Breaking Note:
At the time when this PR landed,
MultiProcessingReadingService
was known asPrototypeMultiProcessingReadingService
.Previously, if a ReadingService is being passed to multiple DataLoaders, the ReadingService's state can be altered by any of those DataLoaders. This use case is not intended and can lead to silent incorrectness.
For instance,
MultiProcessingReadingService
storesprocesses
anddatapipes
as instance variables. If a single object of that ReadingService is used to initialize multiple DataLoaders, it can led to conflicts and incorrect states.Consider the code snippet below:
Prior to this change, since the reading service objects are identical within each DataLoader2, initializing
dl2
will change the state of the reading service that exists indl1
.After this change, the reading service that exist in each DataLoader2 is independent and will not impact each other.