Determinism of local shuffle/random ops after `sharding_filter`
#885
facebook-github-bot pushed a commit that referenced this issue (Nov 11, 2022):

Summary: Add a `list_dps` function to list `DataPipes` from the graph.
- It's similar to [`get_all_graph_pipes`](https://github.com/pytorch/pytorch/blob/896fa8c5c9b0191c9621e04ab5e20057614d48ad/torch/utils/data/graph_settings.py#L19) from PyTorch core.
- An extra argument, `exclude_dps`, excludes the given `DataPipe` and its prior graph from the result.

Reason to add this function:
- It's required to set random states differently for `DataPipe`s before/after `sharding_filter`:

```py
graph = traverse_dps(datapipe)
sf_dps = find_dps(graph, ShardingFilter)

# DataPipes prior to `sharding_filter`
p_dps = []
for sf_dp in sf_dps:
    p_dps.extend(list_dps(traverse_dps(sf_dp)))

# DataPipes after `sharding_filter`
a_dps = list_dps(graph, exclude_dps=sf_dps)
```

Step 1 for #885
Pull Request resolved: #888
Reviewed By: VitalyFedyunin, NivekT
Differential Revision: D41099171
Pulled By: ejguan
fbshipit-source-id: d9d6e7beb498fea3921d8a3a1020649dd3955ce2
ejguan added a commit to ejguan/data that referenced this issue (Jan 17, 2023):

…vice (pytorch#801)

Summary: Fixes pytorch#885
Pull Request resolved: pytorch#801

Add support for `DataLoader2` to control randomness over the pipeline:
- Implement `SeedGenerator`
  - `spawn` to generate sub-`SeedGenerator`s for distributed workers
  - `generate_seed` to generate unique seeds
  - `generate_shared_seed` to generate distributed shared seeds
- Change the API of `ReadingService` to take a seed generator from `DataLoader2`.

The `SeedGenerator` of `DataLoader2` then becomes the source of truth for randomness within the whole data pipeline. A separate PR will add online documentation regarding determinism.

Reviewed By: NivekT
Differential Revision: D38947827
fbshipit-source-id: e1a434460b4a5d43461e982debe875808b4241db
ejguan added a commit to ejguan/data that referenced this issue (Jan 17, 2023):

Summary: Fixes pytorch#885

Add support for `DataLoader2` to control randomness over the pipeline:
- Implement `SeedGenerator`
  - `spawn` to generate sub-`SeedGenerator`s for distributed workers
  - `generate_seed` to generate unique seeds
  - `generate_shared_seed` to generate distributed shared seeds
- Change the API of `ReadingService` to take a seed generator from `DataLoader2`.

The `SeedGenerator` of `DataLoader2` then becomes the source of truth for randomness within the whole data pipeline. A separate PR will add online documentation regarding determinism.

Last step for pytorch#885
Pull Request resolved: pytorch#801
Reviewed By: NivekT
Differential Revision: D38947827
Pulled By: ejguan
fbshipit-source-id: 006bf17cbb51b2d5a39d647ca86401b0483c7812
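The commit message above names three operations on `SeedGenerator`. A minimal, stdlib-only sketch of that design (a conceptual illustration, not the actual `torchdata` implementation, whose internals differ) could look like:

```python
import hashlib


class SeedGenerator:
    """Toy seed generator: derives seeds deterministically from a base seed.

    Seeds from `generate_seed` depend on the worker id (unique per worker);
    seeds from `generate_shared_seed` do not (identical across workers).
    """

    def __init__(self, base_seed: int, worker_id: int = 0):
        self._base_seed = base_seed
        self._worker_id = worker_id
        self._counter = 0
        self._shared_counter = 0

    def _derive(self, *parts: int) -> int:
        # Hash the base seed plus context parts into a 64-bit seed.
        data = ",".join(str(p) for p in (self._base_seed, *parts)).encode()
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "little")

    def spawn(self, worker_id: int) -> "SeedGenerator":
        # Sub-generator for one worker; shares the parent's base seed.
        return SeedGenerator(self._base_seed, worker_id)

    def generate_seed(self) -> int:
        # Unique per worker and per call, yet fully reproducible.
        self._counter += 1
        return self._derive(0, self._worker_id, self._counter)

    def generate_shared_seed(self) -> int:
        # Independent of worker id, so every worker gets the same value.
        self._shared_counter += 1
        return self._derive(1, self._shared_counter)
```

For example, two workers spawned from the same parent agree on `generate_shared_seed()` (used for the shuffle before `sharding_filter`) but produce different `generate_seed()` values (used for random ops after it).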
🐛 Describe the bug

Current state of determinism

Using `DataLoader2` + `PrototypeMultiProcessingReadingService` as an example: `torch`, `numpy`, and Python's `random` each get a different process-local seed in every subprocess (link).

Additional feature
For step 2 in the last section, we set the same shuffle seed across distributed/multiprocessing workers because we want to make sure the shuffled data can be sharded in a mutually exclusive and collectively exhaustive manner.
An additional feature is needed to make sure all random operations after `sharding_filter` have different seeds across workers, to fully preserve data randomization. Let's say we have a pipeline as follows:
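The original pipeline snippet appears to have been lost during extraction. Based on the surrounding discussion, the shape is shuffle → `sharding_filter` → shuffle; a stdlib-only simulation of that pipeline (hypothetical helper names, not torchdata APIs) shows why the two shuffles need different seeding policies:

```python
import random


def run_worker(worker_id, num_workers, shared_seed, worker_seed):
    """Simulate shuffle -> sharding_filter -> shuffle for one worker."""
    data = list(range(12))
    random.Random(shared_seed).shuffle(data)   # first shuffle: same seed in every worker
    shard = data[worker_id::num_workers]       # sharding_filter: round-robin shard
    random.Random(worker_seed).shuffle(shard)  # second shuffle: per-worker seed
    return shard


shards = [run_worker(w, 2, shared_seed=0, worker_seed=100 + w) for w in range(2)]
# Because the first shuffle used a shared seed, the shards are mutually
# exclusive and collectively exhaustive:
assert sorted(shards[0] + shards[1]) == list(range(12))
```

If the first shuffle were seeded differently per worker, the workers would shard different orderings and elements could be duplicated or dropped; the second shuffle, by contrast, must differ per worker so each shard is independently randomized.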
We will have the random state shared for the first `shuffle`, but different states for the second `shuffle`. And those states should be generated in a deterministic manner so we will be able to reproduce them.

Versions
main branch
cc: @msaroufim @VitalyFedyunin