Determinism of local shuffle/random_op after sharding_filter #885

Closed
ejguan opened this issue Nov 4, 2022 · 0 comments
Comments

@ejguan
Contributor

ejguan commented Nov 4, 2022

🐛 Describe the bug

Current state of determinism

Using DataLoader2 + PrototypeMultiProcessingReadingService as an example:

  1. Before each iteration starts, a distributed shared seed is generated (link).
  2. With multiprocessing, at the beginning of each iteration, each subprocess resets all shuffle operations to the same random seed, derived from the distributed shared seed in step 1 (link).
  3. torch, numpy, and python.random each get a different process-local seed in every subprocess (link).
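
As a rough illustration of how such seeds can be derived deterministically (a minimal sketch; `derive_seed` is a hypothetical helper, not a torchdata API):

```py
import hashlib

def derive_seed(base_seed: int, *keys: int) -> int:
    # Deterministically derive a child seed by hashing the base seed
    # together with integer keys (e.g. a purpose tag and a worker id).
    h = hashlib.sha256()
    for v in (base_seed, *keys):
        h.update(v.to_bytes(8, "little", signed=False))
    return int.from_bytes(h.digest()[:8], "little")

shared_seed = 12345  # step 1: broadcast to all ranks/workers

# Step 2: every worker seeds its pre-sharding shuffles identically.
shuffle_seed = derive_seed(shared_seed, 0)

# Step 3: torch/numpy/random get a process-local seed per worker.
worker_id = 3  # e.g. rank * num_workers + local_worker_id
local_seed = derive_seed(shared_seed, 1, worker_id)
```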

Additional feature

For step 2 in the previous section, we set the same shuffle seed across distributed/multiprocessing workers because we want the shuffled data to be sharded in a mutually exclusive and collectively exhaustive manner.
An additional feature is needed to make sure all random operations after `sharding_filter` have different seeds across workers, so that full data randomization is preserved.

Let's say we have a pipeline such as:

```py
data_source.shuffle().sharding_filter().map(fn).batch(8).shuffle()
```

The random state will be shared across workers for the first shuffle, but different for the second shuffle. Both sets of states should be generated deterministically so that the run is reproducible.
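
For concreteness, a runnable version of that pipeline might look like the following (a sketch; `IterableWrapper`, the data range, and `fn` are placeholders):

```py
from torchdata.datapipes.iter import IterableWrapper

def fn(x):
    return x * 2  # placeholder transform

dp = (
    IterableWrapper(range(64))
    .shuffle()          # seeded identically across workers (shared seed)
    .sharding_filter()  # each worker keeps a disjoint shard
    .map(fn)
    .batch(8)
    .shuffle()          # should be seeded differently per worker
)
```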

Versions

main branch

cc: @msaroufim @VitalyFedyunin

facebook-github-bot pushed a commit that referenced this issue Nov 11, 2022
Summary:
Add a `list_dps` function to list `DataPipes` from the graph.
- It's similar to [`get_all_graph_pipes`](https://github.com/pytorch/pytorch/blob/896fa8c5c9b0191c9621e04ab5e20057614d48ad/torch/utils/data/graph_settings.py#L19) from pytorch core
- It takes an extra argument, `exclude_dps`, to exclude a `DataPipe` and its prior graph from the result.

Reason to add this function:
- It's required to set random states differently for DataPipes before/after `sharding_filter`
```py
# Module paths are assumed here; adjust for your torchdata version.
from torchdata.dataloader2.graph import find_dps, list_dps, traverse_dps
from torchdata.datapipes.iter import ShardingFilter

# datapipe: the terminal DataPipe of the pipeline
graph = traverse_dps(datapipe)
sf_dps = find_dps(graph, ShardingFilter)

# DataPipes prior to `sharding_filter`
# (traverse_dps(sf_dp) includes sf_dp itself)
p_dps = []
for sf_dp in sf_dps:
    p_dps.extend(list_dps(traverse_dps(sf_dp)))

# DataPipes after `sharding_filter`
a_dps = list_dps(graph, exclude_dps=sf_dps)
```

Step 1 for #885

Pull Request resolved: #888

Reviewed By: VitalyFedyunin, NivekT

Differential Revision: D41099171

Pulled By: ejguan

fbshipit-source-id: d9d6e7beb498fea3921d8a3a1020649dd3955ce2
ejguan added a commit to ejguan/data that referenced this issue Jan 17, 2023
Summary:
Fixes pytorch#885

Add support for `DataLoader2` to control randomness over the pipeline:
- Implement `SeedGenerator`
  - `spawn` to generate sub-SeedGenerators for distributed workers
  - `generate_seed` to generate unique seeds
  - `generate_shared_seed` to generate distributed shared seeds
- Change the API of `ReadingService` to take a seed generator from `DataLoader2`. The `SeedGenerator` of `DataLoader2` then becomes the source of truth for randomness within the whole data pipeline.

A separate PR will be added for the online docs regarding determinism.
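
As a rough sketch of how this seed generator could be used (the constructor signature and the `worker_id` argument are assumptions, not the confirmed API; only the method names come from the summary above):

```py
# Hypothetical usage of the SeedGenerator described above.
seed_gen = SeedGenerator(seed=42)           # source of truth in DataLoader2

shared = seed_gen.generate_shared_seed()    # same on every rank: seeds the
                                            # shuffles before sharding_filter

worker_gen = seed_gen.spawn(worker_id=3)    # per-worker sub-generator
local = worker_gen.generate_seed()          # unique seed for random ops
                                            # after sharding_filter
```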

Last step for pytorch#885

Pull Request resolved: pytorch#801

Reviewed By: NivekT

Differential Revision: D38947827

Pulled By: ejguan

fbshipit-source-id: 006bf17cbb51b2d5a39d647ca86401b0483c7812