Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env #827

Closed
wants to merge 9 commits

Conversation

@ejguan (Contributor) commented Oct 12, 2022

This PR temporarily extends PrototypingMultiProcessingReadingService to fully control the determinism of the pipeline in the combinations of:

  • Single/Multi-processing
  • Distributed/Non-distributed

Once SequentialReadingService is ready to combine DistributedReadingService and PrototypingMultiProcessingReadingService, some of this code should be removed. And, for the in-process reading service, we still need a way to isolate the global RNGs so that the data pipeline does not interfere with the model's randomness.

For the multiprocessing case, it sets the same random seed for Shuffler and sets different deterministic seeds for the global RNGs (python.random, torch, and numpy) within each subprocess.
For the distributed case, it shares the same random seed for Shuffler across all distributed processes to guarantee the same shuffle order before sharding.
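Roughly, the seeding scheme looks like the sketch below. This is illustrative only; the helper names `share_shuffle_seed` and `seed_worker_rngs` are hypothetical and not part of the torchdata API.

```python
import random

import numpy as np
import torch
import torch.distributed as dist


def share_shuffle_seed() -> int:
    """Draw one seed on rank 0 and broadcast it, so every rank shuffles
    in the same order before sharding."""
    seed = torch.empty((), dtype=torch.int64)
    if not (dist.is_available() and dist.is_initialized()):
        return int(seed.random_())
    if dist.get_rank() == 0:
        seed.random_()
    dist.broadcast(seed, src=0)
    return int(seed)


def seed_worker_rngs(base_seed: int, worker_id: int) -> None:
    """Give each worker process distinct, deterministic seeds for the
    global RNGs (random, torch, numpy)."""
    worker_seed = base_seed + worker_id
    random.seed(worker_seed)
    torch.manual_seed(worker_seed)
    np.random.seed(worker_seed % (2**32))
```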

Tests:
All tests are executed in the combinations of the environments above.

  • Validate that the same seed generates the same order of data
  • Validate that different seeds generate different orders of data
  • Validate that the data yielded by each worker after shuffling and sharding is mutually exclusive and collectively exhaustive, with and without a manual seed (see the sketch after this list)

There is one missing test I will add tomorrow:

  • Validate that subprocess-local global RNGs (random, torch, and numpy) are properly seeded with different values
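A minimal sketch of the mutual-exclusivity/collective-exhaustiveness check (illustrative only, not the actual test added in this PR):

```python
def assert_mutually_exclusive_and_exhaustive(per_worker_outputs, full_dataset):
    # Flatten everything each worker yielded after shuffle + sharding.
    seen = [item for output in per_worker_outputs for item in output]
    # Mutually exclusive: no element shows up in more than one worker.
    assert len(seen) == len(set(seen)), "workers produced overlapping elements"
    # Collectively exhaustive: together the workers cover the whole dataset.
    assert set(seen) == set(full_dataset), "workers missed some elements"
```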

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 12, 2022
@facebook-github-bot (Contributor) commented: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

):
pass


-def SpawnProcessForDataPipeline(multiprocessing_ctx, datapipe, call_locally_fn=None, call_on_reset_epoch=None):
+def SpawnProcessForDataPipeline(multiprocessing_ctx, datapipe, call_on_process_init=None, call_on_epoch_reset=None):
ejguan (Contributor, Author): I changed these argument names to clarify their functionality.

@@ -174,38 +188,72 @@ def __init__(
         self.multiprocessing_context = multiprocessing_context
         self.processes = []
         self.datapipes = []
-        self.combined_datapipes = None
+        self.end_datapipe = None
ejguan (Contributor, Author): I changed it to end_datapipe because we need to store the last DataPipe for both the in-process and multiprocessing cases.

)

# Multiprocessing (num_workers > 0)
if isinstance(self.end_datapipe, _IterateQueueDataPipes):
Contributor: This will merge-conflict with my prefetcher fix.

ejguan (Contributor, Author): I will rebase once your PR lands. I still need to add a test for process-local RNGs.

@facebook-github-bot (Contributor) commented: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@VitalyFedyunin (Contributor) left a comment: LGTM

@facebook-github-bot (Contributor) commented: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ejguan added a commit to ejguan/data that referenced this pull request Oct 21, 2022
Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env (pytorch#827)

Summary:
This PR temporarily extends `PrototypingMultiProcessingReadingService` to fully control the determinism of the pipeline in the combinations of:
- Single/Multi-processing
- Distributed/Non-distributed
Once `SequentialReadingService` is ready to combine `DistributedReadingService` and `PrototypingMultiProcessingReadingService`, some of this code should be removed. And, for the in-process reading service, we still need a way to isolate the global RNGs so that the data pipeline does not interfere with the model's randomness.

For the multiprocessing case, it sets the same random seed for `Shuffler` and sets different deterministic seeds for the global RNGs (`python.random`, `torch`, and `numpy`) within each subprocess.
For the distributed case, it shares the same random seed for `Shuffler` across all distributed processes to guarantee the same shuffle order before sharding.

Test Plan:
All tests are executed in the combinations of the environments above.
- [x] Validate that the same seed generates the same order of data
- [x] Validate that different seeds generate different orders of data
- [x] Validate that the data yielded by each worker after shuffling and sharding is mutually exclusive and collectively exhaustive, with and without a manual seed

There is one missing test I will add tomorrow:
- [x] Validate that subprocess-local global RNGs (`random`, `torch`, and `numpy`) are properly seeded with different values

Pull Request resolved: pytorch#827

Reviewed By: VitalyFedyunin, NivekT

Differential Revision: D40323946

Pulled By: ejguan

fbshipit-source-id: 2997d6d5dce87a6c38d5ebdf64a00f9769bb18fa
ejguan added a commit that referenced this pull request Oct 21, 2022
Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env (#827)
(Same commit message as above.)
ejguan added a commit to ejguan/data that referenced this pull request Oct 23, 2022
Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env (pytorch#827)
(Same commit message as above.)