Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env #827

Closed
wants to merge 9 commits

Conversation

@ejguan (Contributor) commented Oct 12, 2022

This PR temporarily extends PrototypingMultiProcessingReadingService to fully control the determinism of the pipeline in the combinations of:

  • Single/Multi-processing
  • Distributed/Non-distributed

Once SequentialReadingService is ready to combine DistributedReadingService and PrototypingMultiProcessingReadingService, some of this code should be removed. And, for the in-process reading service, we still need a way to isolate the global RNGs so that the data pipeline does not interfere with the model's randomness.

For the multiprocessing case, it sets the same random seed for Shuffler and sets different deterministic seeds for the global RNGs (python.random, torch, and numpy) within each subprocess.
For the distributed case, it shares the same random seed for Shuffler across all distributed processes to guarantee the same shuffle order before sharding.
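Roughly, the seeding scheme looks like the sketch below. This is illustrative only; the helper names `share_shuffle_seed` and `seed_worker_rngs` are hypothetical and not part of the torchdata API.

```python
import random

import numpy as np
import torch
import torch.distributed as dist


def share_shuffle_seed() -> int:
    """Draw one seed on rank 0 and broadcast it, so every rank shuffles
    in the same order before sharding."""
    seed = torch.empty((), dtype=torch.int64)
    if not (dist.is_available() and dist.is_initialized()):
        return int(seed.random_())
    if dist.get_rank() == 0:
        seed.random_()
    dist.broadcast(seed, src=0)
    return int(seed)


def seed_worker_rngs(base_seed: int, worker_id: int) -> None:
    """Give each worker process distinct, deterministic seeds for the
    global RNGs (random, torch, numpy)."""
    worker_seed = base_seed + worker_id
    random.seed(worker_seed)
    torch.manual_seed(worker_seed)
    np.random.seed(worker_seed % (2**32))
```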

Tests:
All tests are executed in the combinations of the environments above.

  • Validate that the same seed generates the same order of data
  • Validate that different seeds generate different orders of data
  • Validate that the data yielded by each worker after shuffling and sharding is mutually exclusive and collectively exhaustive, with and without a manual seed (see the sketch after this list)

There is one missing test I will add tomorrow:

  • Validate that subprocess-local global RNGs (random, torch, and numpy) are properly seeded with different values
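A minimal sketch of the mutual-exclusivity/collective-exhaustiveness check (illustrative only, not the actual test added in this PR):

```python
def assert_mutually_exclusive_and_exhaustive(per_worker_outputs, full_dataset):
    # Flatten everything each worker yielded after shuffle + sharding.
    seen = [item for output in per_worker_outputs for item in output]
    # Mutually exclusive: no element shows up in more than one worker.
    assert len(seen) == len(set(seen)), "workers produced overlapping elements"
    # Collectively exhaustive: together the workers cover the whole dataset.
    assert set(seen) == set(full_dataset), "workers missed some elements"
```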

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 12, 2022
@facebook-github-bot (Contributor) commented: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

):
pass


-def SpawnProcessForDataPipeline(multiprocessing_ctx, datapipe, call_locally_fn=None, call_on_reset_epoch=None):
+def SpawnProcessForDataPipeline(multiprocessing_ctx, datapipe, call_on_process_init=None, call_on_epoch_reset=None):
ejguan (Contributor, Author): I changed these argument names to clarify their functionality.

@@ -174,38 +188,72 @@ def __init__(
         self.multiprocessing_context = multiprocessing_context
         self.processes = []
         self.datapipes = []
-        self.combined_datapipes = None
+        self.end_datapipe = None
ejguan (Contributor, Author): I changed it to end_datapipe because we need to store the last DataPipe for both the in-process and multiprocessing cases.

)

# Multiprocessing (num_workers > 0)
if isinstance(self.end_datapipe, _IterateQueueDataPipes):
Contributor: This will merge-conflict with my prefetcher fix.

ejguan (Contributor, Author): I will rebase once your PR lands. I still need to add a test for process-local RNGs.

@facebook-github-bot (Contributor) commented: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@VitalyFedyunin (Contributor) left a comment: LGTM

@facebook-github-bot (Contributor) commented: @ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ejguan added a commit to ejguan/data that referenced this pull request Oct 21, 2022
Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env (pytorch#827)

Summary:
This PR temporarily extends `PrototypingMultiProcessingReadingService` to fully control the determinism of the pipeline in the combinations of:
- Single/Multi-processing
- Distributed/Non-distributed
Once `SequentialReadingService` is ready to combine `DistributedReadingService` and `PrototypingMultiProcessingReadingService`, some of this code should be removed. And, for the in-process reading service, we still need a way to isolate the global RNGs so that the data pipeline does not interfere with the model's randomness.

For the multiprocessing case, it sets the same random seed for `Shuffler` and sets different deterministic seeds for the global RNGs (`python.random`, `torch`, and `numpy`) within each subprocess.
For the distributed case, it shares the same random seed for `Shuffler` across all distributed processes to guarantee the same shuffle order before sharding.

Test Plan:
All tests are executed in the combinations of the environments above.
- [x] Validate that the same seed generates the same order of data
- [x] Validate that different seeds generate different orders of data
- [x] Validate that the data yielded by each worker after shuffling and sharding is mutually exclusive and collectively exhaustive, with and without a manual seed

There is one missing test I will add tomorrow:
- [x] Validate that subprocess-local global RNGs (`random`, `torch`, and `numpy`) are properly seeded with different values

Pull Request resolved: pytorch#827

Reviewed By: VitalyFedyunin, NivekT

Differential Revision: D40323946

Pulled By: ejguan

fbshipit-source-id: 2997d6d5dce87a6c38d5ebdf64a00f9769bb18fa
ejguan added a commit that referenced this pull request Oct 21, 2022
Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env (#827)
(Same commit message as above.)
ejguan added a commit to ejguan/data that referenced this pull request Oct 23, 2022
Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env (pytorch#827)
(Same commit message as above.)