Standalone Lite: DDP Spawn Strategy Family #14675
Co-authored-by: Carlos Mocholí <[email protected]>
@@ -86,8 +86,7 @@ def _wrapping_function(
         return_queue: SimpleQueue,
         global_states: Optional[_GlobalStateSnapshot] = None,
     ) -> None:
-        # TODO(lite): Update worker setup once TPUSpawn strategy is in Lite
-        self._strategy._worker_setup(process_idx)
+        self._strategy._local_rank = process_idx
@awaelchli I believe this is not correct, as it no longer sets XLAStrategy._launched = True, and #14926 then fails with:
  File "/home/runner/work/lightning/tests/tests_lite/strategies/test_xla.py", line 17, in broadcast_on_tpu_fn
    result = strategy.broadcast(obj)
  File "/home/runner/work/lightning/src/lightning_lite/strategies/xla.py", line 146, in broadcast
    data_tensor = torch.tensor(data, device=self.root_device, dtype=torch.float)
  File "/home/runner/work/lightning/src/lightning_lite/strategies/xla.py", line 72, in root_device
    raise RuntimeError("Accessing the XLA device before processes have spawned is not allowed.")
I'm a bit confused about the best way to do this.
Should we call strategy.setup_environment()?
If yes:
- This does not set the local rank. Should it, or do we still set the local rank manually?
- Should this also be done for the other Lite launchers?
Also, why did Lite remove _worker_setup? The logic being different between the PL and Lite strategies is confusing.
This blocks #14926
This was done in reaction to your comment: #11073 (comment)
I think it is correct; otherwise many of the DDP spawn tests would fail, and TPU spawn is not fundamentally different in how the local rank gets set.
I commented on #14926 that maybe all that is missing in the test is a strategy.setup_environment().
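To make the failure mode concrete, here is a minimal sketch of the guard behind the traceback above. `ToyXLAStrategy` and `broadcast_fn` are hypothetical stand-ins, not the real `lightning_lite` classes, and the sketch assumes that `setup_environment()` is the call that flips the `_launched` flag.

```python
class ToyXLAStrategy:
    """Hypothetical stand-in for illustration; not the real XLAStrategy."""

    def __init__(self) -> None:
        self._launched = False

    @property
    def root_device(self) -> str:
        # Same guard as in the traceback: the device is off limits until launch.
        if not self._launched:
            raise RuntimeError(
                "Accessing the XLA device before processes have spawned is not allowed."
            )
        return "xla:0"  # placeholder device name, an assumption of this sketch

    def setup_environment(self) -> None:
        # Assumption: this is where the flag gets set inside the worker process.
        self._launched = True

    def broadcast(self, obj: float) -> float:
        _ = self.root_device  # touching the device triggers the guard
        return obj


def broadcast_fn(strategy: ToyXLAStrategy, call_setup: bool) -> float:
    """Toy version of a test function that runs inside a spawned worker."""
    if call_setup:
        strategy.setup_environment()
    return strategy.broadcast(1.0)
```

Under these assumptions, `broadcast_fn(ToyXLAStrategy(), call_setup=False)` raises the same `RuntimeError`, while passing `call_setup=True` lets the broadcast go through.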
My motivation with #11073 was always to simplify these things so that these questions wouldn't come up in the first place. But nobody wants to merge it lol; I have already posted it 3 times in waiting pr over the last 5 months or so.
> This does not set the local rank. Should it? or do we still manually set the local rank?
For the multiprocessing launcher, the local rank can only come from the launcher directly. So the answer here is no.
> Should this also be done for the other Lite launchers?
If #11073 lands, both code paths would be identical in this regard.
> Also, why did Lite remove _worker_setup? The logic being different between PL and Lite strategies is confusing.
If #11073 lands, both code paths would be identical in this regard.
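As a rough illustration of why the local rank has to come from the launcher: only the launcher knows which process index it handed to each worker. The sketch below uses hypothetical `ToyStrategy` and `ToyLauncher` classes and calls the workers sequentially in one process instead of spawning them, so it shows the data flow only, not real multiprocessing.

```python
from typing import Callable, List, Optional


class ToyStrategy:
    """Hypothetical strategy; it cannot know its local rank on its own."""

    def __init__(self) -> None:
        self._local_rank: Optional[int] = None

    @property
    def local_rank(self) -> Optional[int]:
        return self._local_rank


class ToyLauncher:
    """Hypothetical launcher; runs the workers in a loop instead of spawning."""

    def __init__(self, strategy_factory: Callable[[], ToyStrategy], num_processes: int) -> None:
        self._strategy_factory = strategy_factory
        self._num_processes = num_processes

    def launch(self, function: Callable[[ToyStrategy], int]) -> List[int]:
        # A real spawn launcher would start one process per index here.
        return [
            self._wrapping_function(process_idx, function)
            for process_idx in range(self._num_processes)
        ]

    def _wrapping_function(self, process_idx: int, function: Callable[[ToyStrategy], int]) -> int:
        # Each real worker would hold its own copy of the strategy.
        strategy = self._strategy_factory()
        # The launcher is the only place where process_idx is known.
        strategy._local_rank = process_idx
        return function(strategy)
```

With three workers, `ToyLauncher(ToyStrategy, 3).launch(lambda s: s.local_rank)` returns one rank per worker, each assigned by the launcher rather than by the strategy.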
What does this PR do?
Adds the DDP Spawn family of strategies supported in Lite.
Very similar to #14670, but using the spawn launchers.
Note: In the future, the spawn strategies will merge together with their non-spawn versions, as most logic related to process creation has been factored out to the launchers already.