Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standalone Lite: DDP Spawn Strategy Family #14675

Merged
merged 70 commits into from
Sep 15, 2022
Merged

Conversation

awaelchli
Copy link
Contributor

What does this PR do?

Adds the DDP Spawn family of strategies supported in Lite.
Very similar to #14670 , but using the spawn launchers.

Note: In the future, the spawn strategies will merge together with their non-spawn versions, as most logic related to process creation has been factored out to the launchers already.

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

awaelchli and others added 30 commits September 7, 2022 22:25
@mergify mergify bot removed the ready PRs ready to be merged label Sep 14, 2022
@mergify mergify bot added ready PRs ready to be merged has conflicts and removed has conflicts ready PRs ready to be merged labels Sep 15, 2022
@mergify mergify bot added ready PRs ready to be merged and removed has conflicts ready PRs ready to be merged labels Sep 15, 2022
src/lightning_lite/strategies/fairscale.py Outdated Show resolved Hide resolved
src/lightning_lite/strategies/fairscale.py Show resolved Hide resolved
src/lightning_lite/strategies/fairscale.py Outdated Show resolved Hide resolved
@awaelchli awaelchli enabled auto-merge (squash) September 15, 2022 10:36
@awaelchli awaelchli merged commit d3dcd68 into master Sep 15, 2022
@awaelchli awaelchli deleted the lite/strategies-spawn branch September 15, 2022 10:51
@@ -86,8 +86,7 @@ def _wrapping_function(
return_queue: SimpleQueue,
global_states: Optional[_GlobalStateSnapshot] = None,
) -> None:
# TODO(lite): Update worker setup once TPUSpawn strategy is in Lite
self._strategy._worker_setup(process_idx)
self._strategy._local_rank = process_idx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@awaelchli I believe this is not correct, as this no longer sets XLAStrategy._launched=True and then #14926 fails with

  File "/home/runner/work/lightning/tests/tests_lite/strategies/test_xla.py", line 17, in broadcast_on_tpu_fn
    result = strategy.broadcast(obj)
  File "/home/runner/work/lightning/src/lightning_lite/strategies/xla.py", line 146, in broadcast
    data_tensor = torch.tensor(data, device=self.root_device, dtype=torch.float)
  File "/home/runner/work/lightning/src/lightning_lite/strategies/xla.py", line 72, in root_device
    raise RuntimeError("Accessing the XLA device before processes have spawned is not allowed.")

I'm a bit confused about what's the best way to do this.

Do strategy.setup_environment()?

If yes:

  • This does not set the local rank. Should it? or do we still manually set the local rank?
  • Should this also be done for the other Lite launchers?

Also, why did Lite remove _worker_setup? The logic being different between PL and Lite strategies is confusing.

This blocks #14926

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done in reaction to your comment: #11073 (comment)
I think it is correct, otherwise many many tests for ddp spawn would fail, and tpu spawn is not fundamentally different regarding this local rank business.

I commented on #14926 that maybe all that is missing in the test is a strategy.setup_environment().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My motivation with #11073 was always to simplify these things so that these questions wouldn't come up in the first place. But nobody wants to merge it lol, already posted 3x times in waiting pr over the last 5 months or so.

This does not set the local rank. Should it? or do we still manually set the local rank?

For the multiprocessing launcher, the information of local rank can only come from the launcher directly. So the answer here is no.

Should this also be done for the other Lite launchers?

If #11073 lands both codes would be identical in this regard.

Also, why did Lite remove _worker_setup? The logic being different between PL and Lite strategies is confusing.

If #11073 lands both codes would be identical in this regard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fabric lightning.fabric.Fabric priority: 0 High priority task ready PRs ready to be merged refactor strategy
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants