Standalone Lite: DDP Spawn Strategy Family #14675

awaelchli · 2022-09-12T22:59:47Z

What does this PR do?

Adds the DDP Spawn family of strategies supported in Lite.
Very similar to #14670, but using the spawn launchers.

Note: In the future, the spawn strategies will merge together with their non-spawn versions, as most logic related to process creation has been factored out to the launchers already.

Before submitting

Was this discussed/approved via a GitHub issue? (not for typos and docs)
Did you read the contributor guideline, Pull Request section?
Did you make sure your PR does only one thing, instead of bundling different changes together?
Did you make sure to update the documentation with your changes? (if necessary)
Did you write any new necessary tests? (not for typos and docs)
Did you verify new and existing tests pass locally with your changes?
Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:

Is this pull request ready for review? (if not, please submit in draft mode)
Check that all items from Before submitting are resolved
Make sure the title is self-explanatory and the description concisely explains the PR
Add labels and milestones (and optionally projects) to the PR so it can be classified

Did you have fun?

I made sure I had fun coding 🙃

carmocca · 2022-09-29T00:55:37Z

src/lightning_lite/strategies/launchers/xla.py

@@ -86,8 +86,7 @@ def _wrapping_function(
        return_queue: SimpleQueue,
        global_states: Optional[_GlobalStateSnapshot] = None,
    ) -> None:
-        # TODO(lite): Update worker setup once TPUSpawn strategy is in Lite
-        self._strategy._worker_setup(process_idx)
+        self._strategy._local_rank = process_idx


@awaelchli I believe this is not correct: it no longer sets XLAStrategy._launched = True, so #14926 fails with:

      File "/home/runner/work/lightning/tests/tests_lite/strategies/test_xla.py", line 17, in broadcast_on_tpu_fn
        result = strategy.broadcast(obj)
      File "/home/runner/work/lightning/src/lightning_lite/strategies/xla.py", line 146, in broadcast
        data_tensor = torch.tensor(data, device=self.root_device, dtype=torch.float)
      File "/home/runner/work/lightning/src/lightning_lite/strategies/xla.py", line 72, in root_device
        raise RuntimeError("Accessing the XLA device before processes have spawned is not allowed.")

I'm not sure what the best way to handle this is.

Should we call strategy.setup_environment()?

If yes:

This does not set the local rank. Should it? Or do we still manually set the local rank?

Should this also be done for the other Lite launchers?

Also, why did Lite remove _worker_setup? The logic being different between PL and Lite strategies is confusing.

This blocks #14926
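The failure mode raised in this comment can be illustrated with a minimal, self-contained sketch. Everything here is a hypothetical stand-in: `ToyXLAStrategy` is not the real `XLAStrategy`, the device is a plain string rather than a torch device, and the body of `_worker_setup` is only assumed from the comment's statement that the old call also set `_launched = True`.

```python
class ToyXLAStrategy:
    """Hypothetical stand-in for the strategy under discussion (not the real class)."""

    def __init__(self) -> None:
        self._local_rank = 0
        self._launched = False

    @property
    def root_device(self) -> str:
        # Same guard as the one quoted in the traceback above.
        if not self._launched:
            raise RuntimeError("Accessing the XLA device before processes have spawned is not allowed.")
        return f"xla:{self._local_rank}"  # placeholder string, not a real device object

    def _worker_setup(self, process_idx: int) -> None:
        # Assumed behavior of the removed helper: record the rank and mark as launched.
        self._local_rank = process_idx
        self._launched = True


def wrapping_function_before(strategy: ToyXLAStrategy, process_idx: int) -> None:
    # Removed line in the diff: delegates to the worker setup helper.
    strategy._worker_setup(process_idx)


def wrapping_function_after(strategy: ToyXLAStrategy, process_idx: int) -> None:
    # Added line in the diff: only the local rank is assigned.
    strategy._local_rank = process_idx


def describe_root_device(strategy: ToyXLAStrategy) -> str:
    # Helper for the demo: report either the device or the error message.
    try:
        return strategy.root_device
    except RuntimeError as err:
        return f"RuntimeError: {err}"


if __name__ == "__main__":
    old, new = ToyXLAStrategy(), ToyXLAStrategy()
    wrapping_function_before(old, 1)
    wrapping_function_after(new, 1)
    print(describe_root_device(old))
    print(describe_root_device(new))
```

In this toy version, both variants record the rank, but only the old one leaves the strategy in a state where `root_device` can be read without further setup.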

awaelchli · 2022-09-29

This was done in reaction to your comment: #11073 (comment)
I think it is correct; otherwise many, many DDP spawn tests would fail, and TPU spawn is not fundamentally different when it comes to setting the local rank.

I commented on #14926 that maybe all the test is missing is a call to strategy.setup_environment().

My motivation with #11073 was always to simplify these things so that these questions wouldn't come up in the first place. But nobody wants to merge it lol; I've already posted it 3 times in waiting pr over the last 5 months or so.

awaelchli · 2022-09-29

> This does not set the local rank. Should it? Or do we still manually set the local rank?

No. With the multiprocessing launcher, the local rank can only come from the launcher itself.

> Should this also be done for the other Lite launchers?

If #11073 lands, the two implementations would be identical in this regard.

> Also, why did Lite remove _worker_setup? The logic being different between PL and Lite strategies is confusing.

If #11073 lands, the two implementations would be identical in this regard.
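The division of responsibilities proposed in this thread (the launcher supplies the local rank, and the worker then calls `strategy.setup_environment()`) can be sketched as follows. This is an illustration under stated assumptions, not the Lite implementation: `ToyStrategy` and `ToyLauncher` are hypothetical names, the launcher runs workers sequentially in one process instead of spawning real processes, and `setup_environment` is assumed to be the call that marks the strategy as launched.

```python
from typing import Callable, List


class ToyStrategy:
    """Hypothetical strategy; only models the two attributes discussed in the thread."""

    def __init__(self) -> None:
        self._local_rank = 0
        self._launched = False

    def setup_environment(self) -> None:
        # Assumption: this marks the strategy as launched and does NOT set the local rank.
        self._launched = True

    @property
    def root_device(self) -> str:
        if not self._launched:
            raise RuntimeError("Accessing the XLA device before processes have spawned is not allowed.")
        return f"xla:{self._local_rank}"  # placeholder string, not a real device object


class ToyLauncher:
    """Hypothetical launcher that simulates workers in-process to stay runnable anywhere."""

    def __init__(self, strategy_factory: Callable[[], ToyStrategy], num_processes: int) -> None:
        self._strategy_factory = strategy_factory
        self._num_processes = num_processes

    def launch(self, function: Callable[[ToyStrategy], str]) -> List[str]:
        results = []
        for process_idx in range(self._num_processes):
            # Each simulated worker gets its own strategy copy, as a spawned process would.
            strategy = self._strategy_factory()
            results.append(self._wrapping_function(process_idx, strategy, function))
        return results

    @staticmethod
    def _wrapping_function(process_idx: int, strategy: ToyStrategy, function: Callable[[ToyStrategy], str]) -> str:
        # Only the launcher knows the process index, so it assigns the local rank.
        strategy._local_rank = process_idx
        return function(strategy)


def worker(strategy: ToyStrategy) -> str:
    # The worker (or a test body) finishes setup before touching the device.
    strategy.setup_environment()
    return strategy.root_device


if __name__ == "__main__":
    print(ToyLauncher(ToyStrategy, num_processes=3).launch(worker))
```

A worker that skips `setup_environment()` in this sketch hits the same `RuntimeError` as in the traceback quoted earlier in the thread, which is consistent with the suggestion that the test in #14926 may only be missing that call.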

awaelchli and others added 30 commits September 7, 2022 22:25

add accelerator implementations to lite

fe59302

[pre-commit.ci] auto fixes from pre-commit.com hooks

7271f94

for more information, see https://pre-commit.ci

fix imports

b6de11f

rename registry argument

2ef04e6

fix test

9bbaf4f

fix tests

48bc1e8

Merge branch 'master' into lite/accelerators3

0cf9651

remove duplicated test

dc09055

[pre-commit.ci] auto fixes from pre-commit.com hooks

6a14975

fix tests

e6d619c

deprecation

9055717

deprecations

f016626

flake8

084bc6f

fixes

9c19b48

add mps to runif

3d09dac

fix tests

7a5a740

[pre-commit.ci] auto fixes from pre-commit.com hooks

de78087

Apply suggestions from code review

48ef646

Co-authored-by: Carlos Mocholí

[pre-commit.ci] auto fixes from pre-commit.com hooks

6d60b96

remove more

4e018c4

[pre-commit.ci] auto fixes from pre-commit.com hooks

983a6d7

local import

2220350

Merge remote-tracking branch 'origin/lite/accelerators' into lite/accelerators3

cfce27e

undo device stats :(

4ba5809

fix import

231d8c3

stupid typehints

6e1f03a

[pre-commit.ci] auto fixes from pre-commit.com hooks

1505eb4

Merge branch 'master' into lite/accelerators

334e3cf

more refactors :(

e832e67

Merge remote-tracking branch 'origin/lite/accelerators' into lite/accelerators3

a90ef22

mergify bot removed the "ready" (PRs ready to be merged) label Sep 14, 2022

awaelchli added 7 commits September 15, 2022 02:22

remove deprecated pg backend logic

07a2956

wip

3685ba1

integrate changes from #11073

d4f3a54

rename TPUSpawnStrategy to XLAStrategy

7d4f5b6

add back missing method

2fd9d73

Merge branch 'master' into lite/strategies-spawn

5236eef

isort

b5dd25d

mergify bot added the "ready" (PRs ready to be merged) and "has conflicts" labels and removed the "has conflicts" and "ready" labels Sep 15, 2022

awaelchli added 2 commits September 15, 2022 03:45

Merge branch 'master' into lite/strategies-spawn

0b473a6

import

78c96b1

mergify bot added the "ready" (PRs ready to be merged) label and removed the "has conflicts" and "ready" labels Sep 15, 2022

carmocca approved these changes Sep 15, 2022

src/lightning_lite/strategies/fairscale.py (outdated, resolved)

src/lightning_lite/strategies/fairscale.py (resolved)

src/lightning_lite/strategies/fairscale.py (outdated, resolved)

awaelchli and others added 3 commits September 14, 2022 22:06

Apply suggestions from code review

d7e5db9

Co-authored-by: Carlos Mocholí

Update src/lightning_lite/strategies/ddp_spawn.py

04f3f78

Co-authored-by: Carlos Mocholí

made sharded implementations identical

185dd64

awaelchli enabled auto-merge (squash) September 15, 2022 10:36

awaelchli merged commit d3dcd68 into master Sep 15, 2022

awaelchli deleted the lite/strategies-spawn branch September 15, 2022 10:51

This was referenced Sep 15, 2022

Fairscale import updates #14721

Merged

Standalone Lite: Update LightningLite #14726

Merged

carmocca reviewed Sep 29, 2022
