Support sharded optimizers outside of DDP sharded strategy #11867
Conversation
Force-pushed from db55ef2 to 180ac6a
Force-pushed from 180ac6a to 57d6ece
```python
def optimizer_state(self, optimizer: Optimizer) -> Optional[Dict[str, Tensor]]:
    """Returns state of an optimizer.

    Allows for syncing/collating optimizer state from processes in custom plugins.
    """
    if (_TORCH_GREATER_EQUAL_1_10 and isinstance(optimizer, ZeroRedundancyOptimizer)) or (
        _FAIRSCALE_AVAILABLE and isinstance(optimizer, OSS)
    ):
        optimizer.consolidate_state_dict()
        # only call state_dict on the rank where the states were consolidated
        return self._rank_zero_only_optim_state_dict(optimizer)
    else:
        return optimizer.state_dict()

@rank_zero_only
def _rank_zero_only_optim_state_dict(self, optimizer):
    """Retrieves state dict only on rank 0, which contains the entire optimizer state after
    calling :meth:`consolidate_state_dict`."""
    return optimizer.state_dict()
```
Optimizers can return a nested dict in their state dict due to the parameter groups, so the type hint here isn't correct. Let's not change the return type to be Optional either.
Suggested change (this also drops the `_rank_zero_only_optim_state_dict` helper):

```python
def optimizer_state(self, optimizer: Optimizer) -> Dict[str, Any]:
    """Returns state of an optimizer.

    Allows for syncing/collating optimizer state from processes in custom plugins.
    """
    if (_TORCH_GREATER_EQUAL_1_10 and isinstance(optimizer, ZeroRedundancyOptimizer)) or (
        _FAIRSCALE_AVAILABLE and isinstance(optimizer, OSS)
    ):
        optimizer.consolidate_state_dict()
        # only call state_dict on the rank where the states were consolidated
        return optimizer.state_dict() if self.is_global_zero else {}
    else:
        return optimizer.state_dict()
```
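For context on the type-hint point above: `Optimizer.state_dict()` in PyTorch returns a nested structure keyed by `"state"` and `"param_groups"`, not a flat `Dict[str, Tensor]`, which is why `Dict[str, Any]` is the more accurate annotation. A minimal, Lightning-free illustration:

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Take one step so the optimizer actually has per-parameter state.
model(torch.randn(8, 4)).sum().backward()
optimizer.step()

sd = optimizer.state_dict()
print(list(sd.keys()))           # ['state', 'param_groups']
print(type(sd["state"]))         # dict: per-parameter state (tensors, ints, ...)
print(type(sd["param_groups"]))  # list of dicts holding the hyperparameters
```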
Thanks for the suggestion!
One question: wouldn't this differ from the original behavior of the `optimizer_state` function of `DDPShardedStrategy`, which returns None when rank != 0?
https://github.com/PyTorchLightning/pytorch-lightning/blob/79c4e5de60685dbc895641b0139ffc6180d069aa/pytorch_lightning/strategies/sharded.py#L88-L100
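For background on that behavior: a `rank_zero_only`-style decorator runs the wrapped function only on global rank 0 and returns `None` on every other rank, which is why the linked `optimizer_state` yields `None` when rank != 0. A minimal sketch of the pattern (an illustration only, not the library's actual implementation; reading the rank from the `RANK` environment variable is an assumption):

```python
import os
from functools import wraps
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def rank_zero_only(fn: Callable[..., T]) -> Callable[..., Optional[T]]:
    """Run ``fn`` only on global rank 0; all other ranks get ``None`` back."""

    @wraps(fn)
    def wrapped(*args, **kwargs) -> Optional[T]:
        if int(os.environ.get("RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None  # non-zero ranks: the function is never called

    return wrapped
```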
Yes, though this is not visible to users, as the checkpoint is ultimately only saved from rank 0, which contains all the states.
Thanks! Addressed the comment. BTW, do you mean the return value change in the non-rank-0 state dict is invisible to users?
Co-authored-by: ananthsub <[email protected]>
…orch-lightning into bug/optimizer_state_dict
Will redo once #11952 gets merged.
What does this PR do?
Fixes #6387
Does your PR introduce any breaking changes? If yes, please list them.
This check was removed.
https://github.com/PyTorchLightning/pytorch-lightning/blob/6f22b3623c28028026b3cb8bb534c1ebca9c5ac8/pytorch_lightning/strategies/sharded.py#L88-L90
However, the check was also removed in sharded_spawn.py, and since the optimizer type hint there is `OSS`, this breakage might not occur in practice.
https://github.com/PyTorchLightning/pytorch-lightning/blob/6f22b3623c28028026b3cb8bb534c1ebca9c5ac8/pytorch_lightning/strategies/sharded_spawn.py#L72-L75
See also the original PR that removed the check.
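For reference, here is a hedged end-to-end sketch of the flow this PR targets for `ZeroRedundancyOptimizer` outside of `DDPShardedStrategy`. The script, its name, and the launch command are illustrative assumptions, not code from this PR; it would be run with something like `torchrun --nproc_per_node=2 zero_sketch.py`:

```python
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

# torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT for us.
dist.init_process_group("gloo")  # use "nccl" on GPUs
rank = dist.get_rank()

model = torch.nn.Linear(4, 2)
optimizer = ZeroRedundancyOptimizer(
    model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-3
)

model(torch.randn(8, 4)).sum().backward()
optimizer.step()

# Each rank only holds its shard of the optimizer state;
# gather the full state onto rank 0 before saving a checkpoint.
optimizer.consolidate_state_dict(to=0)
full_state = optimizer.state_dict() if rank == 0 else {}
if rank == 0:
    torch.save(full_state, "optimizer.pt")

dist.destroy_process_group()
```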
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃