Optimizer state sharding - Not landing as-is, early feedback #584
Conversation
This pull request was exported from Phabricator. Differential Revision: D22518768
Summary:
Pull Request resolved: #584

Bringing in fairscale to provide an optional state-sharded optimizer in Classy, which should help in situations bounded by memory pressure. No new communication backend; this uses vanilla torch.distributed. See ZeRO for more context: https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/

KNOWN TODOs:
- [ ] Huge memory discrepancy between the two runs.
- [ ] Huge speed discrepancy -> these probably come from the many small broadcasts; will be improved on the fairscale side, not related to this diff (T71319397 and facebookresearch/fairscale#42).
- [x] Final accuracy is in the same ballpark but the behaviours are very different; could be some settings not properly passed down, an issue with LARC, or the parameter scheduling -> this was due to the LR not being properly adjusted, fixed since.
- [x] Sync with min-xu-ai to use a proper gradient dispatch in the end; not landing anything before that -> done by min-xu-ai on the fairscale side, still needs benchmarking, but should not be related to this diff (no interface consequence hopefully).

Differential Revision: D22518768

fbshipit-source-id: ea79e3561580e21030123dca299f5c935ee971f4
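For context, a minimal sketch of what wrapping a stock optimizer with fairscale's OSS looks like. This is not the Classy integration from this diff, and the exact keyword names may differ across fairscale versions; treat the arguments below as assumptions.

```python
# Illustrative sketch only: wrapping a regular optimizer with fairscale's OSS
# (optimizer state sharding). Runs on top of the default torch.distributed
# process group; no extra communication backend is needed.
import torch
from fairscale.optim.oss import OSS


def build_sharded_optimizer(model: torch.nn.Module) -> OSS:
    # OSS shards the wrapped optimizer's state across ranks, so each rank
    # only holds momentum/Adam buffers for the parameters it owns.
    return OSS(
        params=model.parameters(),
        optim=torch.optim.SGD,   # the underlying optimizer to shard
        lr=0.1,                  # extra kwargs are forwarded to that optimizer
        momentum=0.9,
        weight_decay=1e-4,
    )
```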
This pull request has been merged in ac2993d.
Summary:
Bringing in fairscale to provide an optional state-sharded optimizer in Classy, which should help in situations bounded by memory pressure.
No new communication backend; this uses vanilla torch.distributed.
Differential Revision: D22518768
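To make the "vanilla torch.distributed" point concrete, here is a toy sketch of ZeRO-style optimizer state sharding. It is not the fairscale implementation (round-robin ownership is an assumption for illustration; fairscale partitions parameters more carefully), but it shows why only broadcasts on the default process group are needed.

```python
# Toy illustration of optimizer state sharding over plain torch.distributed.
import torch
import torch.distributed as dist


class NaiveShardedSGD:
    """Each rank keeps SGD state only for the parameters it owns, then
    broadcasts the updated weights so all replicas stay in sync."""

    def __init__(self, params, lr=0.1, momentum=0.9):
        self.rank = dist.get_rank()
        self.world = dist.get_world_size()
        self.params = list(params)
        # Round-robin ownership of parameters across ranks (simplification).
        owned = self.params[self.rank :: self.world]
        self.optim = torch.optim.SGD(owned, lr=lr, momentum=momentum)

    def step(self):
        # Gradients are assumed to already be reduced across ranks (e.g. by DDP).
        self.optim.step()
        # Broadcast each parameter from the rank that owns its state; the many
        # small broadcasts are the speed cost noted in the TODOs above.
        for i, p in enumerate(self.params):
            dist.broadcast(p.data, src=i % self.world)
```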