Optimizer state sharding - Not landing as-is, early feedback #584
Conversation
This pull request was exported from Phabricator. Differential Revision: D22518768
Summary:
Pull Request resolved: #584

Bringing in fairscale to provide an optional state-sharded optimizer in Classy, which should help in situations bounded by memory pressure. No new communication backend; this uses vanilla torch.distributed. See ZeRO for more context: https://www.microsoft.com/en-us/research/blog/zero-2-deepspeed-shattering-barriers-of-deep-learning-speed-scale/

KNOWN TODOs:
- [ ] Huge memory discrepancy between the two runs.
- [ ] Huge speed discrepancy -> these probably come from the many small broadcasts; will be improved on the fairscale side, not related to this diff (T71319397 and facebookresearch/fairscale#42).
- [x] Final accuracy is in the same ballpark but the behaviours are very different; could be some settings not properly passed down, an issue with LARC, or the parameter scheduling -> this was due to the LR not being properly adjusted, fixed since.
- [x] Sync with min-xu-ai to use a proper gradient dispatch in the end; not landing anything before that -> done by min-xu-ai on the fairscale side, still needs benchmarking, but should not be related to this diff (no interface consequence hopefully).

Differential Revision: D22518768

fbshipit-source-id: ea79e3561580e21030123dca299f5c935ee971f4
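For context, a minimal sketch of what wrapping a stock optimizer with fairscale's OSS looks like. This is not the Classy integration from this diff, and the exact keyword names may differ across fairscale versions; treat the arguments below as assumptions.

```python
# Illustrative sketch only: wrapping a regular optimizer with fairscale's OSS
# (optimizer state sharding). Runs on top of the default torch.distributed
# process group; no extra communication backend is needed.
import torch
from fairscale.optim.oss import OSS


def build_sharded_optimizer(model: torch.nn.Module) -> OSS:
    # OSS shards the wrapped optimizer's state across ranks, so each rank
    # only holds momentum/Adam buffers for the parameters it owns.
    return OSS(
        params=model.parameters(),
        optim=torch.optim.SGD,   # the underlying optimizer to shard
        lr=0.1,                  # extra kwargs are forwarded to that optimizer
        momentum=0.9,
        weight_decay=1e-4,
    )
```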
This pull request has been merged in ac2993d.
Summary:
Bringing in fairscale to provide an optional state-sharded optimizer in Classy, which should help in situations bounded by memory pressure.
No new communication backend; this uses vanilla torch.distributed.
Differential Revision: D22518768
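To make the "vanilla torch.distributed" point concrete, here is a toy sketch of ZeRO-style optimizer state sharding. It is not the fairscale implementation (round-robin ownership is an assumption for illustration; fairscale partitions parameters more carefully), but it shows why only broadcasts on the default process group are needed.

```python
# Toy illustration of optimizer state sharding over plain torch.distributed.
import torch
import torch.distributed as dist


class NaiveShardedSGD:
    """Each rank keeps SGD state only for the parameters it owns, then
    broadcasts the updated weights so all replicas stay in sync."""

    def __init__(self, params, lr=0.1, momentum=0.9):
        self.rank = dist.get_rank()
        self.world = dist.get_world_size()
        self.params = list(params)
        # Round-robin ownership of parameters across ranks (simplification).
        owned = self.params[self.rank :: self.world]
        self.optim = torch.optim.SGD(owned, lr=lr, momentum=momentum)

    def step(self):
        # Gradients are assumed to already be reduced across ranks (e.g. by DDP).
        self.optim.step()
        # Broadcast each parameter from the rank that owns its state; the many
        # small broadcasts are the speed cost noted in the TODOs above.
        for i, p in enumerate(self.params):
            dist.broadcast(p.data, src=i % self.world)
```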