Add deepspeed support #817
Forgive me if I'm wrong, but doesn't Lightning already provide many of the functions supported by DeepSpeed? Also, going by a cursory reading of DeepSpeed, isn't it just another wrapper for PyTorch? Or am I wrong? |
I haven't had a chance to read it in depth, but this is likely a library that operates on top of models, which means Lightning can use it. |
Based on the CIFAR example, I think it's something like Lightning with more features. One option is to make Lightning depend on DeepSpeed for the training-related pieces while Lightning focuses on reproducibility. |
It's not like Lightning at all lol. It's more like apex... or ddp. To add support in Lightning we need to create a flag and follow the readme instructions: https://github.com/microsoft/DeepSpeed

API: create a deepspeed object for the configs.

Code changes: when the flag is enabled, the trainer does the following:
1. Init model and optimizers (like amp).
2. Do a slightly different forward (like ddp). Note: we need to forward to training_step, validation_step and test_step accordingly. See the DDP override.
3. Do a slightly different thing for checkpoint saving.
4. 16-bit and ddp: we need to make sure that when deepspeed is enabled we defer to the library so it can handle 16-bit and ddp.
5. Set up the config automatically: since the trainer flags have most of what's needed, we can automatically set up the config for the user (https://github.com/microsoft/DeepSpeed#deepspeed-configuration).
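For context, here is a minimal sketch of the DeepSpeed engine pattern the steps above refer to, based on the DeepSpeed README. The model and config values are placeholder assumptions, and the snippet assumes a distributed environment set up by the deepspeed launcher, so treat it as an illustration rather than the final integration.

```python
import torch
import torch.nn as nn
import deepspeed


class TinyModel(nn.Module):
    """Placeholder network standing in for a user's model."""

    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)


model = TinyModel()

# 1./4. deepspeed.initialize wraps the model and returns an engine that owns
#       the optimizer, 16-bit handling and distributed state.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params={  # placeholder config; normally built from the trainer flags
        "train_batch_size": 8,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    },
)

# 2./3. Forward, backward and step all go through the engine, which is why the
#       trainer needs a slightly different forward and checkpoint path.
for _ in range(10):
    x = torch.randn(8, 32, device=model_engine.device)
    loss = model_engine(x).sum()
    model_engine.backward(loss)
    model_engine.step()
```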
@neggert @jeffling anyone want to take this? |
@jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄 Awesome job! |
@williamFalcon okay, my bad, I didn't read through it properly. Btw, I think the Training Optimizers, Advanced Parameter Search and Simplified Data Loader seem like good features to include in Lightning when the DeepSpeed backend is used. Or is it better for the user to call them manually via the DeepSpeed library? |
@xingzhaolee these are all features we should automatically enable when someone uses the deepspeed backend. We should also make that configurable so users can modify it if they want to:
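(The snippet originally attached here isn't preserved. Below is a purely hypothetical sketch of what such a user-overridable hook could look like; the hook name `configure_deepspeed` and its contents are invented for illustration and are not an actual Lightning API.)

```python
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    # Hypothetical hook: the trainer would call this to build the DeepSpeed
    # config from sensible defaults, and users could override it to change
    # the behaviour (optimizer, fp16 settings, ZeRO stage, ...).
    def configure_deepspeed(self) -> dict:
        return {
            "train_batch_size": 32,
            "fp16": {"enabled": True},
            "zero_optimization": {"stage": 2},
        }
```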
Then if you want a different way of doing this, override this function and add your own version. |
@williamFalcon Thanks for reaching out to us, this could be great. We are having internal discussions about how to proceed and will get back to you soon. We're also in the process of learning more about Lightning, it looks like great work you all have done :) |
DeepSpeed is made up of many components, as @williamFalcon said above, but I think we're only after a few pieces here. ZeRO optimization is the key piece, and the code can be seen here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py Stages 1 and 2 of ZeRO optimization have been released (see the ZeRO paper for reference).
It's called an optimizer but it's a bit more involved. The goal API is to create an accelerator that encompasses this functionality, something like below:
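(The snippet originally attached to this comment isn't preserved; here is a hypothetical sketch of the kind of accelerator-style wrapper being described. The class and method names are invented for illustration only.)

```python
import torch


class ShardedAccelerator:
    """Hypothetical accelerator wrapping a ZeRO-style sharded optimizer.

    It only mirrors the responsibilities discussed above: hold the model,
    drive backward, and defer the step to the sharded optimizer.
    """

    def __init__(self, model: torch.nn.Module, sharded_optimizer) -> None:
        self.model = model
        # e.g. a ZeRO stage 1/2 optimizer that shards state across ranks
        self.optimizer = sharded_optimizer

    def backward(self, loss: torch.Tensor) -> None:
        # The sharded optimizer hooks into backward so gradients end up on
        # the rank that owns the corresponding optimizer state.
        loss.backward()

    def optimizer_step(self) -> None:
        self.optimizer.step()
        self.optimizer.zero_grad()
```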
I'm currently stepping through the optimizer and separating it from the fp16 components so it plays nicely with native amp. If anyone is interested, message me or comment! Happy to collab :) |
Sounds great! Looking forward to the v1 ;) |
FYI, we have an implementation of the optimizer side in https://github.com/facebookresearch/fairscale/blob/master/fairscale/optim/oss.py, compatible with standard pytorch (i.e. the same param groups, for instance, so that schedulers don't see a change). The issue with any implementation is that to get the full benefits you need to change the way the DP engine works; that's true of 1) and 2) above. If you keep the normal pytorch DDP then the gradients are all-reduced and you waste some traffic. cc @ananthsub |
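(For reference, a minimal sketch of how the fairscale OSS optimizer is used as a drop-in wrapper around a regular optimizer. It assumes a torch.distributed process group is already initialized, and exact keyword names may differ between fairscale versions.)

```python
import torch
from fairscale.optim.oss import OSS

model = torch.nn.Linear(32, 2)

# OSS shards optimizer state across data-parallel ranks while exposing the
# usual optimizer interface (param_groups, step, state_dict, ...), so LR
# schedulers keep working unchanged.
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1, momentum=0.9)

# The training step looks the same as with a plain optimizer.
loss = model(torch.randn(8, 32)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```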
Thanks @blefaudeux! That's a clean implementation :) Have you seen a large performance degradation using the fairscale + ddp implementation? |
We could drop a v1 that uses the non-optimized version first, then quickly move to a v2 where we modify the ddp stuff as well? We already have LightningDDP, which modifies the original ddp a bit. |
(With the standard DDP, using the linked optimizer as a drop-in "replacement" for, or rather a wrapper around, a normal optimizer): a couple of percent if multi-node, but that would depend on the interconnect. Intra-node it's actually faster, on top of saving memory. Now, with a more custom DDP like what deepspeed is doing there's a lot of potential in terms of speed and memory, but it's a little more complicated to integrate; working on it. I can mention pytorch/pytorch#42849 and pytorch/pytorch#37002 here; ideally it would be nice to be able to customize the communication patterns without duplicating/forking. |
I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there! |
You can save a bit more if you own the communications, because you can release the gradients as soon as they have been reduced to the appropriate rank; that's 2) basically. So 1) is drop-in (usable with normal DDP and you get some savings), you can get a 1.5 of sorts by releasing all the now-useless gradients at the beginning of the sharded optimizer step (that's what the fairscale implementation above does), and 2) is when you drop the gradients as soon as possible, earlier. As an example, on a toy problem training a ResNet-101 on 4 GPUs: first is DDP, second is OSS+DDP, third is OSS+custom DDP (the losses should be exactly the same; fixing that). |
Thanks @blefaudeux! We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further. Out of curiosity has there been any progress integrating gradient/parameter partitioning? |
Of course!
You might need to sync with Ananth (@ananthsub); within FB there's already a lightning/fairscale integration running, so it could be worth unifying the efforts?
I have an experimental branch which currently gives these results (last chunk, 'OSS experimental'), following the ideas presented in this RFC (split the model into chunks, use autograd hooks to load/drop the parameters on the fly while keeping reasonably close to pytorch, each rank owns the optimization for one chunk only); very much WIP though. The savings depend a lot on the model size and optimizer, and with this test problem the activations actually dominate, so it's not the most impressive use case (still useful). |
Just an update if anyone is tracking this: we technically haven't added 'DeepSpeed' support yet, but that's primarily a design choice, as the upstream API can be improved so it doesn't hurt the user experience. What this means is that FairScale, which has been integrated into Lightning for some time, currently provides most of the features while being accessible to all Lightning modules across different domains, and I highly suggest looking at our sharded training as a replacement. There are some exciting improvements coming up as well to continue improving memory/speed efficiency in FairScale and from other integrations :) |
Is there also going to be support for ZeRO-Offload? https://www.deepspeed.ai/tutorials/zero-offload/ Or does this also depend on FairScale implementing it? Just in case, I created a feature request here: facebookresearch/fairscale#337 |
An update here! DeepSpeed has finally been integrated as a plugin into Lightning, see our docs here. We've worked hard to make the API flexible whilst reducing friction as much as possible. If you run into any problems please leave an issue or message us on our PyTorch Lightning slack channel!
There already is; it's the staple feature, with presets out of the box, so you don't need to modify your code to use it (in most cases), and we also give instructions for tuning to reasonable parameters. Currently the plugin is available from PyTorch Lightning master, but we'll be releasing 1.2 soon with the feature, with technical details and benchmarks to come. The plugin does not yet support multiple optimizers, so you'll need to fall back on Sharded Training until we add this support to DeepSpeed! |
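(For illustration, a rough sketch of how the plugin could be enabled around the 1.2 release. Treat the details as assumptions to check against the docs: the `DeepSpeedPlugin` import path and the `cpu_offload` flag reflect the 1.2-era API and changed in later versions, and the model below is a toy placeholder.)

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin


class BoringModel(pl.LightningModule):
    """Tiny placeholder module, just enough to exercise the plugin."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)


# The string shorthand plugins="deepspeed" uses the default preset; passing an
# explicit plugin object lets you enable ZeRO-Offload to CPU (1.2-era flag name).
trainer = pl.Trainer(
    gpus=4,
    precision=16,
    plugins=DeepSpeedPlugin(cpu_offload=True),  # or simply plugins="deepspeed"
)
trainer.fit(BoringModel())
```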
This is super exciting!! :) Thanks for your contributions to DeepSpeed and all your work getting DeepSpeed integrated into Lightning! I think we have a chat coming up soon; we would love to hear more about any kinks you discovered with DeepSpeed and how we can iron those out together. I'm also especially curious about the details of any PyTorch incompatibilities with DeepSpeed. |
So epic to see you here @jeffra! You and your team have done amazing work; I can't wait to see what you guys come up with, and I hope we can assist in your work! Thank you for your kind words, but it's really your team that did all the work :) I'll hopefully be pushing a few PRs to DeepSpeed/Lightning to ease the integration (a lot of the issues are kinks we need to iron out on our end in PyTorch Lightning). These range from small issues like configuration of the throughput timer, to more involved changes like multi-optimizer/multi-scheduler support or allowing Lightning to control Apex/AMP settings outside of DeepSpeed. I'll be tracking these via issues so we can iterate on them and make the integration even better. I've been using DeepSpeed for a while now across a multitude of models and it's been an incredible experience given the level of customisability. I'm glad we're moving towards the community being able to fine-tune/train larger models! |
Thanks for the kind words @SeanNaren! :) Very much looking forward to a great collaboration going forward. |
Let's support this!
https://github.com/microsoft/DeepSpeed