
Add deepspeed support #817

Closed
williamFalcon opened this issue Feb 11, 2020 · 29 comments · Fixed by #5954

@williamFalcon
Contributor

Let's support this!

https://github.com/microsoft/DeepSpeed

@williamFalcon williamFalcon added the feature and help wanted labels Feb 11, 2020
@sudarshan85

Forgive me if I'm wrong, but doesn't Lightning already provide many functions supported by DeepSpeed? Also, going by a cursory reading of DeepSpeed, isn't it just another wrapper for PyTorch? Or am I wrong?

@williamFalcon
Contributor Author

I haven't had a chance to read it in depth, but this is likely a library that operates on top of models, which means Lightning can use it.

@ghost

ghost commented Feb 12, 2020

I think it's something like Lightning with more features, based on the CIFAR example. One way is to make Lightning depend on DeepSpeed for training-related stuff while Lightning focuses on reproducibility.

@williamFalcon
Contributor Author

williamFalcon commented Feb 12, 2020

It's not like Lightning at all lol. It's more like Apex... or DDP.

To add support in Lightning we need to create a flag and follow the README instructions:

https://github.com/microsoft/DeepSpeed

API

Create a deepspeed object for the configs.

Code changes

When the flag is enabled

Trainer(distributed_backend='deepspeed')

OR 
Trainer(backend_engine='deepspeed')

The trainer does the following:

1. Init model, optimizers (like amp)

model_engine, optimizer, _, _ = deepspeed.initialize(args=cmd_args,
                                                     model=model,
                                                     model_parameters=params)

2. do a slightly different forward (like ddp)

Note: We need to forward to training_step, validation_step and test_step accordingly. See the DDP override; a sketch of such a wrapper follows the loop below.

for step, batch in enumerate(data_loader):
    #forward() method
    loss = model_engine(batch)

    #runs backpropagation
    model_engine.backward(loss)

    #weight update
    model_engine.step()
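For context, here is a minimal sketch of such a wrapper (not Lightning's actual DDP override; the class name is made up, and it assumes the LightningModule carries a trainer reference):

import torch.nn as nn

class LightningDeepSpeedWrapper(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, pl_module):
        super().__init__()
        self.module = pl_module  # the user's LightningModule

    def forward(self, *args, **kwargs):
        # Dispatch so that model_engine(batch) ends up running the user's
        # training_step / validation_step / test_step.
        if self.module.training:
            return self.module.training_step(*args, **kwargs)
        if getattr(self.module.trainer, "testing", False):
            return self.module.test_step(*args, **kwargs)
        return self.module.validation_step(*args, **kwargs)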

3. do a slightly different thing for checkpoint saving

model_engine.save_checkpoint(args.save_dir, ckpt_id, client_sd=client_sd)

4. 16-bit and DDP

We need to make sure that when DeepSpeed is enabled we defer to the library, so it can handle 16-bit and DDP.

5. set up config automatically

Since the trainer flags have most of what's needed, we can automatically set up the config for the user (https://github.com/microsoft/DeepSpeed#deepspeed-configuration).

{
  "train_batch_size": 8,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "zero_optimization": true,
  "disable_allgather": true,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.00015,
      "max_grad_norm": 1.0
    }
  },

  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
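As a rough illustration of point 5 (the helper name and the exact flag mapping are assumptions, not a settled design), the config could be built from flags the Trainer already has:

def deepspeed_config_from_trainer(trainer, train_batch_size):
    # Hypothetical helper: build a DeepSpeed config like the JSON above
    # from Trainer flags the user has already set.
    return {
        "train_batch_size": train_batch_size,
        "gradient_accumulation_steps": trainer.accumulate_grad_batches,
        "zero_optimization": True,
        "fp16": {"enabled": trainer.precision == 16},
    }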

@neggert @jeffling anyone want to take this?
@luiscape might be a good issue to try?
@Borda also a good issue to start with

@williamFalcon williamFalcon added this to the 0.6.1 milestone Feb 12, 2020
@williamFalcon
Contributor Author

williamFalcon commented Feb 12, 2020

@jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄

Awesome job!

@williamFalcon williamFalcon modified the milestones: 0.6.1, 0.7.0 Feb 12, 2020
@ghost

ghost commented Feb 12, 2020

@williamFalcon okay my bad, I didn't read through it properly. Btw, I think the Training Optimizers, Advanced Parameter Search and Simplified Data Loader seem like good features to include in Lightning if the DeepSpeed backend is used. Or is it better for users to call them manually using the DeepSpeed library?

@williamFalcon
Contributor Author

williamFalcon commented Feb 12, 2020

@xingzhaolee these are all features we should automatically enable when someone uses the deepspeed backend.

We should also make that configurable so users can modify it if they want to:

def configure_deepspeed(self, *args, **kwargs):
    # do auto setup stuff for users (build and return the DeepSpeed config)
    ...

Then if you want a different way of doing this, override this function and add your own version.
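For example, an override might look like this (sketch only; the hook itself is hypothetical at this point):

import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def configure_deepspeed(self):
        # Return a custom DeepSpeed config instead of the auto-generated one.
        return {
            "train_batch_size": 16,
            "zero_optimization": True,
            "fp16": {"enabled": True},
        }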

@jeffra

jeffra commented Feb 13, 2020

@jeffra, @ShadenSmith, @samyam, anyone interested in adding this to lightning? 😄

Awesome job!

@williamFalcon Thanks for reaching out to us, this could be great. We are having internal discussions about how to proceed and will get back to you soon. We're also in the process of learning more about Lightning, it looks like great work you all have done :)

@Borda Borda modified the milestones: 0.6.1, 0.6.2 Feb 25, 2020
@Borda Borda modified the milestones: 0.7.2, 0.7.3 Apr 3, 2020
@Borda Borda modified the milestones: 0.7.4, 0.7.5 Apr 24, 2020
@Borda Borda modified the milestones: 0.7.6, 0.8.0, 0.7.7 May 13, 2020
@Borda Borda removed this from the 0.7.7 milestone May 26, 2020
@SeanNaren
Contributor

DeepSpeed is made up of many components as @williamFalcon said above, but I think we're only after a few pieces here. ZeRO optimization is the key piece, and the code can be seen here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage2.py

Stages 1 and 2 of ZeRO optimization have been released; for reference (from the paper):

1) Optimizer State Partitioning (Pos): 4x memory reduction, same communication volume as DP;
2) Add Gradient Partitioning (Pos+g): 8x memory reduction, same communication volume as DP;
3) Add Parameter Partitioning (Pos+g+p): memory reduction is linear with the DP degree Nd. For example, splitting across 64 GPUs (Nd = 64) yields a 64x memory reduction, with a modest 50% increase in communication volume.

It's called an optimizer but it's a bit more involved. The goal API is to create an accelerator that encompasses this functionality, something like below:

model = MegatronLM()  # too big for single-GPU training

trainer = Trainer(
    accelerator='ddp',
    gpus=2
)
trainer.fit(model)  # crashes because of CUDA out of memory

trainer = Trainer(
    accelerator='deepspeed',
    gpus=2
)
trainer.fit(model)  # actually trains!

I'm currently stepping through the optimizer and separating it from the fp16 components to play nice with native amp. If anyone is interested message me or comment! Happy to collab :)
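For reference, this is the native AMP (torch.cuda.amp) pattern a sharded optimizer has to play nicely with; a generic toy sketch with stand-in model and data, not DeepSpeed- or fairscale-specific code:

import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(32, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

for _ in range(10):  # toy loop with random data
    batch = torch.randn(8, 32, device="cuda")
    target = torch.randint(0, 10, (8,), device="cuda")
    optimizer.zero_grad()
    with autocast():
        loss = criterion(model(batch), target)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads, then steps the optimizer
    scaler.update()                # adjusts the loss scale for the next step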

@javismiles

sounds great! looking forward to the v1 ;)

yup! we're actively working on this. Expect a v1 of it in the next few weeks via an RC. (cc @SeanNaren)

@blefaudeux


FYI we have an implementation of the optimizer side in https://github.com/facebookresearch/fairscale/blob/master/fairscale/optim/oss.py, compatible with standard PyTorch (i.e. the same param groups, for instance, so that schedulers don't see a change). The issue with any implementation is that to get the full benefits you need to change the way the DP engine works, though; that's true of 1) and 2) above. If you keep the normal PyTorch DDP, the gradients are all-reduced and you waste some traffic. cc @ananthsub
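For anyone skimming, a minimal sketch of using the linked OSS wrapper as a drop-in around a regular optimizer (assuming the fairscale API of that time, and an already-initialized process group):

import torch
from fairscale.optim.oss import OSS

# Assumes torch.distributed.init_process_group(...) has already been called,
# as it would be inside a DDP launch.
model = torch.nn.Linear(32, 10)
optimizer = OSS(
    params=model.parameters(),
    optim=torch.optim.Adam,  # the wrapped per-rank optimizer class
    lr=1e-3,                 # kwargs are forwarded to the wrapped optimizer
)
# The training loop is unchanged: loss.backward(); optimizer.step()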

@SeanNaren
Contributor

Thanks @blefaudeux! That's a clean implementation :) Have you seen a large performance degradation using the fairscale + DDP implementation?

@williamFalcon
Contributor Author

We could ship a v1 that uses the non-optimized version first, then quickly move to a v2 where we modify the DDP stuff as well?

We already have LightningDDP, which modifies the original DDP a bit.

@blefaudeux

thanks @blefaudeux! That's a clean implementation :) have you seen a large performance degradation using the fairscale + ddp implementation?

With the standard DDP, using the linked optimizer as a drop-in "replacement" (more precisely, a wrapper) for a normal optimizer: a couple of percent if multi-node, but that would depend on the interconnect. Intra-node it's actually faster, on top of saving memory.

Now, with a more custom DDP like what DeepSpeed is doing, there's a lot of potential in terms of speed and memory, but it's a little more complicated to integrate; I'm working on it. I can mention pytorch/pytorch#42849 and pytorch/pytorch#37002 here: ideally it would be nice to be able to customize the communication patterns without duplicating/forking.

@SeanNaren
Contributor

I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there!

@blefaudeux

blefaudeux commented Oct 14, 2020

I assume the memory saving is pretty much the same? I think that's definitely key, so as @williamFalcon said we could start from there!

You can save a bit more if you own the communications, because you can release the gradients as soon as they have been reduced to the appropriate rank; that's basically 2). So 1) is a drop-in (usable with normal DDP, and you get some savings); you can get a "1.5" of sorts by releasing all the now-useless gradients at the beginning of the sharded optimizer step (that's what the fairscale implementation above does); and 2) is when you drop the gradients as soon as possible, earlier. Example: a toy problem training a ResNet-101 on 4 GPUs, where the first run is DDP, the second OSS+DDP, and the third OSS+custom DDP (the losses should be exactly the same; fixing that).
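To make the "1.5" trick concrete, a toy sketch (not fairscale's actual code; the names are made up) of dropping gradients this rank does not own before stepping:

def sharded_step(optimizer, owned_params, all_params):
    # Gradients for parameters owned by other ranks have already been
    # reduced to their owners, so this rank can free them before stepping.
    owned = {id(p) for p in owned_params}
    for p in all_params:
        if id(p) not in owned:
            p.grad = None  # release the now-useless gradient memory early
    optimizer.step()       # the sharded optimizer only updates owned params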

@SeanNaren
Contributor

Thanks @blefaudeux! We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further. Out of curiosity has there been any progress integrating gradient/parameter partitioning?

@blefaudeux

blefaudeux commented Oct 14, 2020

Thanks @blefaudeux!

Of course!

We'll get the fairscale OSS integrated into a new accelerator, then look towards DDP changes to reduce the overhead further.

You might need to sync with Ananth (@ananthsub); within FB there's already a Lightning/fairscale integration running, so it could be worth unifying the efforts?

Out of curiosity has there been any progress integrating gradient/parameter partitioning?

I have an experimental branch which currently gives these results (last chunk, 'OSS experimental'), following the ideas presented in this RFC (split the model into chunks, use autograd hooks to load/drop the parameters on the fly while keeping reasonably close to PyTorch, and have each rank own the optimization for one chunk only); it's very much WIP though. The savings depend a lot on the model size and optimizer, and with this test problem the activations actually dominate, so it's not the most impressive use case (still useful).
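Roughly, the "load/drop parameters on the fly" idea can be wired up with module hooks like this (a loose sketch, not the actual branch; gather_fn and drop_fn are placeholder callbacks):

import torch.nn as nn

def attach_chunk_hooks(chunk: nn.Module, gather_fn, drop_fn):
    # Materialize a chunk's parameters right before its forward pass and
    # release them right after, so only one chunk is fully resident at a time.
    def _pre_forward(module, inputs):
        gather_fn(module)  # e.g. all-gather this chunk's parameter shard

    def _post_forward(module, inputs, output):
        drop_fn(module)    # e.g. free the gathered full parameters again

    chunk.register_forward_pre_hook(_pre_forward)
    chunk.register_forward_hook(_post_forward)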

@SeanNaren
Contributor

Just an update if anyone is tracking this: we technically haven't added 'DeepSpeed' support yet, but that's primarily a design choice, as the upstream API can be improved so it doesn't hurt the user experience.

What this means is that FairScale, which has been integrated into Lightning for some time, currently provides most of the features whilst being accessible to all LightningModules across different domains, and I highly suggest looking at our sharded training as a replacement. There are some exciting improvements coming up as well to keep pushing memory/speed efficiency in FairScale and from other integrations :)
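If you want to try it, sharded training is a one-flag change on the Trainer; a sketch, assuming the plugin alias from the Lightning release of that time:

from pytorch_lightning import Trainer

trainer = Trainer(
    gpus=4,
    precision=16,
    plugins='ddp_sharded',  # FairScale sharded training (alias per the docs of that era)
)
trainer.fit(model)  # `model` is your existing LightningModule, unchanged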

@Spenhouet

Is there also going to be support for ZeRO-Offload?

https://www.deepspeed.ai/tutorials/zero-offload/

Or does this also depend on FairScale implementing it? Just in case, I created a feature request here: facebookresearch/fairscale#337

@edenlightning edenlightning modified the milestones: 1.2, 1.3 Feb 8, 2021
@SeanNaren SeanNaren mentioned this issue Feb 13, 2021
@edenlightning edenlightning modified the milestones: 1.3, 1.2 Feb 16, 2021
@SeanNaren
Contributor

SeanNaren commented Feb 17, 2021

An update here! DeepSpeed has finally been integrated as a plugin into Lightning; see our docs here. We've worked hard to make the API flexible whilst reducing friction as much as possible.

If you run into any problems, please open an issue or message us on our PyTorch Lightning Slack channel!

Is there also going to be support for ZeRO-Offload?

There already is, and it's the staple feature, with presets out of the box so you don't need to modify your code to use it (in most cases); we also give instructions for tuning to reasonable parameters.

Currently the plugin is available from PyTorch Lightning master, but we'll be releasing 1.2 with the feature soon, with technical details and benchmarks to come.

Currently the plugin does not support multiple optimizers, so you'll need to fall back on Sharded Training while we add this support to DeepSpeed!
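A minimal usage sketch (the exact flag spelling is per my reading of the 1.2-era docs; double-check the linked docs):

from pytorch_lightning import Trainer

trainer = Trainer(
    gpus=4,
    precision=16,         # ZeRO-Offload runs in fp16
    plugins='deepspeed',  # preset DeepSpeed config with ZeRO-Offload enabled
)
trainer.fit(model)  # `model` is your existing LightningModule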

@williamFalcon
Contributor Author

waaa

@jeffra

jeffra commented Feb 17, 2021

(Quoting @SeanNaren's update above:) I'd once again like to suggest that users first check out Sharded Training, as it works out of the box for more use cases and has complete Lightning support, while we are still ironing out kinks with DeepSpeed. FairScale will be introducing some exciting features like ZeRO-Offload whilst staying PyTorch-compliant, so keep an eye out :)

This is super exciting!! :) Thanks for your contributions to DeepSpeed and all your work getting DeepSpeed integrated into Lightning! I think we have a chat coming up soon; we would love to hear more about any kinks you discovered with DeepSpeed and how we can iron those out together. We're also especially curious about the details regarding PyTorch incompatibilities with DeepSpeed.

@SeanNaren
Contributor

So epic to see you here @jeffra, you and your team have done amazing work; I can't wait to see what you come up with, and I hope we can assist in your work! Thank you for your kind words, but it's really your team that did all the work :)

I'll hopefully be pushing a few PRs to DeepSpeed/Lightning to ease integration (a lot of the issues are kinks we need to iron out on our end in PyTorch Lightning). These range from small issues, like configuration of the throughput timer, to more involved changes, like multi-optimizer/multi-scheduler support or allowing Lightning to control Apex/AMP settings outside of DeepSpeed.

I'll be tracking these via issues so we can iterate on them and make the integration even better. I've been using DeepSpeed for a while now across a multitude of models, and the level of customisability has made it an incredible experience. I'm glad we're moving towards the community being able to fine-tune/train larger models!

@jeffra

jeffra commented Feb 18, 2021

Thanks for the kind words @SeanNaren! :) Very much looking forward to a great collaboration going forward.
