CUDA OOM when running trainer.validate() with deepspeed at optimizer initialization (?) #14928

Closed
DrMatters opened this issue Sep 28, 2022 · 3 comments · Fixed by #14944
Assignees: rohitgr7
Labels: bug, strategy: deepspeed


DrMatters commented Sep 28, 2022

First check

  • I'm sure this is a bug.
  • I've added a descriptive title to this bug.
  • I've provided clear instructions on how to reproduce the bug.
  • I've added a code sample.
  • I've provided any other important info that is required.

Bug description

I have encountered some strange behavior while using the Trainer with the DeepSpeed strategy.

  1. Simple setup (single GPU, no strategy, optimizer='adamw', eval_batch_size=16):
    • Running trainer.validate() right after trainer initialization works correctly.
  2. DeepSpeed setup (strategy='deepspeed_stage_2', devices=[0], optimizer='adamw'):
    • Running trainer.validate() immediately after trainer initialization results in a CUDA OOM error. It happens at the optimizer initialization step, as the attached traceback shows.
    • If I run trainer.fit() instead of trainer.validate(), it starts fitting the model without problems.
    • If I run trainer.test() after trainer.fit() has finished 1 epoch, the same error appears.

Does it make any sense to initialize the optimizer for validation/test? This looks like a legitimate bug to me. A minimal sketch of the call pattern is included below.
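
For reference, here is a minimal sketch of the two setups described above. It is not the script from the gist: the model, data, and sizes are placeholders, and it assumes PyTorch Lightning 1.7.x with deepspeed installed and at least one CUDA device.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class PlaceholderModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.cross_entropy(self.layer(x), y))

    def configure_optimizers(self):
        # DeepSpeed initializes this optimizer even for validate()/test(),
        # which is where the reported OOM happens.
        return torch.optim.AdamW(self.parameters(), lr=1e-3)


def loader():
    data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    return DataLoader(data, batch_size=16)


if __name__ == "__main__":
    # 1. Simple setup: validate() right after Trainer construction works.
    trainer = pl.Trainer(accelerator="gpu", devices=[0], max_epochs=1)
    trainer.validate(PlaceholderModel(), dataloaders=loader())

    # 2. DeepSpeed setup: the same call goes through DeepSpeed's optimizer
    #    initialization and can OOM once the model is large enough.
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=[0],
        strategy="deepspeed_stage_2",
        precision=16,
        max_epochs=1,
    )
    trainer.validate(PlaceholderModel(), dataloaders=loader())

With a model this small the DeepSpeed run will not actually OOM; the point is only that trainer.validate() already goes through DeepSpeed's optimizer initialization path.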

How to reproduce the bug

The code is here:
https://gist.github.com/DrMatters/0630919cb82bee6035502213845733b3

This code is a fusion of the Lightning tutorial on fine-tuning with Hugging Face and the Hugging Face tutorial on fine-tuning a language model.

Running python faulty_deepspeed_eval.py --trainer.devices [0] validates and fits properly.
Running python faulty_deepspeed_eval.py --trainer.devices [0,1] --trainer.strategy deepspeed_stage_2 fails before validation even starts.

Error messages and logs

Traceback (most recent call last):
  File "faulty_deepspeed_eval.py", line 234, in <module>
    main()
  File "faulty_deepspeed_eval.py", line 227, in main
    trainer.validate(model, dm)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
    return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
    self.strategy.setup(self)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
    self.init_deepspeed()
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
    self._initialize_deepspeed_inference(model)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
    model, _, _, _ = deepspeed.initialize(
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/__init__.py", line 124, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 320, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1330, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 111, in __init__
    self.initialize_optimizer_states()
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 443, in initialize_optimizer_states
    self.optimizer.step()
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/adamw.py", line 129, in step
    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 3; 47.54 GiB total capacity; 45.57 GiB already allocated; 79.81 MiB free; 45.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
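
As a side note (not from the original report), the last frame above shows where the extra memory comes from: on its first step(), AdamW lazily allocates exp_avg and exp_avg_sq buffers the same size as each parameter, so initializing optimizer states adds roughly two extra copies of the (master) parameters. A tiny illustration, assuming a CUDA device is available:

import torch

# One 1024x1024 fp32 parameter (~4 MiB) with a gradient already in place.
p = torch.nn.Parameter(torch.zeros(1024, 1024, device="cuda"))
p.grad = torch.zeros_like(p)
opt = torch.optim.AdamW([p])

before = torch.cuda.memory_allocated()
opt.step()  # the first step creates exp_avg and exp_avg_sq, as in the traceback
after = torch.cuda.memory_allocated()
print(f"optimizer state added {(after - before) / 2**20:.1f} MiB")  # ~8 MiB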

Important info

- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer
- PyTorch Lightning Version (e.g., 1.5.0): 1.7.6
- PyTorch Version (e.g., 1.10): 1.11.0
- Python version (e.g., 3.9): 3.8.13
- OS (e.g., Linux): Ubuntu 20.04.3
- CUDA/cuDNN version: 11.7
- GPU models and configuration: NVIDIA RTX A6000 (48GB) [x4]
- How you installed Lightning (`conda`, `pip`, source): pip

More info

I mentioned this in the Lightning community Slack. cc @rohitgr7

DrMatters added the bug and needs triage labels on Sep 28, 2022
DrMatters changed the title from "CUDA OOM when running trainer.validate() when initializing optimizer with deepspeed" to "CUDA OOM when running trainer.validate() with deepspeed at optimizer initialization (?)" on Sep 28, 2022
rohitgr7 added the strategy: deepspeed label and removed the needs triage label on Sep 28, 2022
rohitgr7 self-assigned this on Sep 28, 2022
DrMatters (Author) commented

Update: the gist has been updated.

rohitgr7 (Contributor) commented

Hey @DrMatters, do you mind checking with the PR branch to see if it works for you?


DrMatters commented Sep 29, 2022

Seems like this branch is working correctly in my experiments!
