CUDA OOM when running trainer.validate() with deepspeed at optimizer initialization (?) #14928

Closed
DrMatters opened this issue Sep 28, 2022 · 3 comments · Fixed by #14944
Assignees: rohitgr7
Labels: bug, strategy: deepspeed


DrMatters commented Sep 28, 2022

First check

  • I'm sure this is a bug.
  • I've added a descriptive title to this bug.
  • I've provided clear instructions on how to reproduce the bug.
  • I've added a code sample.
  • I've provided any other important info that is required.

Bug description

I have encountered some strange behavior while using the Trainer with the DeepSpeed strategy.

  1. Simple setup (single GPU, no strategy, optimizer='adamw', eval_batch_size=16):
    • Running trainer.validate() right after trainer initialization works correctly.
  2. DeepSpeed setup (strategy='deepspeed_stage_2', devices=[0], optimizer='adamw'):
    • Running trainer.validate() immediately after trainer initialization results in a CUDA OOM error. It happens at the optimizer initialization step, as the attached traceback shows.
    • If I run trainer.fit() instead of trainer.validate(), it starts fitting the model without problems.
    • If I run trainer.test() after trainer.fit() has finished 1 epoch, the same error appears.

Does it make any sense to initialize the optimizer for validation/test? This looks like a legitimate bug to me. A minimal sketch of the call pattern is included below.
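
For reference, here is a minimal sketch of the two setups described above. It is not the script from the gist: the model, data, and sizes are placeholders, and it assumes PyTorch Lightning 1.7.x with deepspeed installed and at least one CUDA device.

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class PlaceholderModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", F.cross_entropy(self.layer(x), y))

    def configure_optimizers(self):
        # DeepSpeed initializes this optimizer even for validate()/test(),
        # which is where the reported OOM happens.
        return torch.optim.AdamW(self.parameters(), lr=1e-3)


def loader():
    data = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    return DataLoader(data, batch_size=16)


if __name__ == "__main__":
    # 1. Simple setup: validate() right after Trainer construction works.
    trainer = pl.Trainer(accelerator="gpu", devices=[0], max_epochs=1)
    trainer.validate(PlaceholderModel(), dataloaders=loader())

    # 2. DeepSpeed setup: the same call goes through DeepSpeed's optimizer
    #    initialization and can OOM once the model is large enough.
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=[0],
        strategy="deepspeed_stage_2",
        precision=16,
        max_epochs=1,
    )
    trainer.validate(PlaceholderModel(), dataloaders=loader())

With a model this small the DeepSpeed run will not actually OOM; the point is only that trainer.validate() already goes through DeepSpeed's optimizer initialization path.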

How to reproduce the bug

The code is here:
https://gist.github.com/DrMatters/0630919cb82bee6035502213845733b3

This code is a fusion of the Lightning tutorial on fine-tuning with Hugging Face and the Hugging Face tutorial on fine-tuning a language model.

Running python faulty_deepspeed_eval.py --trainer.devices [0] validates and fits properly.
Running python faulty_deepspeed_eval.py --trainer.devices [0,1] --trainer.strategy deepspeed_stage_2 fails before validation even starts.

Error messages and logs

Traceback (most recent call last):
  File "faulty_deepspeed_eval.py", line 234, in <module>
    main()
  File "faulty_deepspeed_eval.py", line 227, in main
    trainer.validate(model, dm)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
    return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
    self.strategy.setup(self)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
    self.init_deepspeed()
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
    self._initialize_deepspeed_inference(model)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
    model, _, _, _ = deepspeed.initialize(
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/__init__.py", line 124, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 320, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
    self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1330, in _configure_fp16_optimizer
    optimizer = FP16_UnfusedOptimizer(
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 111, in __init__
    self.initialize_optimizer_states()
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 443, in initialize_optimizer_states
    self.optimizer.step()
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/adamw.py", line 129, in step
    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 3; 47.54 GiB total capacity; 45.57 GiB already allocated; 79.81 MiB free; 45.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
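
As a side note (not from the original report), the last frame above shows where the extra memory comes from: on its first step(), AdamW lazily allocates exp_avg and exp_avg_sq buffers the same size as each parameter, so initializing optimizer states adds roughly two extra copies of the (master) parameters. A tiny illustration, assuming a CUDA device is available:

import torch

# One 1024x1024 fp32 parameter (~4 MiB) with a gradient already in place.
p = torch.nn.Parameter(torch.zeros(1024, 1024, device="cuda"))
p.grad = torch.zeros_like(p)
opt = torch.optim.AdamW([p])

before = torch.cuda.memory_allocated()
opt.step()  # the first step creates exp_avg and exp_avg_sq, as in the traceback
after = torch.cuda.memory_allocated()
print(f"optimizer state added {(after - before) / 2**20:.1f} MiB")  # ~8 MiB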

Important info

- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer
- PyTorch Lightning Version (e.g., 1.5.0): 1.7.6
- PyTorch Version (e.g., 1.10): 1.11.0
- Python version (e.g., 3.9): 3.8.13
- OS (e.g., Linux): Ubuntu 20.04.3
- CUDA/cuDNN version: 11.7
- GPU models and configuration: NVIDIA RTX A6000 (48GB) [x4]
- How you installed Lightning (`conda`, `pip`, source): pip

More info

I mentioned this in the Lightning community Slack. cc @rohitgr7

DrMatters added the bug and needs triage labels on Sep 28, 2022
DrMatters changed the title from "CUDA OOM when running trainer.validate() when initializing optimizer with deepspeed" to "CUDA OOM when running trainer.validate() with deepspeed at optimizer initialization (?)" on Sep 28, 2022
rohitgr7 added the strategy: deepspeed label and removed the needs triage label on Sep 28, 2022
rohitgr7 self-assigned this on Sep 28, 2022
DrMatters (Author) commented

Update: the gist has been updated.

rohitgr7 (Contributor) commented

Hey @DrMatters, do you mind checking with the PR branch to see if it works for you?


DrMatters commented Sep 29, 2022

Seems like this branch is working correctly in my experiments!
