Error messages and logs
Traceback (most recent call last):
File "faulty_deepspeed_eval.py", line 234, in <module>
main()
File "faulty_deepspeed_eval.py", line 227, in main
trainer.validate(model, dm)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 774, in validate
return self._call_and_handle_interrupt(self._validate_impl, model, dataloaders, ckpt_path, verbose, datamodule)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 821, in _validate_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1147, in _run
self.strategy.setup(self)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 376, in setup
self.init_deepspeed()
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 492, in init_deepspeed
self._initialize_deepspeed_inference(model)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/pytorch_lightning/strategies/deepspeed.py", line 602, in _initialize_deepspeed_inference
model, _, _, _ = deepspeed.initialize(
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/__init__.py", line 124, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 320, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1162, in _configure_optimizer
self.optimizer = self._configure_fp16_optimizer(basic_optimizer)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1330, in _configure_fp16_optimizer
optimizer = FP16_UnfusedOptimizer(
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 111, in __init__
self.initialize_optimizer_states()
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/deepspeed/runtime/fp16/unfused_optimizer.py", line 443, in initialize_optimizer_states
self.optimizer.step()
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ksemin/miniconda3/envs/pl4_deep_new/lib/python3.8/site-packages/torch/optim/adamw.py", line 129, in step
state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 3; 47.54 GiB total capacity; 45.57 GiB already allocated; 79.81 MiB free; 45.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
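The figures in the OOM message above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers printed in that message. The helper `adamw_state_gib` and the 1.5-billion-parameter model are illustrative assumptions, not taken from the report. The helper relies on the fact that AdamW keeps two buffers per parameter (`exp_avg` and `exp_avg_sq`), and the traceback shows the second one being allocated when the error is raised.

```python
GIB = 1024 ** 3

# Numbers copied from the OOM message above.
total_gib = 47.54
allocated_gib = 45.57
reserved_gib = 45.95
requested_mib = 100.00
free_mib = 79.81

# Memory held by PyTorch's caching allocator but not used by live tensors.
cached_unused_gib = reserved_gib - allocated_gib


def adamw_state_gib(num_params: int, bytes_per_value: int = 4) -> float:
    """Rough size of AdamW state: two buffers (exp_avg, exp_avg_sq) per parameter.

    Assumes fp32 state (4 bytes per value) and ignores sharding across ranks.
    """
    return 2 * num_params * bytes_per_value / GIB


print(f"cached but unused: {cached_unused_gib:.2f} GiB")
print(f"request fits in free memory: {requested_mib <= free_mib}")
# Hypothetical model size, chosen only for illustration.
print(f"AdamW state for 1.5e9 params: {adamw_state_gib(1_500_000_000):.2f} GiB")
```

Reserved memory exceeds allocated memory by well under 1 GiB here, which suggests fragmentation is not the main problem: the device appears to be nearly full by the time the optimizer state is created.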
Important info
- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow): Trainer
- PyTorch Lightning Version (e.g., 1.5.0): 1.7.6
- PyTorch Version (e.g., 1.10): 1.11.0
- Python version (e.g., 3.9): 3.8.13
- OS (e.g., Linux): Ubuntu 20.04.3
- CUDA/cuDNN version: 11.7
- GPU models and configuration: NVIDIA RTX A6000 (48GB) [x4]
- How you installed Lightning (`conda`, `pip`, source): pip
DrMatters changed the title from "CUDA OOM when running trainer.validate() when initializing optimizer with deepspeed" to "CUDA OOM when running trainer.validate() with deepspeed at optimizer initialization (?)" on Sep 28, 2022
Bug description
I have encountered some strange behavior while using `Trainer` with a `deepspeed` strategy.
- `optimizer='adamw'`, `eval_batch_size=16`: `trainer.validate()` after trainer initialization works correctly.
- `strategy='deepspeed_stage_2'`, `devices=[0]`, `optimizer='adamw'`: `trainer.validate()` immediately after `trainer` initialization results in a CUDA OOM error. This happens at the optimizer initialization step, as the attached traceback shows.
- If I run `trainer.fit()` instead of `trainer.validate()`, it starts fitting the model without problems.
- If I run `trainer.test()` after finishing 1 epoch of `trainer.fit()`, the same error appears.
Does it make any sense to initialize the optimizer for validation/test? This looks like a legitimate bug to me.
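One possible workaround, based on the report that the script validates correctly on a single device without DeepSpeed, is to build a separate `Trainer` for evaluation. The sketch below is an assumption and not a confirmed fix. The helper names `trainer_kwargs` and `run_validation` are hypothetical, `accelerator="gpu"` is added for explicitness, and `model` and `dm` stand for the LightningModule and DataModule from the reproduction script.

```python
from typing import Any, Dict


def trainer_kwargs(stage: str) -> Dict[str, Any]:
    """Return Trainer arguments for a stage ("fit" or "eval")."""
    if stage == "fit":
        # Multi-GPU training with DeepSpeed ZeRO stage 2, as in the failing command.
        return {"accelerator": "gpu", "devices": [0, 1], "strategy": "deepspeed_stage_2"}
    if stage == "eval":
        # Single GPU, no DeepSpeed: the configuration reported to validate correctly.
        return {"accelerator": "gpu", "devices": [0]}
    raise ValueError(f"unknown stage: {stage!r}")


def run_validation(model, dm):
    """Validate with a separate Trainer that does not use DeepSpeed."""
    # Imported lazily so trainer_kwargs can be used without Lightning installed.
    import pytorch_lightning as pl

    eval_trainer = pl.Trainer(**trainer_kwargs("eval"))
    return eval_trainer.validate(model, datamodule=dm)
```

This avoids the DeepSpeed engine setup entirely during evaluation, at the cost of running validation on one GPU. If the weights come from a DeepSpeed checkpoint, they may need to be converted to a regular state dict before loading.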
How to reproduce the bug
The code is here:
https://gist.github.com/DrMatters/0630919cb82bee6035502213845733b3
This code combines the Lightning tutorial on fine-tuning with Hugging Face and the Hugging Face tutorial on fine-tuning a language model.
If you run this code with `python faulty_deepspeed_eval.py --trainer.devices [0]`, it validates and fits properly.
If you run this code with `python faulty_deepspeed_eval.py --trainer.devices [0,1] --trainer.strategy deepspeed_stage_2`, it fails before validation even starts.
More info
I mentioned this in the Lightning Community Slack. @rohitgr7