Enabling DeepSpeed with ZeRO crashes #1
Comments
FP16 has to be enabled when using ZeRO: https://www.deepspeed.ai/docs/config-json/#fp16-training-options
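For reference, the relevant part of the config would look roughly like this (a minimal sketch following the linked docs; the batch size is a placeholder, and ZeRO stage 2 is assumed from the discussion below):

```python
# Minimal DeepSpeed config sketch: ZeRO stage 1/2 expects a mixed-precision
# section, here fp16 with dynamic loss scaling ("loss_scale": 0).
ds_config = {
    "train_micro_batch_size_per_gpu": 8,  # placeholder value
    "fp16": {
        "enabled": True,
        "loss_scale": 0,  # 0 means dynamic loss scaling
    },
    "zero_optimization": {
        "stage": 2,
    },
}
```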
When running with ZeRO stage 2, this error occurs:
Traceback (most recent call last):
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
Runner()(config)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
self.task.run()
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
self.trainer.train(
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
self._backward(loss)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 712, in _backward
loss.backward()
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
Variable._execution_engine.run_backward(
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 788, in reduce_partition_and_remove_grads
self.reduce_ready_partitions_and_remove_grads(param, i)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1260, in reduce_ready_partitions_and_remove_grads
self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 833, in reduce_independent_p_g_buckets_and_remove_grads
new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'
(The same error occurs with 1 and 2 GPUs.)
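For what it's worth, this matches DeepSpeed's documented training loop: with ZeRO enabled, the backward pass has to go through the engine returned by deepspeed.initialize, because the ZeRO optimizer's backward sets up the gradient buckets (the ipg_* buffers referenced in the traceback) before autograd runs; calling loss.backward() directly skips that setup. A rough sketch, where model, dataloader, compute_loss, and ds_config are stand-ins for the OCP trainer's own objects:

```python
import deepspeed

# The engine wraps the model and optimizer; ds_config is the dict/JSON shown above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for batch in dataloader:
    loss = compute_loss(model_engine(batch), batch)  # placeholder loss computation
    model_engine.backward(loss)  # instead of loss.backward()
    model_engine.step()          # instead of optimizer.step()
```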
After changing loss.backward() to self.model.backward(loss) in _backward() (base_trainer.py), training starts but DeepSpeed repeatedly skips steps because of loss-scale overflow:
2022-06-30 12:18:17 (INFO): forcesx_mae: 6.02e-02, forcesy_mae: 7.78e-02, forcesz_mae: 7.72e-02, forces_mae: 7.17e-02, forces_cos: -1.03e-03, forces_magnitude: 1.44e-01, energy_mae: 1.93e+00, energy_force_within_threshold: 0.00e+00, loss: 1.41e+00, lr: 1.01e-04, epoch: 1.60e-03, step: 1.00e+01
[2022-06-30 12:18:17,244] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-06-30 12:18:17,395] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-06-30 12:18:17,504] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-06-30 12:18:17,633] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-06-30 12:18:17,740] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-06-30 12:18:17,847] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2022-06-30 12:18:17,977] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
[2022-06-30 12:18:18,109] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2022-06-30 12:18:18,189] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2022-06-30 12:18:18,291] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0
After a few batches it crashes with the following error:
Traceback (most recent call last):
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
Runner()(config)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
self.task.run()
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
self.trainer.train(
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
self._backward(loss)
File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 741, in _backward
self.optimizer.step()
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1667, in step
self._update_scale(self.overflow)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1922, in _update_scale
self.loss_scaler.update_scale(has_overflow)
File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 156, in update_scale
raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
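The log above is DeepSpeed's dynamic loss scaler halving the scale on every overflowing step until it hits the configured floor, at which point the run aborts. The fp16 section of the config docs linked above exposes the knobs involved; whether tuning them helps this model is untested, but a sketch of the documented options would be:

```python
# Hypothetical fp16 settings; key names follow the DeepSpeed config docs,
# values are illustrative only.
ds_config["fp16"] = {
    "enabled": True,
    "loss_scale": 0,            # keep dynamic loss scaling
    "initial_scale_power": 16,  # start the scale at 2**16
    "loss_scale_window": 1000,  # overflow-free steps before the scale is raised
    "hysteresis": 2,            # overflows tolerated before the scale is lowered
    "min_loss_scale": 1,        # the floor that triggers the exception above
}
```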
Set the tensor type to bf16 instead (needs hardware support).
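On hardware that supports it (e.g. Ampere GPUs), this would mean replacing the fp16 section with a bf16 one; bf16 has the same exponent range as fp32, so no loss scaling is involved. A sketch, assuming a reasonably recent DeepSpeed version:

```python
# Hypothetical bf16 variant of the config; values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder value
    "bf16": {"enabled": True},             # replaces the "fp16" section
    "zero_optimization": {"stage": 2},
}
```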
Overview of what works/doesn't work currently:
I tried getting DeepSpeed to run the s2ef task with the cgcnn model (using my latest commit on the deepspeed branch). Using the code as is (i.e. using the plain DeepSpeed trainer without any optimization) works. However, using the following DeepSpeed config file, where the ZeRO optimization is enabled, and running the job on one GPU for now, results in the following error: