Enabling DeepSpeed with ZeRO crashes #1

Closed
dan1elherbst opened this issue Jun 30, 2022 · 6 comments

@dan1elherbst

I tried getting DeepSpeed running for the s2ef task with the cgcnn model (using my latest commit on the deepspeed branch). Using the code as is (i.e. the plain DeepSpeed trainer without any optimization) works.
However, using the following DeepSpeed config file:

{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005
    }
  },
  "fp16": {
    "enabled": false
  },
  "zero_optimization": true
}

where the ZeRO optimization is enabled, and running the job on one GPU for now as follows:

(ocp-models) [dherbst@kanon ocp]$ python -u -m torch.distributed.launch --nproc_per_node=1 main.py --distributed --num-gpus 1 --mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml --deepspeed-mode deepspeed-optimizer --deepspeed-config configs/s2ef/200k/cgcnn/ds_config.json

results in the following error:

Traceback (most recent call last):
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
    Runner()(config)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
    self.task.run()
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
    self._backward(loss)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 741, in _backward
    self.optimizer.step()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1660, in step
    self.check_overflow()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1919, in check_overflow
    self._check_overflow(partition_gradients)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1820, in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1839, in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1832, in has_overflow_partitioned_grads_serial
    for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0
dan1elherbst changed the title from "Enabling DeepSpeed with ZeRO or fp16 crashes" to "Enabling DeepSpeed with ZeRO crashes" on Jun 30, 2022
@d-stoll

d-stoll commented Jun 30, 2022

FP16 has to be enabled when using ZeRO: https://www.deepspeed.ai/docs/config-json/#fp16-training-options
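
For reference, a minimal sketch of how the config above might look with fp16 enabled alongside ZeRO (the explicit stage value is illustrative; see the linked docs for the full option set):

{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  }
}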

@dan1elherbst

dan1elherbst commented Jun 30, 2022

When running with ZeRO stage 2, this error occurs:

Traceback (most recent call last):
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
    Runner()(config)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
    self.task.run()
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
    self._backward(loss)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 712, in _backward
    loss.backward()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 788, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1260, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 833, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

(same error with 1 and 2 GPUs)

@d-stoll

d-stoll commented Jun 30, 2022

microsoft/DeepSpeed#610

@dan1elherbst

After changing

loss.backward()

to

self.model.backward(loss)

in base_trainer._backward() (see the sketch at the end of this comment),
training runs with ZeRO optimization stage 2; however, overflows occur:

2022-06-30 12:18:17 (INFO): forcesx_mae: 6.02e-02, forcesy_mae: 7.78e-02, forcesz_mae: 7.72e-02, forces_mae: 7.17e-02, forces_cos: -1.03e-03, forces_magnitude: 1.44e-01, energy_mae: 1.93e+00, energy_force_within_threshold: 0.00e+00, loss: 1.41e+00, lr: 1.01e-04, epoch: 1.60e-03, step: 1.00e+01
[2022-06-30 12:18:17,244] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-06-30 12:18:17,395] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-06-30 12:18:17,504] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-06-30 12:18:17,633] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-06-30 12:18:17,740] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-06-30 12:18:17,847] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2022-06-30 12:18:17,977] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
[2022-06-30 12:18:18,109] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2022-06-30 12:18:18,189] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2022-06-30 12:18:18,291] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0

and it crashes after a few batches with the following error:

Traceback (most recent call last):
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
    Runner()(config)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
    self.task.run()
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
    self._backward(loss)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 741, in _backward
    self.optimizer.step()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1667, in step
    self._update_scale(self.overflow)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1922, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 156, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
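
For clarity, a rough sketch of the guarded backward call described above (the self.deepspeed_enabled flag is illustrative, not the actual trainer attribute):

def _backward(self, loss):
    if self.deepspeed_enabled:
        # The DeepSpeed engine owns loss scaling and gradient partitioning,
        # so the backward pass has to go through the engine wrapper.
        self.model.backward(loss)
    else:
        loss.backward()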

@d-stoll

d-stoll commented Jun 30, 2022

Set tensor type to bf16 (needs hardware support)
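
A minimal sketch of the relevant config change (assuming a DeepSpeed version with bf16 support and bf16-capable hardware; fp16 and bf16 are mutually exclusive):

{
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 2
  }
}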

@dan1elherbst

Overview of what works/doesn't work currently:

  • DeepSpeed fp16 works after manually setting the type of the network input tensor to half precision (see the sketch after this list). We will further investigate whether this can also be achieved automatically, and will make the current implementation prettier/more modular and incorporate it into other models as well.

  • DeepSpeed ZeRO stage 1 (which was the default when I ran things previously) still doesn’t work and produces the same KeyError.

  • DeepSpeed ZeRO stage 2 works after a slight modification of the backward pass call. However, with fp16 optimization activated, there were overflows. Changing the datatype from half to bfloat16 got rid of these overflows, as bfloat16 has a wider dynamic range than standard half precision.

  • For now, we will not pursue ZeRO stage 1 and will instead focus on trying out various configurations of ZeRO stage 2. ZeRO stage 1 is probably not as widely used, and our error has already been reported multiple times without any helpful/working solution.
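
As a rough illustration of the input-tensor conversion mentioned in the first bullet (shown over a plain dict of tensors; the real batch is a PyG data object, and the exact hook point in the trainer is an assumption):

import torch

def cast_floating_inputs(batch, dtype=torch.float16):
    # Illustrative helper, not the actual OCP implementation: cast only
    # floating-point tensors so integer fields (e.g. edge indices) stay intact.
    return {
        key: value.to(dtype) if torch.is_tensor(value) and value.is_floating_point() else value
        for key, value in batch.items()
    }

# With bf16 enabled, the same helper would be called with dtype=torch.bfloat16.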
