Enabling DeepSpeed with ZeRO crashes #1

Closed
dan1elherbst opened this issue Jun 30, 2022 · 6 comments

@dan1elherbst

I tried getting DeepSpeed running for the s2ef task with the cgcnn model (using my latest commit on the deepspeed branch). Using the code as is (i.e. the plain DeepSpeed trainer without any optimization) works.
However, using the following DeepSpeed config file:

{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005
    }
  },
  "fp16": {
    "enabled": false
  },
  "zero_optimization": true
}

where the ZeRO optimization is enabled, and running the job on one GPU for now as follows:

(ocp-models) [dherbst@kanon ocp]$ python -u -m torch.distributed.launch --nproc_per_node=1 main.py --distributed --num-gpus 1 --mode train --config-yml configs/s2ef/200k/cgcnn/cgcnn.yml --deepspeed-mode deepspeed-optimizer --deepspeed-config configs/s2ef/200k/cgcnn/ds_config.json

results in the following error:

Traceback (most recent call last):
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
    Runner()(config)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
    self.task.run()
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
    self._backward(loss)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 741, in _backward
    self.optimizer.step()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1660, in step
    self.check_overflow()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1919, in check_overflow
    self._check_overflow(partition_gradients)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1820, in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1839, in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1832, in has_overflow_partitioned_grads_serial
    for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0
dan1elherbst changed the title from "Enabling DeepSpeed with ZeRO or fp16 crashes" to "Enabling DeepSpeed with ZeRO crashes" on Jun 30, 2022
@d-stoll

d-stoll commented Jun 30, 2022

FP16 has to be enabled when using ZeRO: https://www.deepspeed.ai/docs/config-json/#fp16-training-options
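
For reference, a minimal sketch of how the config above might look with fp16 enabled alongside ZeRO (the explicit stage value is illustrative; see the linked docs for the full option set):

{
  "train_batch_size": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0005
    }
  },
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 1
  }
}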

@dan1elherbst

dan1elherbst commented Jun 30, 2022

When running with ZeRO stage 2, this error occurs:

Traceback (most recent call last):
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
    Runner()(config)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
    self.task.run()
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
    self._backward(loss)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 712, in _backward
    loss.backward()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/torch/autograd/__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 788, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1260, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 833, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

(same error with 1 and 2 GPUs)

@d-stoll

d-stoll commented Jun 30, 2022

microsoft/DeepSpeed#610

@dan1elherbst

After changing

loss.backward()

to

self.model.backward(loss)

in base_trainer._backward() (see the sketch at the end of this comment),
training runs with ZeRO optimization stage 2; however, overflows occur:

2022-06-30 12:18:17 (INFO): forcesx_mae: 6.02e-02, forcesy_mae: 7.78e-02, forcesz_mae: 7.72e-02, forces_mae: 7.17e-02, forces_cos: -1.03e-03, forces_magnitude: 1.44e-01, energy_mae: 1.93e+00, energy_force_within_threshold: 0.00e+00, loss: 1.41e+00, lr: 1.01e-04, epoch: 1.60e-03, step: 1.00e+01
[2022-06-30 12:18:17,244] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-06-30 12:18:17,395] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-06-30 12:18:17,504] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-06-30 12:18:17,633] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-06-30 12:18:17,740] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-06-30 12:18:17,847] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2022-06-30 12:18:17,977] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
[2022-06-30 12:18:18,109] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2022-06-30 12:18:18,189] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2022-06-30 12:18:18,291] [INFO] [stage_1_and_2.py:1671:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8192.0, reducing to 4096.0

and it crashes after a few batches with the following error:

Traceback (most recent call last):
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 128, in <module>
    Runner()(config)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/main.py", line 68, in __call__
    self.task.run()
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/tasks/task.py", line 49, in run
    self.trainer.train(
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/forces_trainer.py", line 333, in train
    self._backward(loss)
  File "/home/dherbst/tum-di-lab/OCP-Baseline/ocp/ocpmodels/trainers/base_trainer.py", line 741, in _backward
    self.optimizer.step()
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1667, in step
    self._update_scale(self.overflow)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1922, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/home/dherbst/miniconda3/envs/ocp-models/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 156, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
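
For clarity, a rough sketch of the guarded backward call described above (the self.deepspeed_enabled flag is illustrative, not the actual trainer attribute):

def _backward(self, loss):
    if self.deepspeed_enabled:
        # The DeepSpeed engine owns loss scaling and gradient partitioning,
        # so the backward pass has to go through the engine wrapper.
        self.model.backward(loss)
    else:
        loss.backward()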

@d-stoll

d-stoll commented Jun 30, 2022

Set tensor type to bf16 (needs hardware support)
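
A minimal sketch of the relevant config change (assuming a DeepSpeed version with bf16 support and bf16-capable hardware; fp16 and bf16 are mutually exclusive):

{
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": false
  },
  "zero_optimization": {
    "stage": 2
  }
}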

@dan1elherbst

Overview of what works/doesn't work currently:

  • DeepSpeed fp16 works after manually setting the type of the network input tensor to half precision (see the sketch after this list). We will further investigate whether this can also be achieved automatically, and will make the current implementation prettier/more modular and incorporate it into other models as well.

  • DeepSpeed ZeRO stage 1 (which was the default when I ran things previously) still doesn’t work and produces the same KeyError.

  • DeepSpeed ZeRO stage 2 works after a slight modification of the backward pass call. However, with fp16 optimization activated, there were overflows. Changing the datatype from half to bfloat16 got rid of these overflows, as bfloat16 has a wider dynamic range than standard half precision.

  • For now, we will not pursue ZeRO stage 1 and will instead focus on trying out various configurations of ZeRO stage 2. ZeRO stage 1 is probably not as widely used, and our error has already been reported multiple times without any helpful/working solution.
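
As a rough illustration of the input-tensor conversion mentioned in the first bullet (shown over a plain dict of tensors; the real batch is a PyG data object, and the exact hook point in the trainer is an assumption):

import torch

def cast_floating_inputs(batch, dtype=torch.float16):
    # Illustrative helper, not the actual OCP implementation: cast only
    # floating-point tensors so integer fields (e.g. edge indices) stay intact.
    return {
        key: value.to(dtype) if torch.is_tensor(value) and value.is_floating_point() else value
        for key, value in batch.items()
    }

# With bf16 enabled, the same helper would be called with dtype=torch.bfloat16.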
