[BUG] Running step 3 with bloomz + LoRA + ZeRO-3 raises RuntimeError(f"{param.ds_summary()} already in registry") #3528

Closed
liuaiting opened this issue May 12, 2023 · 6 comments
Labels: bug (Something isn't working), deepspeed-chat (Related to DeepSpeed-Chat)

@liuaiting

Describe the bug
When running step 3 with ZeRO stage 3 and LoRA enabled for both the actor and critic models, an error is raised.
It seems to indicate that bloomz does not support ZeRO-3 + LoRA.

Log output

Traceback (most recent call last):
File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 630, in <module>
  main()
File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 477, in main
  out = trainer.generate_experience(batch_prompt['prompt'],
File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 108, in generate_experience
  output = self.actor_model(seq, attention_mask=attention_mask)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
  return forward_call(*input, **kwargs)
File "DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
  ret_val = func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/engine.py", line 1695, in forward
  loss = self.module(*inputs, **kwargs)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
  result = forward_call(*input, **kwargs)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
  transformer_outputs = self.transformer(
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
  result = forward_call(*input, **kwargs)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 730, in forward
  inputs_embeds = self.word_embeddings(input_ids)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1197, in _call_impl
  result = hook(self, input)
File "DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
  ret_val = func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 366, in _pre_forward_module_hook
  self.pre_sub_module_forward_function(module)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
  return func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 478, in pre_sub_module_forward_function
  param_coordinator.fetch_sub_module(sub_module)
File "DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
  ret_val = func(*args, **kwargs)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
  return func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 249, in fetch_sub_module
  self.__all_gather_params(params_to_fetch)
File "DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
  ret_val = func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 383, in __all_gather_params
  self.__inflight_param_registry[param] = handle
File "DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 51, in __setitem__
  raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 1027604480, 'ds_numel': 1027604480, 'shape': (250880, 4096), 'ds_shape': (250880, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set()} already in registry
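
The last frame shows the error coming from a `__setitem__` guard on the in-flight parameter registry. Below is a minimal sketch of that kind of guard, assuming a `UserDict`-based registry. `FakeParam` and `InflightRegistry` are hypothetical stand-ins for illustration, not DeepSpeed's actual classes:

```python
from collections import UserDict


class FakeParam:
    """Hypothetical stand-in for a ZeRO-3 partitioned parameter."""

    def __init__(self, param_id, shape):
        self.param_id = param_id
        self.shape = shape

    def ds_summary(self):
        # Mirrors only a few of the fields seen in the log above.
        return {"id": self.param_id, "status": "INFLIGHT", "shape": self.shape}


class InflightRegistry(UserDict):
    """Maps a parameter to its pending all-gather handle, rejecting duplicates."""

    def __setitem__(self, param, handle):
        if param in self.data:
            raise RuntimeError(f"{param.ds_summary()} already in registry")
        self.data[param] = handle


registry = InflightRegistry()
embedding = FakeParam(0, (250880, 4096))
registry[embedding] = "handle-1"  # first fetch is recorded

message = None
try:
    registry[embedding] = "handle-2"  # second fetch before the first is released
except RuntimeError as err:
    message = str(err)
    print(message)
```

In this sketch the error fires whenever the same parameter is fetched a second time while its first handle is still registered, which would be consistent with the 'INFLIGHT' status in the reported summary.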

To Reproduce
The run.sh is:

sh training_scripts/single_node/run_bloom_1b7.sh \
  bigscience/bloomz-1b7 \
  bigscience/bloomz-1b7 \
  3 \
  3 \
  output_single_node_bloomz1b7

The run_bloom_1b7.sh is:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=${3:-2}
CRITIC_ZERO_STAGE=${4:-2}
OUTPUT=${5:-'./output'}
NUM_GPUS=${6:-8}
NUM_NODES=${7:-1}
mkdir -p $OUTPUT

Num_Padding_at_Beginning=0 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6
hostname='localhost'

export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export TOKENIZERS_PARALLELISM=false

deepspeed --master_port 25303 --master_addr ${hostname} --num_gpus ${NUM_GPUS} --num_nodes ${NUM_NODES} --hostfile 'deepspeed_hostfile' main.py \
  --data_path Dahoas/rm-static \
  --data_split 2,4,4 \
  --actor_model_name_or_path $ACTOR_MODEL_PATH \
  --critic_model_name_or_path $CRITIC_MODEL_PATH \
  --num_padding_at_beginning $Num_Padding_at_Beginning \
  --per_device_train_batch_size 1 \
  --per_device_mini_train_batch_size 1 \
  --generation_batch_numbers 1 \
  --ppo_epochs 1 \
  --max_answer_seq_len 256 \
  --max_prompt_seq_len 256 \
  --actor_learning_rate ${Actor_Lr} \
  --critic_learning_rate ${Critic_Lr} \
  --disable_actor_dropout \
  --num_train_epochs 1 \
  --lr_scheduler_type cosine \
  --gradient_accumulation_steps 1 \
  --num_warmup_steps 100 \
  --deepspeed --seed 1234 \
  --inference_tp_size 1 \
  --tp_gather_partition_size ${NUM_GPUS} \
  --actor_zero_stage $ACTOR_ZERO_STAGE \
  --critic_zero_stage $CRITIC_ZERO_STAGE \
  --actor_lora_dim 128 \
  --actor_lora_module_name query_key_value \
  --critic_lora_dim 128 \
  --critic_lora_module_name query_key_value \
  --only_optimize_lora \
  --output_dir $OUTPUT |& tee $OUTPUT/training.log

Expected behavior
Step 3 training should run with ZeRO-3 + LoRA.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/venv/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/venv/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3+194053b, 194053b, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

Screenshots
None. The error is shown in the log output above.

System info (please complete the following information):

  • OS: Linux version 4.18.0-240.el8.x86_64. CentOS Linux 7 (Core).
  • GPU count and types: one machine with 8x A100
  • Python version: 3.9.13

Docker context
no

Additional context
no

@liuaiting added the bug and deepspeed-chat labels on May 12, 2023

@HeyangQin self-assigned this on May 16, 2023
@HeyangQin
Contributor

Hello @liuaiting. Thank you for reporting this issue to us. One of our recent fixes #3462 may have already fixed this error. Could you update your deepspeed and give it another try?

@liuaiting
Author

After updating DeepSpeed, it runs successfully. Thank you very much for your reply.
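
When retrying after an upgrade, it can help to confirm which DeepSpeed version the environment actually resolves. A minimal sketch using only the standard library (the helper name `installed_version` is made up for this example; the thread does not say which release first contains fix #3462, so no minimum version is checked here):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


ds_version = installed_version("deepspeed")
print("deepspeed:", ds_version if ds_version else "not installed")
```

Run this inside the same virtual environment that launches `main.py`, since `ds_report` and the training job can otherwise pick up different installs.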

@HeyangQin
Contributor

@liuaiting Glad to hear the error is fixed. Closing the issue.

@jiahuanluo

jiahuanluo commented Sep 3, 2023

@HeyangQin I still encounter this with DeepSpeed 0.10.3 when running step 3 with llama2 + LoRA + ZeRO-3 on V100 32G GPUs.

File "anaconda3.9/envs/dschat/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
  raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 262144000, 'ds_numel': 262144000, 'shape': (64000, 4096), 'ds_shape': (64000, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536000])} already in registry
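
The numbers in the two reported summaries can be sanity-checked with a little arithmetic. The sketch below copies the values from the logs above. Reading the `ds_tensor.shape` ratio as the number of partitions is an inference from the numbers, not something stated in the thread:

```python
from math import prod

# Values copied from the two RuntimeError summaries in this thread.
reports = [
    {"model": "bloomz-1b7", "numel": 1027604480, "shape": (250880, 4096), "ds_tensor_numel": None},
    {"model": "llama2", "numel": 262144000, "shape": (64000, 4096), "ds_tensor_numel": 65536000},
]

shard_counts = {}
for report in reports:
    # numel should equal the product of the full (unpartitioned) shape.
    assert prod(report["shape"]) == report["numel"]
    if report["ds_tensor_numel"] is not None:
        # Assumption: ds_tensor holds this rank's equal share of the parameter.
        shard_counts[report["model"]] = report["numel"] // report["ds_tensor_numel"]

print(shard_counts)
```

In both reports the parameter has 'id': 0 and a (vocabulary, hidden size) shape. In the first traceback it is fetched for `self.word_embeddings`, so the failing parameter there is the word-embedding matrix, and the second report's shape suggests the same.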

@omeruth

omeruth commented Sep 15, 2023

Even though my local copy of the repository is up to date, I am still encountering this error. The log is below. The last line of the log shows the command I ran with all its options.


Epoch: 0 | Step: 75 | PPO Epoch: 1 | Actor Loss: 0.05474853515625 | Critic Loss: 0.0821533203125 | Unsupervised Loss: 0.0
End-to-End => Latency: 76.57s, TFLOPs: 0.72, Samples/sec: 0.10, Time/seq 9.57s, Batch Size: 8, Total Seq. Length: 512
Generation => Latency: 73.24s, Per-token Latency 286.11 ms, TFLOPs: 0.18, BW: 93.15 GB/sec, Answer Seq. Length: 256
Training => Latency: 3.33s, TFLOPs: 12.65
Actor Model Parameters => 13.325 B, Critic Model Parameters => 0.331 B
Average reward score: -1.51953125

Invalidate trace cache @ step 55440: expected module 0, but got module 13
Traceback (most recent call last):
Traceback (most recent call last):
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in
Traceback (most recent call last):
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in
Traceback (most recent call last):
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in
Traceback (most recent call last):
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in
Traceback (most recent call last):
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in
Traceback (most recent call last):
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in
main()
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
main()
main() File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main

File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
main()
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
out = trainer.generate_experience(batch_prompt['prompt'],
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience
out = trainer.generate_experience(batch_prompt['prompt'],
out = trainer.generate_experience(batch_prompt['prompt'], File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience
output = self.actor_model(seq, attention_mask=attention_mask)
out = trainer.generate_experience(batch_prompt['prompt'],
main() File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience
main()
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

output = self.actor_model(seq, attention_mask=attention_mask) File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main

main()output = self.actor_model(seq, attention_mask=attention_mask) File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

  File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
output = self.actor_model(seq, attention_mask=attention_mask)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
out = trainer.generate_experience(batch_prompt['prompt'],out = trainer.generate_experience(batch_prompt['prompt'],

  File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience

File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience
out = trainer.generate_experience(batch_prompt['prompt'],
return forward_call(*args, **kwargs) File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience

          File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
output = self.actor_model(seq, attention_mask=attention_mask)    output = self.actor_model(seq, attention_mask=attention_mask)return forward_call(*args, **kwargs)    return forward_call(*args, **kwargs)

return forward_call(*args, **kwargs)

output = self.actor_model(seq, attention_mask=attention_mask)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
ret_val = func(*args, **kwargs)ret_val = func(*args, **kwargs)ret_val = func(*args, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
Traceback (most recent call last):
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in
return forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
return forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
return forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
ret_val = func(*args, **kwargs)loss = self.module(*inputs, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
loss = self.module(*inputs, **kwargs)loss = self.module(*inputs, **kwargs)loss = self.module(*inputs, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
loss = self.module(*inputs, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
loss = self.module(*inputs, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
loss = self.module(*inputs, **kwargs) result = forward_call(*args, **kwargs)
result = forward_call(*args, **kwargs)result = forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl

File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
outputs = self.model.decoder(
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
result = forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
main()
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
outputs = self.model.decoder( outputs = self.model.decoder(outputs = self.model.decoder(
result = forward_call(*args, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
out = trainer.generate_experience(batch_prompt['prompt'],result = forward_call(*args, **kwargs)outputs = self.model.decoder(outputs = self.model.decoder(

File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
output = self.actor_model(seq, attention_mask=attention_mask)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
result = forward_call(*args, **kwargs) outputs = self.model.decoder(result = forward_call(*args, **kwargs)
result = forward_call(*args, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
pos_embeds = self.embed_positions(attention_mask, past_key_values_length)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward

File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
result = forward_call(*args, **kwargs)result = forward_call(*args, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
return forward_call(*args, **kwargs)pos_embeds = self.embed_positions(attention_mask, past_key_values_length)

pos_embeds = self.embed_positions(attention_mask, past_key_values_length) File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

pos_embeds = self.embed_positions(attention_mask, past_key_values_length) File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl

pos_embeds = self.embed_positions(attention_mask, past_key_values_length)    result = forward_call(*args, **kwargs)ret_val = func(*args, **kwargs)  File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl

pos_embeds = self.embed_positions(attention_mask, past_key_values_length)
result = hook(self, args)

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl

File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
result = hook(self, args)
result = hook(self, args) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

ret_val = func(*args, **kwargs) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook

ret_val = func(*args, **kwargs)
pos_embeds = self.embed_positions(attention_mask, past_key_values_length)ret_val = func(*args, **kwargs) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
result = hook(self, args)

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
result = hook(self, args) self.pre_sub_module_forward_function(module)
result = hook(self, args)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
self.pre_sub_module_forward_function(module)ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
self.pre_sub_module_forward_function(module)

  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
ret_val = func(*args, **kwargs) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
loss = self.module(*inputs, **kwargs)

result = hook(self, args)ret_val = func(*args, **kwargs) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl

param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
self.pre_sub_module_forward_function(module) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook

param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
      File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
self.pre_sub_module_forward_function(module)ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
self.pre_sub_module_forward_function(module) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook

ret_val = func(*args, **kwargs)  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
result = forward_call(*args, **kwargs)
    ret_val = func(*args, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
self.pre_sub_module_forward_function(module) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
return func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
return func(*args, **kwargs)
return func(*args, **kwargs) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
outputs = self.model.decoder( File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context

ret_val = func(*args, **kwargs)param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)  File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl

ret_val = func(*args, **kwargs)

  File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs) self.__all_gather_params(params_to_fetch, forward)
self.__all_gather_params(params_to_fetch, forward)
self.__all_gather_params(params_to_fetch, forward) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module

  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
return func(*args, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)return func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module

result = forward_call(*args, **kwargs)ret_val = func(*args, **kwargs) File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ret_val = func(*args, **kwargs) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
ret_val = func(*args, **kwargs)

self.__all_gather_params(params_to_fetch, forward)

File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
return func(*args, **kwargs)self.__all_gather_params(params_to_fetch, forward)

  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
      File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

ret_val = func(*args, **kwargs)self.__all_gather_params(params_to_fetch, forward)self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)

self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)pos_embeds = self.embed_positions(attention_mask, past_key_values_length)  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_
ret_val = func(*args, **kwargs)

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_
self.__all_gather_params(params_to_fetch, forward) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
ret_val = func(*args, **kwargs)

self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights) File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
self.__inflight_param_registry[param] = handle File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params

self.__inflight_param_registry[param] = handle
      File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_

self.__inflight_param_registry[param] = handle
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__

  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
result = hook(self, args)raise RuntimeError(f"{param.ds_summary()} already in registry") File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_
self.__inflight_param_registry[param] = handleraise RuntimeError(f"{param.ds_summary()} already in registry")

raise RuntimeError(f"{param.ds_summary()} already in registry")

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn

    RuntimeError  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
RuntimeErrorself.__inflight_param_registry[param] = handleself.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)RuntimeError    : self.__inflight_param_registry[param] = handle: 

: ret_val = func(*args, **kwargs){'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry

{'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_
raise RuntimeError(f"{param.ds_summary()} already in registry"){'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry

File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__

  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook

raise RuntimeError(f"{param.ds_summary()} already in registry")RuntimeError
: raise RuntimeError(f"{param.ds_summary()} already in registry")self.__inflight_param_registry[param] = handle{'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry
RuntimeError

: RuntimeError  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__

self.pre_sub_module_forward_function(module){'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry:

{'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry  File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function

raise RuntimeError(f"{param.ds_summary()} already in registry")

RuntimeError: {'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry
param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
self.__all_gather_params(params_to_fetch, forward)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_
self.__inflight_param_registry[param] = handle
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry
[2023-09-15 10:36:50,504] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907797
[2023-09-15 10:36:50,546] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907798
[2023-09-15 10:36:50,547] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907799
[2023-09-15 10:36:51,115] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907800
[2023-09-15 10:36:51,443] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907801
[2023-09-15 10:36:52,095] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907802
[2023-09-15 10:36:52,138] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907803
[2023-09-15 10:36:52,178] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907804
[2023-09-15 10:36:52,218] [ERROR] [launch.py:321:sigkill_handler] ['/home/user1/venv/ds/bin/python3', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b', '--critic_model_name_or_path', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m', '--num_padding_at_beginning', '1', '--per_device_generation_batch_size', '1', '--per_device_training_batch_size', '1', '--generation_batches', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '256', '--max_prompt_seq_len', '256', '--actor_learning_rate', '5e-4', '--critic_learning_rate', '5e-6', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--offload_reference_model', '--gradient_accumulation_steps', '1', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--inference_tp_size', '2', '--actor_zero_stage', '3', '--critic_zero_stage', '3', '--disable_actor_dropout', '--actor_lora_dim', '128', '--actor_lora_module_name', 'decoder.layers.', '--output_dir', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/13b'] exits with return code = 1
