Finetuning Bloom model in step 3 failed #451
Actor model: Bloom-1.1b
Reward model: Bloom-560m
Finetuning cmd:

bash training_scripts/single_node/run_bloom_1.1b.sh /DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/bloom-1.1b/ /DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/reward_model/bloom-560m

Part of training log:

However, changing the model to OPT works well.

Comments
Same error.
Same error.
Same error. Modifying the …
Similar, but not the same error.
What should I do to fix this error?
Any update on this issue?
Same error for actor model bloomz-7b1 and reward model opt-1.3b.
The NotImplementedError is raised by the softmax function when config.fp16 is False. Perhaps you changed fp16 to bf16 in ds_utils.py following some other issue (same as me).
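For context, here is a minimal sketch of the precision block in a DeepSpeed-Chat-style training config; the helper name and values are illustrative assumptions, not the project's exact code:

```python
# Hedged sketch of the precision settings in a DeepSpeed config dict,
# loosely modeled on what ds_utils.py builds in DeepSpeed-Chat.
# Helper name and values are illustrative assumptions.
def get_train_ds_config(enable_fp16=True):
    return {
        "train_batch_size": 8,  # illustrative value
        # The fused softmax path mentioned above reportedly requires fp16;
        # swapping this block for bf16 is the modification that triggers
        # the NotImplementedError in step 3.
        "fp16": {"enabled": enable_fp16},
        # "bf16": {"enabled": True},  # the problematic modification
        "zero_optimization": {"stage": 2},
    }
```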
Not working at all. The padding_side for OPT is right, while for BLOOMZ it is left. I tried passing in two different tokenizers, but it caused a lot of conflicts when making the experience.
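To illustrate the mismatch described above, a quick check of the two tokenizer defaults (the model IDs are assumptions taken from this thread; defaults can vary across transformers versions):

```python
# Hedged sketch: inspect the default padding sides of the two tokenizer
# families mixed in step 3. Model IDs are assumptions from this thread.
from transformers import AutoTokenizer

actor_tok = AutoTokenizer.from_pretrained("bigscience/bloomz-1b7")
critic_tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

print("bloomz padding_side:", actor_tok.padding_side)  # typically "left"
print("opt padding_side:", critic_tok.padding_side)    # typically "right"

# Forcing one side is a possible workaround, but as the comment notes it
# can conflict with how step 3 assembles the experience batch.
actor_tok.padding_side = "right"
```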
Similar issue on the DeepSpeed side: microsoft/DeepSpeed#3518
Same error with actor model bloom-560m and critic model opt-350m. Any update?
Hi @cokuehuang, can you please try running this again and include the following PR as well? I've been able to get this running with:

DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning$ bash training_scripts/bloom/single_node/run_bloom.sh bigscience/bloomz-1b7 ../step2_reward_model_finetuning/bloom_7b_output/ 3 3 output_bloom7b_actor_hf_critic_step2

Thanks,
Hi @cokuehuang, closing the issue for now since a solution was provided. If any issues are still encountered, feel free to open another issue.