
Train using qlora exits with error #15

Closed
suclogger opened this issue Jun 6, 2023 · 3 comments
Labels
solved This problem has been already solved

Comments

suclogger commented Jun 6, 2023

Training script as follows:

CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
    --model_name_or_path /xx/model/model_weights/Ziya-LLaMA-13B \
    --do_train \
    --dataset xx \
    --finetuning_type lora \
    --output_dir /xx/output \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-3 \
    --num_train_epochs 10.0 \
    --resume_lora_training False \
    --plot_loss \
    --fp16 \
    --quantization_bit 4

Error message as follows:

Traceback (most recent call last):
  File "/xxx/src/train_sft.py", line 97, in <module>
    main()
  File "/xxx/src/train_sft.py", line 69, in main
    train_result = trainer.train()
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/trainer.py", line 1638, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/trainer.py", line 1923, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/trainer.py", line 2733, in training_step
    loss = self.compute_loss(model, inputs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/trainer.py", line 2758, in compute_loss
    outputs = model(**inputs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/utils/operations.py", line 553, in forward
    return model_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/utils/operations.py", line 541, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 570, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 566, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 194, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/envs/chatglm_etuning/lib/python3.10/site-packages/peft/tuners/lora.py", line 565, in forward
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (724x5120 and 1x13107200)
  0%|                                                                                                                | 0/30 [00:00<?, ?it/s]
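
For context on the shapes: mat1 (724x5120) is the batch of hidden states, while mat2 (1x13107200) is consistent with the packed 4-bit weight buffer rather than the dequantized (5120x5120) q_proj matrix; the peft LoRA layer in the last frame passes that packed buffer straight to F.linear. A rough sanity check of the arithmetic (the two-values-per-byte packing is an assumption about bitsandbytes' 4-bit storage):

# 13B LLaMA: q_proj is a square 5120x5120 projection. If bitsandbytes
# packs two 4-bit values per byte, the flat quantized buffer holds:
python -c "print(5120 * 5120 // 2)"   # 13107200 -- matches the mat2 dim above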
suclogger changed the title from "Train exits with error" to "Train using qlora exits with error" on Jun 6, 2023
suclogger (Author) commented:

Related issues:

artidoro/qlora#100
artidoro/qlora#12

suclogger (Author) commented:

Upgrading peft solved the issue:

pip install -U git+https://github.com/huggingface/peft.git
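
After upgrading, a quick way to confirm the environment actually picked up the new build (the exact version printed will vary; this just checks which install is imported):

python -c "import peft; print(peft.__version__, peft.__file__)"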

hiyouga added the solved label on Jun 7, 2023
starphantom666 commented Jun 8, 2023

> Upgrading peft solved the issue:
>
> pip install -U git+https://github.com/huggingface/peft.git

Why am I still getting the error... I already upgraded.

Update: it works now. It turns out you have to uninstall the old version first and then reinstall.
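
A sketch of that uninstall-then-reinstall sequence (assuming the same source branch as the fix above):

# Remove the stale peft install first, then pull a fresh build from main:
pip uninstall -y peft
pip install git+https://github.com/huggingface/peft.git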
