Error from finetune_continue.sh and incorrect output from the model after finetune #45

BUPTccy · 2023-04-07T09:26:48Z

Thank you for sharing this project. I ran into some problems I could not solve and would appreciate your help.

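Running bash finetune_continue.sh for the fine-tuning task on the original merge_sample.json fails with the error below. I set TOT_CUDA="0"; all other parameters and the data are unchanged from the original.
Running bash finetune.sh works normally.
The error is clearly caused by adding --resume_from_checkpoint $lora_checkpoint \, where lora_checkpoint="./lora-Vicuna/checkpoint-11600".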

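Our goal is to fine-tune from the pretrained checkpoint to a specialized domain. However, after using finetune.sh directly, the model answers Chinese questions with endlessly repeated English strings that contain tokens such as {/begin} and {/item}. Is this behavior expected? Would using finetune_continue improve it?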

Environment
1. OS: CentOS 7.6
2. GPU: a single 3090
3. Python 3.8
4. CUDA 11.3

The error message is as follows:

Traceback (most recent call last):
  File "finetune.py", line 271, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/usr/local/miniconda3/envs/chat/lib/python3.8/site-packages/transformers/trainer.py", line 1659, in train   
    return inner_training_loop(
  File "/usr/local/miniconda3/envs/chat/lib/python3.8/site-packages/transformers/trainer.py", line 1926, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/miniconda3/envs/chat/lib/python3.8/site-packages/transformers/trainer.py", line 2706, in training_step
    self.scaler.scale(loss).backward()
  File "/usr/local/miniconda3/envs/chat/lib/python3.8/site-packages/torch/_tensor.py", line 487, in backward        
    torch.autograd.backward(
  File "/usr/local/miniconda3/envs/chat/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/miniconda3/envs/chat/lib/python3.8/site-packages/torch/autograd/function.py", line 274, in apply 
    return user_fn(self, *args)
  File "/usr/local/miniconda3/envs/chat/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/miniconda3/envs/chat/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 
1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across 
multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not
 change during training loop.
2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap 
the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes
 multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try
 to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 127 has been marked as ready twice. This means that multiple autograd engine  hooks have fired for this 
particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to 
either INFO or DETAIL to print parameter names for further debugging.


Facico · 2023-04-07T15:42:11Z

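@BUPTccy On a single GPU, we recommend not using the torchrun launcher from our scripts. Instead, select the GPU and run the script directly with python, for example "CUDA_VISIBLE_DEVICES=0 python finetune.py --data_path merge.json --test_size 2000". Your error is the same as the one in this issue.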
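For anyone who wants to script the single-GPU launch suggested above, here is a minimal sketch. The helper name `single_gpu_command` is hypothetical and not part of the repository; the flags are only the ones quoted in the comment above. It builds the environment and argument list, and the actual launch is left commented out because it needs finetune.py and the data file in the working directory.

```python
import os
import shlex


def single_gpu_command(gpu_id=0, data_path="merge.json", test_size=2000):
    """Return (env, argv) for running finetune.py on one GPU without torchrun.

    Hypothetical helper: only the flags quoted in this thread are used.
    """
    env = dict(os.environ)
    # Expose exactly one GPU to the process, as in the quoted command.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    argv = [
        "python", "finetune.py",
        "--data_path", data_path,
        "--test_size", str(test_size),
    ]
    return env, argv


env, argv = single_gpu_command()
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], shlex.join(argv))

# To actually launch (requires finetune.py in the current directory):
# import subprocess
# subprocess.run(argv, env=env, check=True)
```

Because the process is started with plain python rather than torchrun, no distributed launcher variables are set for it, which is the point of the recommendation.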

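Q: Using finetune.sh directly makes the model answer Chinese questions with endlessly repeated English strings containing {/begin}, {/item}, and so on?
This depends first on your dataset. Our base model is llama, which is multilingual but was trained mostly on English text. If you fine-tune directly on your own corpus, first make sure the data is high quality and correctly formatted (for example, the instruction format used in our merge.json). The dataset also needs to be large enough (on the order of several hundred thousand examples), and training needs to run long enough (our first model version used 700k examples for 3 epochs).
If resources allow, you can try full fine-tuning without lora, increase lora_r, or make more modules trainable with lora, for example TARGET_MODULES=["q_proj","k_proj","v_proj","o_proj"].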
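To give a rough sense of what raising lora_r and widening TARGET_MODULES changes, the sketch below counts LoRA parameters. It rests on assumptions not stated in this thread: LLaMA-7B dimensions (32 layers, hidden size 4096, square attention projections) and a baseline of r=8 on q_proj and v_proj. `lora_param_count` is a hypothetical helper, not a function from the repository or from peft.

```python
def lora_param_count(r, target_modules, num_layers=32, hidden_size=4096):
    """Count LoRA parameters for square attention projections.

    Each adapted linear layer (d_in x d_out) gains two low-rank matrices,
    A (r x d_in) and B (d_out x r), so it adds r * (d_in + d_out) parameters.
    Defaults are assumed LLaMA-7B dimensions.
    """
    per_module = r * (hidden_size + hidden_size)
    return num_layers * len(target_modules) * per_module


# Assumed baseline: r=8 on the query and value projections only.
baseline = lora_param_count(8, ["q_proj", "v_proj"])
# Suggested direction: larger r and all four attention projections.
wider = lora_param_count(16, ["q_proj", "k_proj", "v_proj", "o_proj"])

print(f"baseline: {baseline:,} LoRA parameters")
print(f"wider:    {wider:,} LoRA parameters ({wider // baseline}x)")
```

Doubling r and doubling the number of adapted modules each double the trainable parameter count, so memory use and training time should be expected to grow accordingly.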

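Q: Would using finetune_continue improve it?
It should. You can continue training from our checkpoint-11600, which already has basic Chinese ability. We have tried training a vertical domain (medical Q&A) on top of it, and the result was noticeably better than without the medical Q&A training. We will publish a detailed guide later.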
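When continuing from a saved checkpoint, it can help to pick the checkpoint directory programmatically instead of hard-coding a step number. The following is a minimal sketch that assumes checkpoints are stored as checkpoint-<step> subdirectories, like the ./lora-Vicuna/checkpoint-11600 path mentioned earlier in this thread. `latest_checkpoint` is a hypothetical helper, and the demo runs against a temporary directory rather than real training output.

```python
import os
import re
import tempfile


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Only consider directories whose name is exactly checkpoint-<digits>.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo with empty placeholder directories standing in for real checkpoints.
with tempfile.TemporaryDirectory() as root:
    for step in (200, 11600, 4000):
        os.makedirs(os.path.join(root, f"checkpoint-{step}"))
    found = latest_checkpoint(root)
    print(os.path.basename(found))
```

The returned path is the kind of value that would be passed to --resume_from_checkpoint.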

BUPTccy · 2023-04-10T01:20:10Z

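Thank you for the answer. The problem is solved on a single GPU, and I look forward to your follow-up guide.
One more question, if I may: if I use finetune_continue directly to train toward a specialized domain, what order of magnitude does the corpus need to reach at minimum, and which lora parameters can be adjusted?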

Facico · 2023-04-10T11:35:59Z

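You can refer to our medical Q&A example, medical. Because that example continues from our existing checkpoint, the lora settings do not really need to change. The data size depends on your task requirements and data quality, but we suggest not using too much. With a very large dataset, fine-tuning from scratch may actually work better.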

BUPTccy · 2023-04-11T01:09:21Z

Understood, thank you. Feel free to close this issue at any time.

Facico closed this as completed Apr 11, 2023
