Error when running finetune on a 2080 Ti #91

Closed
grantchenhuarong opened this issue Apr 19, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@grantchenhuarong

If you run into a problem and need our help, please describe it from the following angles so that we can understand or reproduce your error (learning how to ask a good question not only helps us understand you, it is also a self-check):
1. Which script did you use, and with what command?
bash finetune.sh
2. What are your parameters (script parameters, command-line parameters)?
TOT_CUDA="0"
CUDAs=(${TOT_CUDA//,/ })
CUDA_NUM=${#CUDAs[@]}
PORT="12345"

DATA_PATH="./sample/merge_sample.json" #"../dataset/instruction/guanaco_non_chat_mini_52K-utf8.json" #"./sample/merge_sample.json"
OUTPUT_PATH="lora-Vicuna"
MODEL_PATH="/data/ftp/models/llama/7b"
lora_checkpoint="./lora-Vicuna/checkpoint-11600"
TEST_SIZE=1

CUDA_VISIBLE_DEVICES=${TOT_CUDA} torchrun --nproc_per_node=$CUDA_NUM --master_port=$PORT finetune.py \
--data_path $DATA_PATH \
--output_path $OUTPUT_PATH \
--model_path $MODEL_PATH \
--eval_steps 200 \
--save_steps 200 \
--test_size $TEST_SIZE

3. Did you modify our code?

4. Which dataset did you use?
./sample/merge_sample.json

If everything above is unchanged from the original, you can simply say "I used such-and-such script and command to run such-and-such task, and all other parameters and data are the same as yours", which makes it easier for us to understand your problem on the same footing.

Then you can describe your problem from the environment angle; some of these points may already be covered in the "related problems and solutions" section of the readme:
1. Which operating system?
centos7
2. Which GPU, and how many cards?
One 2080 Ti (11 GB)
3. Which Python version?
3.8.16
4. Which versions of the Python libraries?
transformers 4.28.1
Everything else was installed from requirements.txt

You can also describe your problem from the runtime angle:
1. What is the error message, and which code raised it? (You can send us the complete error output.)
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 127 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

Running TORCH_DISTRIBUTED_DEBUG=DETAIL bash finetune.sh gives a more detailed message: the parameter update that misbehaves is in the last layer, as shown below (a hedged sketch of the _set_static_graph() workaround the error suggests follows after this list):
Parameter at index 127 with name base_model.model.model.layers.31.self_attn.v_proj.lora_B.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

2. Are the GPU and CPU working normally?
Yes, both are working normally.
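
The error message itself offers _set_static_graph() as a workaround for the case where the set of parameters receiving gradients does not change between iterations (typical for LoRA fine-tuning, where gradient checkpointing can otherwise fire the autograd hooks twice for the same parameter). As a hedged illustration only: the repo's finetune.py lets the HuggingFace Trainer do the DDP wrapping internally, so the sketch below shows the call on a manually wrapped model, with build_model() as a hypothetical stand-in for loading the 8-bit LLaMA base model and attaching the LoRA adapters.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; the script still has to
# initialize the process group itself.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# build_model() is hypothetical; it stands in for constructing the
# 8-bit LLaMA + LoRA model used by finetune.py.
model = build_model().cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

# The workaround named in the error: only valid when the autograd graph
# (which parameters get gradients) is identical on every iteration.
ddp_model._set_static_graph()

In this thread the error was ultimately sidestepped by not launching single-GPU training through torchrun at all (see the next comment), so the sketch above is only context for what the error text proposes.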

You can also look through the existing issues, or the "related problems and solutions" notes we have collected, to see whether a similar problem has already been reported.

Of course, this is only a guide for asking questions; you do not have to follow every item in it.

@grantchenhuarong

OK, switching to the single-machine (plain python) invocation makes this problem go away:

python finetune.py --data_path "./sample/merge_sample.json" \
--output_path "lora-Vicuna" \
--model_path "/data/ftp/models/llama/7b" \
--eval_steps 200 \
--save_steps 200 \
--test_size 1

The problem is that with only 11 GB, even after shrinking the batch-size parameters, the GPU keeps running out of memory... OOM.
Has anyone actually managed to run finetune on a 2080 Ti?

@grantchenhuarong

Confirmed: the training step itself runs fine, but an exception is raised when saving the model...

model.save_pretrained(OUTPUT_DIR)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.75 GiB total capacity; 9.56 GiB already allocated; 41.50 MiB free; 9.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
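
The OOM message itself points at allocator fragmentation (reserved memory far larger than allocated) and suggests max_split_size_mb. Below is a minimal sketch of trying that, assuming the variable is set before the first CUDA allocation; the value 64 is an arbitrary guess, and as the rest of this thread shows, fragmentation was not the real cause here.

import os

# PYTORCH_CUDA_ALLOC_CONF must be set before the caching allocator is first
# used, so set it before importing torch / touching the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64"  # 64 is a guess, not a recommendation

import torch

# ... build the model, train, and call model.save_pretrained() as usual ...

The same thing can be done by exporting the variable in finetune.sh before the python/torchrun line.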

@Facico

Facico commented Apr 19, 2023

Running torchrun on a single GPU does cause problems; we recommend torchrun only for multi-GPU training. There are many similar issues in the tracker, for example this one.
Is the second problem that training works but saving runs out of memory? Try downgrading your transformers version, or try installing this pinned commit: pip install git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560

@grantchenhuarong

I downgraded transformers from 4.28.1 to 4.28.0.dev, and the result is the same.

Training uses about 8.2 GB in total; it is when saving the model, apparently while cloning the weights, that memory blows up. The output is shown below.

File "finetune.py", line 278, in
model.save_pretrained(OUTPUT_DIR)
File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/peft/peft_model.py", line 103, in save_pretrained
output_state_dict = get_peft_model_state_dict(self, kwargs.get("state_dict", None))
File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/peft/utils/save_and_load.py", line 31, in get_peft_model_state_dict
state_dict = model.state_dict()
File "finetune.py", line 268, in
lambda self, *_, **__: get_peft_model_state_dict(self, old_state_dict())
File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict
module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars)
[Previous line repeated 4 more times]
File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1815, in state_dict
self._save_to_state_dict(destination, prefix, keep_vars)
File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 262, in _save_to_state_dict
weight_clone = self.weight.data.clone()
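
The traceback shows the allocation failing inside bitsandbytes' _save_to_state_dict, which clones every frozen 8-bit weight on the GPU while the patched state_dict() walks all modules. Purely as a hedged illustration (this is not the fix the thread ends up with, and the resulting file is not guaranteed to load via PeftModel.from_pretrained, since the key names keep their full prefixes), one could dump only the LoRA tensors without going through state_dict() at all; model below stands for the PEFT-wrapped model from finetune.py.

import torch

# named_parameters() does not call _save_to_state_dict, so the frozen 8-bit
# base weights are never cloned on the GPU; only the small LoRA tensors are
# copied to the CPU.
lora_state = {
    name: param.detach().cpu()
    for name, param in model.named_parameters()
    if "lora_" in name
}

# "lora_only.bin" is an arbitrary file name for this sketch.
torch.save(lora_state, "lora_only.bin")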

@grantchenhuarong

I suspect that some automatic precision conversion is being done while saving the model?

@Facico

Facico commented Apr 20, 2023

Take a look at this issue; it is probably a bitsandbytes version problem.

@grantchenhuarong

Thanks, that was indeed it; the problem is solved.

(chinesevicuna) ai@ai-2080ti:~/src/Chinese-Vicuna$ pip list|grep bitsandbytes
bitsandbytes 0.38.1
(chinesevicuna) ai@ai-2080ti:~/src/Chinese-Vicuna$ pip install bitsandbytes==0.37.2
Collecting bitsandbytes==0.37.2
Using cached bitsandbytes-0.37.2-py3-none-any.whl (84.2 MB)
Installing collected packages: bitsandbytes
Attempting uninstall: bitsandbytes
Found existing installation: bitsandbytes 0.38.1
Uninstalling bitsandbytes-0.38.1:
Successfully uninstalled bitsandbytes-0.38.1
Successfully installed bitsandbytes-0.37.2

Facico added the bug (Something isn't working) label Apr 20, 2023