Running finetune on a 2080ti raises an error #91
Comments
OK, switching to the single-machine run avoids this problem: python finetune.py --data_path "./sample/merge_sample.json". But with only 11 GB, even after shrinking the batch-size parameters, the GPU keeps hitting... OOM.
Confirmed that training itself runs normally, but an exception is raised when saving the model... model.save_pretrained(OUTPUT_DIR) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.75 GiB total capacity; 9.56 GiB already allocated; 41.50 MiB free; 9.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
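The OOM message above already hints at one knob worth trying. A minimal sketch (the value 128 is illustrative, not a tested setting): set PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation, e.g. near the top of finetune.py:

import os

# Follow the hint in the OOM message: cap the allocator's block split size to
# reduce fragmentation; this must run before any tensor is placed on the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")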
Running torchrun on a single GPU is known to cause problems; torchrun is recommended only for multi-GPU setups. There are many similar issues in the tracker, such as this one.
Downgraded transformers from 4.28.1 back to 4.28.0.dev, with a similar result. Training uses 8.2 GB in total; it is only when saving the model, presumably while cloning the weights, that memory blows up. The output is as follows. File "finetune.py", line 278, in
I suspect an automatic precision conversion is performed when the model is saved; could that be the cause?
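One way to take the save step out of the equation, whatever the root cause turns out to be, is to stage the weights on the CPU before writing. A minimal sketch, assuming save_pretrained accepts a state_dict keyword (as recent transformers/peft versions do), with model and OUTPUT_DIR as in the snippet above:

import torch

# Copy each tensor to CPU one at a time, then hand the CPU copy to
# save_pretrained, so no clone or dtype cast has to be materialised on the
# 11 GB GPU during saving.
cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
model.save_pretrained(OUTPUT_DIR, state_dict=cpu_state)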
Take a look at this issue; it may be a problem with the bitsandbytes version.
Thanks, that was indeed it; the problem is solved. (chinesevicuna) ai@ai-2080ti:
If you run into a problem and need our help, please describe it from the following angles so that we can understand or reproduce your error (learning how to ask a good question not only helps us understand you, it is also a self-check process):
1. Which script did you use, and with what command?
bash finetune.sh
2. What are your parameters (script parameters, command-line parameters)?
TOT_CUDA="0"
CUDAs=(${TOT_CUDA//,/ })
CUDA_NUM=${#CUDAs[@]}
PORT="12345"
DATA_PATH="./sample/merge_sample.json" #"../dataset/instruction/guanaco_non_chat_mini_52K-utf8.json" #"./sample/merge_sample.json"
OUTPUT_PATH="lora-Vicuna"
MODEL_PATH="/data/ftp/models/llama/7b"
lora_checkpoint="./lora-Vicuna/checkpoint-11600"
TEST_SIZE=1
CUDA_VISIBLE_DEVICES=${TOT_CUDA} torchrun --nproc_per_node=$CUDA_NUM --master_port=$PORT finetune.py \
    --data_path $DATA_PATH \
    --output_path $OUTPUT_PATH \
    --model_path $MODEL_PATH \
    --eval_steps 200 \
    --save_steps 200 \
    --test_size $TEST_SIZE
3. Did you modify our code?
No
4. Which dataset did you use?
./sample/merge_sample.json
If everything above was left unchanged, you can just say "I used such-and-such script and command to run such-and-such task, and all other parameters and data are identical to yours", which lets us understand your problem on the same footing.
Then you can describe your problem from the environment angle; some of these points may already be covered in the 'related problems and solutions' section of our readme:
1. Which operating system?
centos7
2. Which GPU, and how many?
One 2080ti (11 GB)
3. Python version
3.8.16
4. Versions of the Python libraries
transformers 4.28.1
Everything else was installed from requirements.txt
You can also describe your problem from the runtime angle:
1. What is the error message, and which code raised it (you can send us the complete error output)?
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations. Parameter at index 127 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging
Running TORCH_DISTRIBUTED_DEBUG=DETAIL bash finetune.sh gives more detail: the parameter update of the last layer is abnormal, as follows (a workaround sketch follows after item 2 below):
Parameter at index 127 with name base_model.model.model.layers.31.self_attn.v_proj.lora_B.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
2. Are the GPU and CPU working normally?
Normal
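For reference, a hedged sketch (not the repository's configuration, just an assumption about common setups) of two possible workarounds for the "marked as ready twice" error above when LoRA layers plus gradient checkpointing run under DDP: disabling the unused-parameter search, and the _set_static_graph() call suggested by the error text itself:

from transformers import TrainingArguments

# Option 1: disable the unused-parameter search; with PEFT/LoRA models this
# search is unnecessary and is a common trigger for double-ready errors.
training_args = TrainingArguments(
    output_dir="lora-Vicuna",
    ddp_find_unused_parameters=False,
)

# Option 2 (from the error text), applied to the DDP-wrapped model when the
# graph does not change between iterations:
# ddp_model._set_static_graph()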
You can also check the existing issues, or the 'related problems and solutions' information we have collected, to see whether a similar problem has already been reported.
Of course, this is only a guide to asking questions; there is no need to follow every item in it.