Single-machine multi-GPU training keeps hitting a timeout error. Does anyone know how to fix this? #53
Comments
I've tried many approaches and none of them really work.
Do your CUDA and PyTorch versions actually match? You can check on the PyTorch website.
Yes, they match.
Please follow the issue guidelines when asking a question. As with the above, you should tell me your PyTorch and CUDA versions; from an error message alone it's hard for me to guess your exact situation.
OK, sorry I didn't describe it clearly.
The PyTorch install command matching your setup should be "pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117"; check whether you accidentally installed the CPU build. Does training succeed on a single GPU? For multi-GPU, try two-GPU and four-GPU configurations and see whether the problem persists; there is a similar issue here. You can also run with export NCCL_DEBUG=INFO to get more detailed error output.
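A minimal sanity-check sketch (plain PyTorch, independent of finetune.py) covering the two suggestions above: confirming that the cu117 GPU build is actually installed and turning on verbose NCCL logging:

```python
# Quick check that the GPU build of PyTorch is installed and that all local
# GPUs are visible before launching multi-GPU training.
import os
import torch

print(torch.__version__)          # expect something like "1.13.1+cu117", not "+cpu"
print(torch.version.cuda)         # expect "11.7" for the cu117 build
print(torch.cuda.is_available())  # must be True
print(torch.cuda.device_count())  # should match the number of GPUs you plan to use

# Verbose NCCL logging; this only takes effect if set before torch.distributed
# creates the NCCL process group.
os.environ["NCCL_DEBUG"] = "INFO"
```

NCCL_DEBUG=INFO can equally be exported in the shell before launching, which is usually more convenient for multi-process runs.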
I have the same problem as you: a timeout with single-machine multi-GPU.
For details, see: https://www.modb.pro/db/617940
After switching to 3090s there's no problem; I don't know why.
Thanks for the explanation, I'll try again.
The solution in NVIDIA/nccl#426 works: export NCCL_IB_GID_INDEX=3 solved my problem.
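For anyone applying the same workaround from Python rather than the shell, a minimal sketch; the value 3 comes from the comment above, and the variable must be set before the NCCL process group is initialized:

```python
# Same workaround as `export NCCL_IB_GID_INDEX=3` (see NVIDIA/nccl#426), set
# from Python. It only takes effect if assigned before torch.distributed
# creates the NCCL process group.
import os
os.environ["NCCL_IB_GID_INDEX"] = "3"
```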
[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
localhost:10108:60893 [4] NCCL INFO [Service thread] Connection closed by localRank 4
localhost:10108:60278 [0] NCCL INFO comm 0x9ea4350 rank 4 nranks 7 cudaDev 4 busId a1000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
Traceback (most recent call last):
  File "finetune.py", line 271, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
  File "/root/miniconda3/envs/vicuna/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train

[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801815 milliseconds before timing out.
localhost:10101:60894 [3] NCCL INFO [Service thread] Connection closed by localRank 3
localhost:10101:59694 [0] NCCL INFO comm 0xa8d5900 rank 3 nranks 7 cudaDev 3 busId 81000 - Abort COMPLETE
Traceback (most recent call last):
  File "finetune.py", line 271, in <module>
    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
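The Timeout(ms)=1800000 in the watchdog message is the default 30-minute collective timeout, and a timeout on the very first ALLGATHER (SeqNum=1) usually points to a communication problem (which the NCCL_IB_GID_INDEX fix above addresses) rather than genuinely slow work. If you nevertheless want to rule out a legitimately slow first step, the limit can be raised where the process group is created; a minimal sketch, assuming you control that call (the launcher or Trainer may create it for you in finetune.py):

```python
# Sketch only: raise the NCCL collective timeout above the 30-minute default
# seen in the log (Timeout(ms)=1800000).
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(backend="nccl", timeout=timedelta(hours=2))
```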