Single-node multi-GPU training keeps reporting timeout errors, any advice on how to fix this? #53

Closed
xqmmy opened this issue Apr 10, 2023 · 11 comments
Labels: bug (Something isn't working), good first issue (Good for newcomers)

Comments

xqmmy commented Apr 10, 2023

[E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
localhost:10108:60893 [4] NCCL INFO [Service thread] Connection closed by localRank 4
Traceback (most recent call last):
localhost:10108:60278 [0] NCCL INFO comm 0x9ea4350 rank 4 nranks 7 cudaDev 4 busId a1000 - Abort COMPLETE
File "finetune.py", line 271, in
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
File "/root/miniconda3/envs/vicuna/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1800595 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801815 milliseconds before timing out.
localhost:10101:60894 [3] NCCL INFO [Service thread] Connection closed by localRank 3
localhost:10101:59694 [0] NCCL INFO comm 0xa8d5900 rank 3 nranks 7 cudaDev 3 busId 81000 - Abort COMPLETE
Traceback (most recent call last):
File "finetune.py", line 271, in
trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
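
For reference, the log above shows the very first ALLGATHER (SeqNum=1) hitting the default 30-minute NCCL timeout. If you only want to rule out a genuinely slow first collective (rather than a broken one), the timeout can be raised where the process group is created. A minimal sketch, assuming torch.distributed is initialized by hand (transformers' Trainer normally sets up the process group itself):

import datetime
import torch.distributed as dist

# The NCCL default timeout is 30 minutes (1800000 ms, exactly what the log reports).
# Raising it only helps if the collective would eventually complete.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)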

xqmmy (Author) commented Apr 10, 2023

I've tried a lot of approaches, but none of them really worked.

Facico (Owner) commented Apr 10, 2023

Do your CUDA and PyTorch versions actually match? You can check on the PyTorch website.

xqmmy (Author) commented Apr 10, 2023

Do your CUDA and PyTorch versions actually match? You can check on the PyTorch website.

Yes, they match.

Facico (Owner) commented Apr 10, 2023

Please follow the issue guidelines when asking a question. As above, you should tell me your PyTorch and CUDA versions; from an error message alone it is hard for me to guess your exact setup.

xqmmy (Author) commented Apr 10, 2023

Please follow the issue guidelines when asking a question. As above, you should tell me your PyTorch and CUDA versions; from an error message alone it is hard for me to guess your exact setup.

OK, sorry I didn't describe it clearly.
torch 1.13.1, CUDA 11.7.
I'm using the full merged data and the original multi-GPU finetune script.
The GPUs are A4000s, 7 of them.

Facico (Owner) commented Apr 10, 2023

The PyTorch install command matching your setup should be "pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117". Check that you haven't accidentally installed the CPU-only build.

Can you run successfully on a single GPU? For multi-GPU, try two-GPU and four-GPU configurations and see whether the problem still occurs; there is a similar issue here.

You can add export NCCL_DEBUG=INFO when running to get more detailed error output.
Or you can see whether this helps.
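
To make the single-GPU / two-GPU / four-GPU comparison quick, a bare collective test can be run outside finetune.py entirely. A minimal sketch (nccl_sanity_check.py is a hypothetical file name, not from this repo), launched with e.g. NCCL_DEBUG=INFO torchrun --nproc_per_node=2 nccl_sanity_check.py:

# nccl_sanity_check.py: a hedged sketch that exercises all_gather (the op that
# timed out above) without any of the finetuning code.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes a small tensor; if this hangs, the problem is in
    # NCCL/topology, not in the training script.
    x = torch.full((4,), float(rank), device="cuda")
    out = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(out, x)
    print(f"rank {rank}/{world_size} gathered: {[int(t[0].item()) for t in out]}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()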

zhoujx4 commented Apr 10, 2023

I have the same problem as you: timeouts on a single node with multiple GPUs.
I changed the BIOS settings and disabled ACS, which solved it. You could give that a try.

zhoujx4 commented Apr 10, 2023

For details you can refer to this: https://www.modb.pro/db/617940
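
For what it's worth, when PCIe ACS blocks peer-to-peer traffic between GPUs, nvidia-smi topo -m shows the node topology, and PyTorch can report whether direct peer access is available. A small diagnostic sketch, assuming a single node with all GPUs visible:

import torch

# If peer access is broadly unavailable between GPUs on one node, P2P may be
# blocked (for example by PCIe ACS), consistent with the BIOS fix above.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j and not torch.cuda.can_device_access_peer(i, j):
            print(f"GPU {i} -> GPU {j}: peer access NOT available")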

Facico added the bug (Something isn't working) label Apr 11, 2023
xqmmy (Author) commented Apr 13, 2023

After switching to 3090s the problem went away; not sure what the cause was.

xqmmy (Author) commented Apr 13, 2023

I have the same problem as you: timeouts on a single node with multiple GPUs. I changed the BIOS settings and disabled ACS, which solved it. You could give that a try.

Thanks for the answer, I'll give it a try.

xqmmy closed this as completed Apr 14, 2023
Facico added the good first issue (Good for newcomers) label Apr 21, 2023
thelongestusernameofall commented

The solution in NVIDIA/nccl#426 works.

export NCCL_IB_GID_INDEX=3 solved my problem.
