preprocess_dataset dist.barrier crashed with NCCL communicator Socket Timeout #360
Labels: solved (This problem has been already solved)
Comments
Increase the NCCL timeout threshold, or use dataset streaming.
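For readers wondering what raising the timeout might look like in practice, here is a minimal sketch. It assumes the process group is created with PyTorch's `torch.distributed.init_process_group`, which accepts a `timeout` argument; the helper names `ddp_timeout_seconds` and `init_process_group_with_timeout` and the 3-hour value are hypothetical, chosen only for illustration, and are not taken from this project.

```python
import os
from datetime import timedelta


def ddp_timeout_seconds(hours: float) -> int:
    """Convert a timeout in hours to whole seconds (hypothetical helper)."""
    return int(timedelta(hours=hours).total_seconds())


def init_process_group_with_timeout(hours: float = 3.0) -> bool:
    """Create the NCCL process group with a longer collective timeout.

    Returns True if the group was initialised here. Returns False when the
    script was not started by a distributed launcher such as torchrun,
    when torch is not installed, or when a group already exists.
    """
    # torchrun sets RANK and WORLD_SIZE; without them there is nothing to join.
    if "RANK" not in os.environ or "WORLD_SIZE" not in os.environ:
        return False
    try:
        import torch.distributed as dist
    except ImportError:
        return False
    if not dist.is_available() or dist.is_initialized():
        return False
    # The timeout applies to collectives such as dist.barrier(), so ranks
    # waiting for a long preprocessing step are not killed prematurely.
    dist.init_process_group(backend="nccl", timeout=timedelta(hours=hours))
    return True


if __name__ == "__main__":
    print("timeout in seconds:", ddp_timeout_seconds(3.0))
    print("process group initialised:", init_process_group_with_timeout(3.0))
```

If the training script exposes Hugging Face `TrainingArguments` (an assumption about the setup), the same effect is usually reached by passing `--ddp_timeout` with a value in seconds, for example `--ddp_timeout 10800`. The streaming alternative corresponds to loading data with `datasets.load_dataset(..., streaming=True)`, which returns an iterable dataset that is processed lazily, so there is no long up-front preprocessing pass for the other ranks to wait on.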
Thanks.
hiyouga added the solved label and removed the pending label on Aug 10, 2023.
How can I raise the NCCL timeout threshold?
Thanks!
Closed
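How do I use this, and how do I deal with the NCCL timeout? I used ddp_time and found that two run_tokenizre_data progress bars appear while the data is loading. The first progress bar gets through a large amount of data without trouble, but the second one does not.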
While training Qwen-7B (SFT) at commit 2780792, preprocess_dataset crashed with an NCCL communicator socket timeout. Could you please give some advice on this issue?