When I tried to reproduce the results in the RLHFlow paper, I ran into an error. It occurs when I run get_rewards.py on 8 A100 GPUs.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=3, Timeout(ms)=1800000) ran for 1800005 milliseconds before timing out.
Initially, I thought the cause was an incorrect CUDA or NVCC version. However, fixing the NVCC / CUDA version did not help.
Finally, this was solved by adding the following lines to the shell file:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
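For reference, the workaround might look like the sketch below when placed at the top of a launch script. The two `NCCL_*_DISABLE` variables are the ones reported above. `NCCL_DEBUG=INFO` is an optional addition (a standard NCCL variable, not part of the original fix) that makes NCCL log which transports it selects. The launch line is a hypothetical placeholder, since the exact command used to start get_rewards.py is not given in this issue.

```shell
#!/usr/bin/env bash
# Disable NCCL's direct GPU peer-to-peer and InfiniBand transports.
# Both must be exported before the distributed processes start.
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

# Optional (assumption, not part of the reported fix): verbose NCCL logging,
# useful for confirming which transport is actually used.
export NCCL_DEBUG=INFO

echo "NCCL_P2P_DISABLE=${NCCL_P2P_DISABLE} NCCL_IB_DISABLE=${NCCL_IB_DISABLE}"

# Hypothetical launch line; replace with the command your script already uses.
# accelerate launch get_rewards.py
```

Disabling these transports makes NCCL fall back to slower communication paths, so this is likely best treated as a workaround for machines where the default transports hang rather than as a general default.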
Suggestion: could we add a document that records the common errors we have encountered?
Thanks for bringing this to our attention. Could you create a PR to initialize ./Online_RLHF/error_record.md?