Question about CUDA/NVCC setups #22

rqzhangberkeley · 2024-08-16T15:57:45Z

While trying to reproduce the results in the RLHFlow paper, I hit the following error when running get_rewards.py on 8 A100 GPUs.

[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=3, Timeout(ms)=1800000) ran for 1800005 milliseconds before timing out.

At first I suspected an incorrect CUDA or NVCC version, but fixing the NVCC / CUDA version did not help.

In the end, the problem was solved by adding the following lines to the shell file:
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
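
For anyone applying the same workaround, here is a minimal sketch of how the top of the shell file might look. The two NCCL_*_DISABLE variables are the ones from this issue; NCCL_DEBUG=INFO is an optional extra I am assuming is useful for diagnosis, and the launch step is left as a placeholder because the exact command depends on your setup.

```shell
#!/usr/bin/env bash
# Sketch only, not the repository's official script.

# Workaround from this issue: turn off two NCCL transports.
export NCCL_P2P_DISABLE=1   # disable direct GPU peer-to-peer transport
export NCCL_IB_DISABLE=1    # disable the InfiniBand/RoCE transport

# Optional (my assumption, not part of the original fix): make NCCL log
# its initialization so you can see which transports it actually selects.
export NCCL_DEBUG=INFO

# Confirm the variables are exported and will be inherited by child processes.
env | grep '^NCCL_'

# Placeholder: launch get_rewards.py here with the command you already use.
```

Note that disabling these transports forces NCCL onto slower communication paths, so this can reduce multi-GPU throughput; it is a workaround for the hang rather than a tuning recommendation.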

Suggestion: could we add a document that records the common errors people run into?

WeiXiongUST · 2024-08-17T01:17:00Z

Thanks for bringing this to our attention. Could you create a PR to initialize ./Online_RLHF/error_record.md?
