```sh
accelerate launch --config_file config.yaml src/train.py \
    --model_name_or_path /mnt/afs/qwen2 \
    --stage sft \
    --do_train true \
    --finetuning_type full \
    --dataset test \
    --template qwen \
    --cutoff_len 8192 \
    --overwrite_cache true \
    --preprocessing_num_workers 16 \
    --output_dir /mnt/afs2/test \
    --logging_steps 1 \
    --save_strategy epoch \
    --save_only_model true \
    --plot_loss true \
    --overwrite_output_dir true \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1.0e-5 \
    --num_train_epochs 4.0 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --fp16 true \
```
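The NCCL watchdog log further down shows `Timeout(ms)=600000`, i.e. the 10-minute collective timeout the process group in this run was created with. As a hedged sketch (not part of the original report): if `src/train.py` parses standard transformers `TrainingArguments`, the same timeout can be raised by appending `--ddp_timeout 7200` to the command above; the equivalent when building an `Accelerator` by hand looks roughly like this:

```python
# Hedged sketch, not from the report: raising the distributed timeout when
# constructing the Accelerator directly. InitProcessGroupKwargs.timeout is what
# ends up as the WorkNCCL Timeout(ms)=... value in the watchdog log; this run
# used 600000 ms (10 minutes).
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))  # exact value is a judgment call
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])
```

A longer timeout only papers over a stalled ALLREDUCE; running once with `NCCL_DEBUG=INFO` (and, if cross-node traffic is picking the wrong interface, `NCCL_SOCKET_IFNAME`) set on both machines is usually what actually isolates a hang like this.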
The contents of config.yaml are as follows:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /mnt/config_files/deepspeed_configs/profile_7b.json
  deepspeed_multinode_launcher: pdsh
  deepspeed_hostfile: /etc/mpi/hostfile  # path to the hostfile
  zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_training_function: main
main_process_port: 21112
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false
```
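For context (the hostfile itself is not included in the report), the `deepspeed_hostfile` referenced above normally lists one line per node with its GPU slot count; for this 2-machine, 16-process run it would look something like the following, with placeholder hostnames:

```
# /etc/mpi/hostfile -- illustrative only; the real hostnames are not shown in the issue
node-0 slots=8
node-1 slots=8
```

The pdsh launcher also assumes passwordless SSH from the launcher to every hostname listed here.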
The contents of the deepspeed_config_file are as follows:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}
```
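A small worked check (my arithmetic, not from the report) of how the `"auto"` batch fields resolve under this launch: DeepSpeed fills them from the Trainer arguments and the world size, so the effective global batch per optimizer step is already quite large here.

```python
# Hedged sketch: how DeepSpeed's "auto" batch fields resolve for this run.
micro_batch_per_gpu = 4   # --per_device_train_batch_size
grad_accum_steps = 16     # --gradient_accumulation_steps
world_size = 16           # num_processes in config.yaml (2 nodes x 8 GPUs)

# train_batch_size = micro batch per GPU x gradient accumulation x data-parallel ranks
train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)   # 1024 sequences (each up to cutoff_len 8192 tokens) per step
```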
accelerate is supported.
Reminder
System Info
1%| | 1/188 [03:01<9:25:06, 181.32s/it][rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 6] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5247e79897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5249152c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5249157a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5249158dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f5294c10b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f529609c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f5295e67133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5247e79897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5249152c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5249157a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5249158dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f5294c10b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f529609c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f5295e67133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5247e79897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7f5248ddc119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7f5294c10b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7f529609c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #4: clone + 0x43 (0x7f5295e67133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: [rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 1] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1364e2d897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f1366106c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f136610ba80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f136610cdcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f13b1bc4b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f13b3050609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f13b2e1b133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1364e2d897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f1366106c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f136610ba80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f136610cdcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f13b1bc4b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f13b3050609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f13b2e1b133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1364e2d897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7f1365d90119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7f13b1bc4b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7f13b3050609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #4: clone + 0x43 (0x7f13b2e1b133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: [rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600005 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank0]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 0] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank0]:[E ProcessGroupNCCL.cpp:577] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank0]:[E ProcessGroupNCCL.cpp:583] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600005 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc95f6a8897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc960981c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc960986a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc960987dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7fc9ac43fb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7fc9ad8cb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7fc9ad696133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600005 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc95f6a8897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc960981c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc960986a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc960987dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7fc9ac43fb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7fc9ad8cb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7fc9ad696133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc95f6a8897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7fc96060b119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7fc9ac43fb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7fc9ad8cb609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #4: clone + 0x43 (0x7fc9ad696133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: [rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 3] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b5ae35897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f7b5c10ec62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7b5c113a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7b5c114dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f7ba7bccb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f7ba9058609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f7ba8e23133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b5ae35897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f7b5c10ec62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f7b5c113a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f7b5c114dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f7ba7bccb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f7ba9058609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f7ba8e23133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7b5ae35897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7f7b5bd98119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7f7ba7bccb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7f7ba9058609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #4: clone + 0x43 (0x7f7ba8e23133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: [rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 7] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9c243cc897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9c256a5c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9c256aaa80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9c256abdcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f9c71163b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f9c725ef609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f9c723ba133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600056 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9c243cc897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9c256a5c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9c256aaa80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9c256abdcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f9c71163b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f9c725ef609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f9c723ba133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9c243cc897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7f9c2532f119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7f9c71163b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7f9c725ef609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #4: clone + 0x43 (0x7f9c723ba133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: [rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank5]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 5] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank5]:[E ProcessGroupNCCL.cpp:577] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank5]:[E ProcessGroupNCCL.cpp:583] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank5]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f07d36a0897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f07d4979c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f07d497ea80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f07d497fdcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f0820437b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f08218c3609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f082168e133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600097 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f07d36a0897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f07d4979c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f07d497ea80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f07d497fdcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f0820437b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f08218c3609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f082168e133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f07d36a0897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7f07d4603119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7f0820437b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7f08218c3609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #4: clone + 0x43 (0x7f082168e133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: [rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600083 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 2] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600083 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa1c7337897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fa1c8610c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa1c8615a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa1c8616dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7fa2140ceb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7fa21555a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7fa215325133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600083 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa1c7337897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fa1c8610c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fa1c8615a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fa1c8616dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7fa2140ceb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7fa21555a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7fa215325133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa1c7337897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7fa1c829a119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7fa2140ceb55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7fa21555a609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #4: clone + 0x43 (0x7fa215325133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: [rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank4]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 4] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank4]:[E ProcessGroupNCCL.cpp:577] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank4]:[E ProcessGroupNCCL.cpp:583] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank4]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f342f1c6897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f343049fc62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f34304a4a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f34304a5dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f347bf5db55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f347d3e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f347d1b4133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 4] Process group watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600043 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f342f1c6897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f343049fc62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f34304a4a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f34304a5dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f347bf5db55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f347d3e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f347d1b4133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f342f1c6897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7f3430129119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7f347bf5db55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7f347d3e9609 in /lib/x86_64-linux-gnu/libpthread.so.0)
Reproduction
accelerate launch --config_file config.yaml src/train.py --model_name_or_path /mnt/afs/qwen2 \
--stage sft \
--do_train true \
--finetuning_type full \
--dataset test \
--template qwen \
--cutoff_len 8192 \
--overwrite_cache true \
--preprocessing_num_workers 16 \
--output_dir /mnt/afs2/test \
--logging_steps 1 \
--save_strategy epoch \
--save_only_model true \
--plot_loss true \
--overwrite_output_dir true \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 16 \
--learning_rate 1.0e-5 \
--num_train_epochs 4.0 \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
--fp16 true
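For reference, the Timeout(ms)=600000 in the watchdog messages above is the collective timeout of the NCCL process group (10 minutes). Below is a minimal sketch of how that timeout could be raised while debugging; this is an illustration only, not part of the launch command above, and it assumes the process group is initialized through Accelerate. A longer timeout only delays the watchdog, it does not fix whichever rank is stuck in the ALLREDUCE.

# Illustrative sketch (assumption), not part of the reproduction above:
# raise the NCCL collective timeout so the watchdog does not abort after 10 minutes.
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])

When training goes through the HF Trainer stack instead, the equivalent knob is TrainingArguments.ddp_timeout (in seconds).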
The contents of config.yaml are as follows:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /mnt/config_files/deepspeed_configs/profile_7b.json
  deepspeed_multinode_launcher: pdsh
  deepspeed_hostfile: /etc/mpi/hostfile # path to the hostfile
  zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_training_function: main
main_process_port: 21112
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false
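As a sanity check on the topology implied by this file (an assumption spelled out here, not stated explicitly in the report): 16 processes over 2 machines means 8 processes per node, which matches the [rank0] through [rank7] watchdog messages all coming from a single launcher in the log above.

# Sketch of the process layout implied by config.yaml (assumption: one process per GPU).
num_machines = 2
num_processes = 16
procs_per_node = num_processes // num_machines
print(procs_per_node)  # 8 ranks per node; typically global ranks 0-7 live on machine_rank 0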
The contents of the deepspeed_config_file are as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}
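For clarity, the "auto" batch-size fields above are filled in from the launch arguments at runtime. A minimal worked example of the resulting values, under the assumption that nothing else overrides them:

# Sketch (assumption): how the "auto" batch-size fields resolve from the CLI arguments.
micro_batch_per_gpu = 4   # --per_device_train_batch_size
grad_accum_steps = 16     # --gradient_accumulation_steps
world_size = 16           # num_processes in config.yaml

train_batch_size = micro_batch_per_gpu * grad_accum_steps * world_size
print(train_batch_size)   # 1024 samples per optimizer step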
Expected behavior
No response
Others
No response