Multi-node multi-GPU training: is training with accelerate launch currently unsupported? #5558

Closed
1 task done
Hansen06 opened this issue Sep 27, 2024 · 1 comment
Labels
solved This problem has been already solved

Comments

@Hansen06

Hansen06 commented Sep 27, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

1%| | 1/188 [03:01<9:25:06, 181.32s/it][rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
mpi-rc1ou1aw-launcher: [rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 6] Timeout at NCCL work: 635, last enqueued NCCL work: 651, last completed NCCL work: 634.
mpi-rc1ou1aw-launcher: [rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
mpi-rc1ou1aw-launcher: [rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
mpi-rc1ou1aw-launcher: [rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5247e79897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5249152c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5249157a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5249158dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f5294c10b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f529609c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f5295e67133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: terminate called after throwing an instance of 'c10::DistBackendError'
mpi-rc1ou1aw-launcher: what(): [PG 1 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=635, OpType=ALLREDUCE, NumelIn=576156928, NumelOut=576156928, Timeout(ms)=600000) ran for 600052 milliseconds before timing out.
mpi-rc1ou1aw-launcher: Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5247e79897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5249152c62 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f5249157a80 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5249158dcc in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #4: + 0xd3b55 (0x7f5294c10b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #5: + 0x8609 (0x7f529609c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #6: clone + 0x43 (0x7f5295e67133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
mpi-rc1ou1aw-launcher: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5247e79897 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libc10.so)
mpi-rc1ou1aw-launcher: frame #1: + 0xe32119 (0x7f5248ddc119 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
mpi-rc1ou1aw-launcher: frame #2: + 0xd3b55 (0x7f5294c10b55 in /mnt/afs2/yaohaishen/my_conda_env/llm_mul/bin/../lib/libstdc++.so.6)
mpi-rc1ou1aw-launcher: frame #3: + 0x8609 (0x7f529609c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
mpi-rc1ou1aw-launcher: frame #4: clone + 0x43 (0x7f5295e67133 in /lib/x86_64-linux-gnu/libc.so.6)
mpi-rc1ou1aw-launcher:
mpi-rc1ou1aw-launcher: [The same watchdog timeout (SeqNum=635, ALLREDUCE, Timeout(ms)=600000) and an identical stack trace are then reported by ranks 1, 0, 3, 7, 5, 2, and 4; the final trace for rank 4 is truncated in the original log.]
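
The traces show the default 10-minute (600000 ms) collective timeout expiring on the same ALLREDUCE (SeqNum=635) on every rank shown, which typically points to one rank stalling or to broken inter-node NCCL communication rather than to a problem in the training script itself. A minimal debugging sketch, assuming the standard NCCL environment variables apply to this setup and using eth0 as a placeholder interface name:

# print NCCL connection/setup details into the launcher logs
export NCCL_DEBUG=INFO
# pin NCCL to the network interface that actually connects the two nodes (placeholder name)
export NCCL_SOCKET_IFNAME=eth0
# optionally fall back to TCP sockets if InfiniBand is misconfigured
export NCCL_IB_DISABLE=1

If the collective is merely slow rather than stuck, the timeout itself can also be raised; Hugging Face Trainer-based scripts generally expose this as the ddp_timeout training argument (in seconds).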

Reproduction

accelerate launch --config_file config.yaml src/train.py \
    --model_name_or_path /mnt/afs/qwen2 \
    --stage sft \
    --do_train true \
    --finetuning_type full \
    --dataset test \
    --template qwen \
    --cutoff_len 8192 \
    --overwrite_cache true \
    --preprocessing_num_workers 16 \
    --output_dir /mnt/afs2/test \
    --logging_steps 1 \
    --save_strategy epoch \
    --save_only_model true \
    --plot_loss true \
    --overwrite_output_dir true \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1.0e-5 \
    --num_train_epochs 4.0 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --fp16 true
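
With the pdsh launcher configured below, a single accelerate launch from the main node dispatches workers to the hosts listed in the hostfile. As an alternative sketch, assuming deepspeed_multinode_launcher is switched to standard and MASTER_NODE_IP stands in for the main node's address, the same command would be run once per node with only the machine rank changing:

# on node 0 (main node)
accelerate launch --config_file config.yaml --machine_rank 0 \
    --main_process_ip MASTER_NODE_IP --main_process_port 21112 \
    --num_machines 2 --num_processes 16 src/train.py ...

# on node 1
accelerate launch --config_file config.yaml --machine_rank 1 \
    --main_process_ip MASTER_NODE_IP --main_process_port 21112 \
    --num_machines 2 --num_processes 16 src/train.py ...

Command-line flags override the corresponding values in config.yaml, so the config file can stay identical on both nodes; the trailing ... stands for the training arguments shown above.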

The contents of config.yaml are as follows:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /mnt/config_files/deepspeed_configs/profile_7b.json
  deepspeed_multinode_launcher: pdsh
  deepspeed_hostfile: /etc/mpi/hostfile  # path to the hostfile
  zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_training_function: main
main_process_port: 21112
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false
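
One common pitfall in this kind of setup is the hostfile itself: DeepSpeed expects one line per host with a slot count matching that node's GPUs. A sketch of /etc/mpi/hostfile with placeholder hostnames, assuming two 8-GPU machines as implied by num_machines and num_processes above:

node-0 slots=8
node-1 slots=8

The pdsh launcher generally also needs passwordless SSH from the launching node to every host listed in this file.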

The contents of the deepspeed_config_file are as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true
  }
}
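
With the "auto" values above, DeepSpeed resolves the effective batch size from the training arguments, so this run should come out to per_device_train_batch_size × gradient_accumulation_steps × num_processes = 4 × 16 × 16 = 1024 samples per optimizer step (assuming all 16 processes act as data-parallel workers under ZeRO stage 2).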

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the "pending" (This problem is yet to be addressed) label on Sep 27, 2024
@hiyouga
Owner

hiyouga commented Sep 27, 2024

accelerate is supported.

@hiyouga hiyouga added the "solved" (This problem has been already solved) label and removed the "pending" (This problem is yet to be addressed) label on Sep 27, 2024
@hiyouga hiyouga closed this as completed Sep 27, 2024