Reminder
System Info
llamafactory version: 0.7.2.dev0
Reproduction
Run command:
CUDA_VISIBLE_DEVICES=1,2,3 deepspeed --num_gpus=3 --master_port=9901 src/train.py \
    --deepspeed ds_config.json \
    --stage sft \
    --do_train True \
    --model_name_or_path saves/Custom/full/train_Qwen2-7B-instruct_book_shoucheng_add_full_pt1 \
    --finetuning_type lora \
    --template qwen \
    --flash_attn auto \
    --dataset_dir data \
    --dataset alpaca_zh_demo,glaive_toolcall_zh_demo,sft_diagnose1,sft_diagnose2 \
    --cutoff_len 4096 \
    --learning_rate 5e-05 \
    --num_train_epochs 1.0 \
    --max_samples 300000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 3.0 \
    --logging_steps 5 \
    --save_steps 3000 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/Custom/lora/train_Qwen2-7B-instruct_book_shoucheng_add_full_add_diagnose_lora_sft1 \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target all \
    --plot_loss True
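The ds_config.json passed via --deepspeed is not attached to this report. Purely for reference, a minimal configuration consistent with the flags above (fp16 enabled, per-device batch size 4, gradient accumulation 8) might look like the sketch below; the ZeRO stage and every other value here are assumptions, not taken from the issue.

# Hypothetical reconstruction of ds_config.json -- the actual file was not
# provided, so all values below are assumptions.
import json

ds_config = {
    # "auto" lets DeepSpeed inherit these from the HF TrainingArguments
    # (--per_device_train_batch_size 4, --gradient_accumulation_steps 8,
    # --max_grad_norm 3.0).
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    # Matches --fp16 True in the launch command.
    "fp16": {"enabled": "auto"},
    # ZeRO stage 2 is a common choice for 7B LoRA fine-tuning; assumed here.
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)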
Error output:
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fafdf1aa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fafdf1aefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fafdf1b031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb02cd26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fafdf1aa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fafdf1aefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fafdf1b031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb02cd26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x7fafdee32e33 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbe9dbaa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbe9dbaefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbe9dbb031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbe9dbaa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbe9dbaefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbe9dbb031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x7fbe9d832e33 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)
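For context on the log above: Timeout(ms)=1800000 is torch.distributed's default 30-minute collective timeout, and the watchdog fired after exactly that window on the second collective (SeqNum=2), which usually means at least one rank never reached that all-reduce. The standalone sketch below only illustrates where the 1800000 ms value comes from when a process group is initialized directly; it is not LLaMA-Factory code, and raising the timeout would only widen the window rather than fix a stalled rank.

# Illustrative only: assumes a distributed launcher (deepspeed/torchrun) has
# already exported MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    # 30 minutes == 1800000 ms, the default that appears in the watchdog error.
    timeout=datetime.timedelta(minutes=30),
)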
Expected behavior
No response
Others
No response