Reminder
System Info
llamafactory version: 0.7.2.dev0
Reproduction
Run command:
CUDA_VISIBLE_DEVICES=1,2,3 deepspeed --num_gpus=3 --master_port=9901 src/train.py \
    --deepspeed ds_config.json \
    --stage sft \
    --do_train True \
    --model_name_or_path saves/Custom/full/train_Qwen2-7B-instruct_book_shoucheng_add_full_pt1 \
    --finetuning_type lora \
    --template qwen \
    --flash_attn auto \
    --dataset_dir data \
    --dataset alpaca_zh_demo,glaive_toolcall_zh_demo,sft_diagnose1,sft_diagnose2 \
    --cutoff_len 4096 \
    --learning_rate 5e-05 \
    --num_train_epochs 1.0 \
    --max_samples 300000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --max_grad_norm 3.0 \
    --logging_steps 5 \
    --save_steps 3000 \
    --warmup_steps 0 \
    --optim adamw_torch \
    --packing False \
    --report_to none \
    --output_dir saves/Custom/lora/train_Qwen2-7B-instruct_book_shoucheng_add_full_add_diagnose_lora_sft1 \
    --fp16 True \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target all \
    --plot_loss True
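The ds_config.json passed via --deepspeed is not attached to this report. Purely for reference, a minimal configuration consistent with the flags above (fp16 enabled, per-device batch size 4, gradient accumulation 8) might look like the sketch below; the ZeRO stage and every other value here are assumptions, not taken from the issue.

# Hypothetical reconstruction of ds_config.json -- the actual file was not
# provided, so all values below are assumptions.
import json

ds_config = {
    # "auto" lets DeepSpeed inherit these from the HF TrainingArguments
    # (--per_device_train_batch_size 4, --gradient_accumulation_steps 8,
    # --max_grad_norm 3.0).
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    # Matches --fp16 True in the launch command.
    "fp16": {"enabled": "auto"},
    # ZeRO stage 2 is a common choice for 7B LoRA fine-tuning; assumed here.
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)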
Error output:
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fafdf1aa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fafdf1aefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fafdf1b031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb02cd26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fafdf1aa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fafdf1aefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fafdf1b031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb02cd26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x7fafdee32e33 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbe9dbaa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbe9dbaefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbe9dbb031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbe9dbaa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbe9dbaefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbe9dbb031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x7fbe9d832e33 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)
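For context on the log above: Timeout(ms)=1800000 is torch.distributed's default 30-minute collective timeout, and the watchdog fired after exactly that window on the second collective (SeqNum=2), which usually means at least one rank never reached that all-reduce. The standalone sketch below only illustrates where the 1800000 ms value comes from when a process group is initialized directly; it is not LLaMA-Factory code, and raising the timeout would only widen the window rather than fix a stalled rank.

# Illustrative only: assumes a distributed launcher (deepspeed/torchrun) has
# already exported MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    # 30 minutes == 1800000 ms, the default that appears in the watchdog error.
    timeout=datetime.timedelta(minutes=30),
)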
Expected behavior
No response
Others
No response