
Dataset loading error when fine-tuning a model #4814

Closed
1 task done
xiao-liya opened this issue Jul 14, 2024 · 1 comment
Labels
wontfix This will not be worked on

Comments

@xiao-liya

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.7.2.dev0
  • Platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
  • Python version: 3.10.13
  • PyTorch version: 2.3.0+cu121 (GPU)
  • Transformers version: 4.41.2
  • Datasets version: 2.18.0
  • Accelerate version: 0.30.1
  • PEFT version: 0.11.1
  • TRL version: 0.9.3
  • GPU type: NVIDIA RTX A6000
  • DeepSpeed version: 0.14.0
  • Bitsandbytes version: 0.43.0
  • vLLM version: 0.4.3

Reproduction

Command:
CUDA_VISIBLE_DEVICES=1,2,3 deepspeed --num_gpus=3 --master_port=9901 src/train.py \
--deepspeed ds_config.json \
--stage sft \
--do_train True \
--model_name_or_path saves/Custom/full/train_Qwen2-7B-instruct_book_shoucheng_add_full_pt1 \
--finetuning_type lora \
--template qwen \
--flash_attn auto \
--dataset_dir data \
--dataset alpaca_zh_demo,glaive_toolcall_zh_demo,sft_diagnose1,sft_diagnose2 \
--cutoff_len 4096 \
--learning_rate 5e-05 \
--num_train_epochs 1.0 \
--max_samples 300000 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 8 \
--lr_scheduler_type cosine \
--max_grad_norm 3.0 \
--logging_steps 5 \
--save_steps 3000 \
--warmup_steps 0 \
--optim adamw_torch \
--packing False \
--report_to none \
--output_dir saves/Custom/lora/train_Qwen2-7B-instruct_book_shoucheng_add_full_add_diagnose_lora_sft1 \
--fp16 True \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target all \
--plot_loss True

Error output:
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fafdf1aa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fafdf1aefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fafdf1b031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb02cd26850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800063 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fafdf1aa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fafdf1aefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fafdf1b031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb02cd26850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb02b97a897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x7fafdee32e33 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fb02aedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fb02cc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:563] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 2] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:583] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbe9dbaa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbe9dbaefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbe9dbb031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800035 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fbe9dbaa1b2 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fbe9dbaefd0 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fbe9dbb031c in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbeea29e897 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe32e33 (0x7fbe9d832e33 in /home/ps/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x7fbee98dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x7fbeeb494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x7fbeeb526850 in /lib/x86_64-linux-gnu/libc.so.6)
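
For context: Timeout(ms)=1800000 is the default 30-minute NCCL watchdog, and SeqNum=2 means one of the very first collectives after process-group initialization timed out. One possible reading, given the title of this issue, is that ranks 1 and 2 were blocked at that collective while rank 0 was still loading and tokenizing the four datasets. A single-GPU dry run of the same arguments (a hypothetical sanity check, not part of the original report; the output_dir below is made up) can confirm whether dataset loading itself succeeds before NCCL is involved:

# same training arguments on one GPU, without deepspeed; max_samples is lowered
# so the run finishes quickly once dataset loading/tokenization has succeeded
CUDA_VISIBLE_DEVICES=1 python src/train.py \
--stage sft \
--do_train True \
--model_name_or_path saves/Custom/full/train_Qwen2-7B-instruct_book_shoucheng_add_full_pt1 \
--finetuning_type lora \
--lora_target all \
--template qwen \
--dataset_dir data \
--dataset alpaca_zh_demo,glaive_toolcall_zh_demo,sft_diagnose1,sft_diagnose2 \
--cutoff_len 4096 \
--max_samples 100 \
--fp16 True \
--output_dir saves/debug_dataset_check

If this single-process run stalls or fails while building the dataset, the problem is in the data files or preprocessing rather than in NCCL itself.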

Expected behavior

No response

Others

No response

@github-actions github-actions bot added the pending This problem is yet to be addressed label Jul 14, 2024
@hiyouga
Owner

hiyouga commented Jul 14, 2024

NCCL issue; contact your hardware vendor.

@hiyouga hiyouga added wontfix This will not be worked on and removed pending This problem is yet to be addressed labels Jul 14, 2024
@hiyouga hiyouga closed this as not planned Jul 14, 2024
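
If the stall is actually rank 0 spending more than 30 minutes on dataset preprocessing rather than a hardware fault, a commonly suggested workaround (a sketch only, not something confirmed in this thread) is to raise the collective timeout. ddp_timeout is a standard transformers TrainingArguments option, given in seconds, and the assumption here is that LLaMA-Factory's src/train.py forwards it unchanged:

# appended to the deepspeed launch command from the Reproduction section above
--ddp_timeout 7200

With that flag the process group waits up to two hours instead of the default 1800 seconds before the watchdog fires, which gives slow dataset tokenization time to finish.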