
Meet c10::DistBackendError when finetuning Qwen2-VL with video dataset #5417

Closed
1 task done
htlou opened this issue Sep 12, 2024 · 3 comments
Labels
solved This problem has been already solved

Comments

htlou commented Sep 12, 2024

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.8.4.dev0
  • Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.9
  • PyTorch version: 2.4.0 (GPU)
  • Transformers version: 4.45.0.dev0
  • Datasets version: 2.20.0
  • Accelerate version: 0.33.0
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA H800
  • DeepSpeed version: 0.14.4
  • vLLM version: 0.5.3.post1

Reproduction

I used script:

export WANDB_API_KEY=""
export CUDA_HOME=$CONDA_PREFIX
model_path=Qwen/Qwen2-VL-7B-Instruct
dataset=mllm_video_panda
outputdir=models/long_debug
gradient_accumulation_steps=1
per_device_batchsize=4
epoch_num=3
learning_rate=1.5e-05
MASTER_PORT_START=10000
MASTER_PORT_END=65535
MASTER_PORT="$(
	comm -23 \
		<(seq "${MASTER_PORT_START}" "${MASTER_PORT_END}" | sort) \
		<(ss -Htan | awk '{ print $4 }' | awk -F ':' '{ print $NF }' | sort -u) |
		shuf | head -n 1
)"
DEEPSPEED_ARGS=()
DEEPSPEED_ARGS+=("--master_port" "${MASTER_PORT}")
deepspeed  "${DEEPSPEED_ARGS[@]}" \
 		src/train.py \
	--model_name_or_path $model_path  --stage sft \
	--dataset $dataset \
	--finetuning_type  full \
	--overwrite_cache  true \
	--flash_attn fa2 \
	--preprocessing_num_workers 16 \
	--preprocessing_batch_size 16 \
	--template qwen2_vl \
	--output_dir $outputdir \
	--bf16  true  \
	--lr_scheduler_type  cosine \
	--do_train  true  \
	--do_eval true \
	--packing false \
	--gradient_accumulation_steps  $gradient_accumulation_steps \
	--gradient_checkpointing  true \
	--learning_rate  $learning_rate \
	--log_level  passive \
	--logging_steps  10 \
	--logging_strategy  steps \
	--max_steps  -1 \
	--num_train_epochs $epoch_num \
	--report_to wandb \
	--weight_decay 0.01 \
	--cutoff_len 8192 \
	--warmup_ratio 0.02 \
	--eval_steps 200 \
	--val_size 0.01 \
	--evaluation_strategy steps \
	--overwrite_output_dir  true  \
	--per_device_train_batch_size  $per_device_batchsize \
	--remove_unused_columns  true \
	--save_strategy epoch \
	--plot_loss \
	--save_total_limit 3 \
	--save_safetensors  true  \
	--deepspeed=examples/deepspeed/ds_z3_config.json | tee ${outputdir}/output.log

I received the error message below on the command line while the dataset was being tokenized:

Running tokenizer on dataset (num_proc=16):  98%|██████████████████████████████████████████████████████████████████████████████████████████████  | 26537/27092 [29:57<02:18,  4.01 examples/s]
[rank7]:[E912 09:28:22.636564685 ProcessGroupNCCL.cpp:607] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800013 milliseconds before timing out.
[rank6]:[E912 09:28:22.636621604 ProcessGroupNCCL.cpp:607] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
[rank4]:[E912 09:28:22.638776510 ProcessGroupNCCL.cpp:607] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
[rank2]:[E912 09:28:22.639509869 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800031 milliseconds before timing out.
[rank7]:[E912 09:28:22.641396375 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 7] Exception (either an error or timeout) detected by watchdog at work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank7]:[E912 09:28:22.641408298 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 7] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank7]:[E912 09:28:22.641414460 ProcessGroupNCCL.cpp:621] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E912 09:28:22.641404234 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank7]:[E912 09:28:22.641418786 ProcessGroupNCCL.cpp:627] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E912 09:28:22.641413276 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 4] Exception (either an error or timeout) detected by watchdog at work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank6]:[E912 09:28:22.641402409 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 6] Exception (either an error or timeout) detected by watchdog at work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank2]:[E912 09:28:22.641421137 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank4]:[E912 09:28:22.641427636 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 4] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank6]:[E912 09:28:22.641435771 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 6] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank4]:[E912 09:28:22.641440200 ProcessGroupNCCL.cpp:621] [Rank 4] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E912 09:28:22.641439757 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E912 09:28:22.641445702 ProcessGroupNCCL.cpp:621] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E912 09:28:22.641447829 ProcessGroupNCCL.cpp:627] [Rank 4] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E912 09:28:22.641450799 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E912 09:28:22.641453375 ProcessGroupNCCL.cpp:627] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E912 09:28:22.645647917 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800007 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538456841/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x1554fbaf3f86 in /home/miniconda3/envs/hantao_tiv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x1554fce0fa42 in /home/miniconda3/envs/hantao_tiv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x1554fce16483 in /home/miniconda3/envs/hantao_tiv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x1554fce1886c in /home/miniconda3/envs/hantao_tiv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b55 (0x15554d0f0b55 in /home/miniconda3/envs/hantao_tiv/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x15555527dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x15555530fa40 in /lib/x86_64-linux-gnu/libc.so.6)

I tried different datasets and seeds, but every run fails at about the same point in time (~30 minutes in), so I'm wondering whether something is wrong with Qwen2-VL or the fine-tuning code.

Expected behavior

The dataset is processed and fine-tuning runs normally without errors.

Others

No response

github-actions bot added the pending (This problem is yet to be addressed) label on Sep 12, 2024
hiyouga (Owner) commented Sep 12, 2024

Please see the example scripts and increase the ddp_timeout parameter.
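For context on why this happens: while the main process tokenizes the dataset, the other ranks wait at a distributed barrier (most likely the single-element ALLREDUCE, SeqNum=2, in the trace above), so once tokenizing the video dataset takes longer than the 1800000 ms watchdog timeout (the 30-minute ddp_timeout default, matching the ~30-minute failures reported above) NCCL takes the job down. Below is a minimal sketch of the suggested fix applied to the launch command from the Reproduction section, assuming the same shell variables; ddp_timeout is the standard transformers TrainingArguments option (given in seconds), and 86400 is only an illustrative value, not one taken from this thread:

# Minimal sketch, assuming the same variables as the reproduction script above;
# most of the original flags are omitted here for brevity. The only new flag is
# --ddp_timeout, which is forwarded to transformers.TrainingArguments and is
# given in seconds (86400 = 24 h, purely illustrative).
deepspeed "${DEEPSPEED_ARGS[@]}" \
	src/train.py \
	--model_name_or_path $model_path --stage sft \
	--dataset $dataset \
	--template qwen2_vl \
	--finetuning_type full \
	--output_dir $outputdir \
	--bf16 true \
	--ddp_timeout 86400 \
	--deepspeed examples/deepspeed/ds_z3_config.json | tee ${outputdir}/output.log

When launching from a YAML config instead, the same key (ddp_timeout) can be set in the config file.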

hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label on Sep 12, 2024
hiyouga closed this as completed on Sep 12, 2024

alphanlp commented Oct 5, 2024

> please see the example scripts and increase the ddp_timeout parameter

Why does a timeout occur? Can you explain the real reason?


alphanlp commented Oct 5, 2024

Increasing the ddp_timeout parameter does not solve my error when I use the command "llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml" to start training.

Errors always occur:
[INFO|trainer.py:2252] 2024-10-05 17:29:32,770 >> Number of trainable parameters = 7,615,616,512
20%|███████████████████████████████████▍ | 3/15 [01:02<04:03, 20.33s/it]/data/llmodel/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/nn/modules/conv.py:605: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return F.conv3d(
[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1369, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=68124672, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1369, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=68124672, Timeout(ms)=600000) ran for 600012 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1369, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=68124672, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
[rank5]:[E ProcessGroupNCCL.cpp:563] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1369, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=68124672, Timeout(ms)=600000) ran for 600024 milliseconds before timing out.
[rank4]:[E ProcessGroupNCCL.cpp:563] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1369, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=68124672, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1369, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=68124672, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1369, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=68124672, Timeout(ms)=600000) ran for 600087 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:1537] [PG 1 Rank 3] Timeout at NCCL work: 1369, last enqueued NCCL work: 1371, last completed NCCL work: 1368.
[rank3]:[E ProcessGroupNCCL.cpp:577] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:583] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1414] [PG 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1369, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=68124672, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9d7797a897 in /data/llmodel/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f9d2ba4f1b2 in /data/llmodel/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f9d2ba53fd0 in /data/llmodel/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f9d2ba5531c in /data/llmodel/miniconda3/envs/llamafactory/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f9d774b0253 in /data/llmodel/miniconda3/envs/llamafactory/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f9d78babac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f9d78c3d850 in /lib/x86_64-linux-gnu/libc.so.6)
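Worth noting about the trace above: this timeout is 600000 ms on a _REDUCE_SCATTER_BASE issued during training itself (step 3/15) and on process group "PG 1", not the 1800000 ms preprocessing timeout from the original report, so a larger ddp_timeout alone may not help if one rank is genuinely stuck (for example in the conv3d path that triggers the cuDNN warning above). A minimal diagnostic sketch, assuming the same launch command; NCCL_DEBUG, NCCL_DEBUG_SUBSYS and TORCH_DISTRIBUTED_DEBUG are generic NCCL/PyTorch debugging switches, not LLaMA-Factory options:

# Re-run with verbose collective logging to see which rank stalls, and in which
# collective, before the watchdog fires. These environment variables are generic
# NCCL / torch.distributed debugging switches, not LLaMA-Factory options.
export NCCL_DEBUG=INFO                 # per-rank NCCL log output
export NCCL_DEBUG_SUBSYS=INIT,COLL     # limit NCCL logs to init and collectives
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra c10d checks on collective order/shapes
llamafactory-cli train examples/train_full/qwen2vl_full_sft.yaml 2>&1 | tee nccl_debug.log

If the logs show every rank entering the reduce-scatter except one, attaching a tool such as py-spy dump to that rank's process is a reasonable next step to see where it is stuck.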
