
QLoRA + SFT fine-tuning of qwen1.5-72b-int4 hangs with no progress, then fails with "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=488, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802602 milliseconds before timing out" #2702

Closed
TestNLP opened this issue Mar 5, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments

TestNLP commented Mar 5, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6005 \
    src/train_bash.py \
    --deepspeed ds_zero2.json \
    --model_name_or_path /work1/models/qwen2/qwen1.5_72b_int4/ \
    --stage sft --template qwen --quantization_bit 4 \
    --finetuning_type lora --lora_target q_proj,v_proj \
    --do_train --flash_attn True \
    --dataset factory_train_final_20240304_6k \
    --output_dir output_1 --overwrite_output_dir \
    --dataloader_num_workers 120 \
    --gradient_accumulation_steps 8 --gradient_checkpointing True --bf16 True \
    --num_train_epochs 2 --learning_rate 1e-5 --lr_scheduler_type "cosine" \
    --warmup_ratio 0.01 --adam_beta2 0.95 --weight_decay 0.1 \
    --evaluation_strategy epoch --eval_steps 100 --evaluation_strategy "no" \
    --eval_accumulation_steps 1 --bf16_full_eval True --prediction_loss_only True \
    --save_strategy epoch --save_steps 100 --save_total_limit 10 \
    --logging_steps 10 \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
    --log_level info \
    --max_length 6144 --cutoff_len 6144 \
    --ddp_timeout 180000000 \
    --max_new_tokens 2048

ds_zero2.json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}

Expected behavior

No response

System Info

transformers 4.37.2
auto-gptq 0.5.1+cu117
bert4torch 0.3.9
deepspeed 0.12.5
torch 2.0.1

OS: Linux

Others

[screenshot of the error log attached]
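
For context, Timeout(ms)=1800000 in the watchdog message is PyTorch's default 30-minute NCCL collective timeout, meaning at least one rank stopped participating in an all-reduce for half an hour; the watchdog also reports that default even though --ddp_timeout 180000000 was passed, which suggests the larger timeout never reached this process group. A minimal debugging sketch (my own suggestion, not part of the original report) for seeing which rank and which collective is stuck, using environment variables that exist in torch 2.0.x:

# Hypothetical debugging setup: surface per-rank NCCL activity before the watchdog fires.
export NCCL_DEBUG=INFO                  # print NCCL init/collective logs on every rank
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # log collective ordering/shape mismatches per rank
# ...then re-run the unchanged torchrun command shown under Reproduction.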
hiyouga added the "pending" (This problem is yet to be addressed) label on Mar 5, 2024
hiyouga (Owner) commented Mar 5, 2024

Try #1683

@JerryDaHeLian

I solved it by setting:
--preprocessing_num_workers 128
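
For anyone hitting the same hang, a sketch of where that flag goes, abbreviated from the Reproduction command (only the flags relevant to model and data loading are shown; keep the rest of your arguments as before). Presumably this helps because tokenizing the dataset in a single process can take long enough that the other ranks time out waiting at their first collective:

# Sketch of the reported fix: parallelize dataset preprocessing across 128 worker processes.
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6005 \
    src/train_bash.py \
    --deepspeed ds_zero2.json \
    --model_name_or_path /work1/models/qwen2/qwen1.5_72b_int4/ \
    --stage sft --template qwen --quantization_bit 4 \
    --finetuning_type lora --lora_target q_proj,v_proj \
    --do_train --dataset factory_train_final_20240304_6k \
    --cutoff_len 6144 \
    --output_dir output_1 \
    --preprocessing_num_workers 128    # parallel dataset tokenization (the commenter's fix)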

hiyouga added the "solved" (This problem has been already solved) label and removed the "pending" label on Mar 20, 2024
hiyouga closed this as completed on Mar 20, 2024