QLoRA + SFT fine-tuning of qwen1.5-72b-int4 hangs with no progress, then fails with: Watchdog caught collective operation timeout: WorkNCCL(SeqNum=488, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802602 milliseconds before timing out #2702
Labels: solved
Reproduction
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 \
    --master_addr localhost --master_port 6005 \
    src/train_bash.py \
    --deepspeed ds_zero2.json \
    --model_name_or_path /work1/models/qwen2/qwen1.5_72b_int4/ \
    --stage sft \
    --template qwen \
    --quantization_bit 4 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --do_train \
    --flash_attn True \
    --dataset factory_train_final_20240304_6k \
    --output_dir output_1 \
    --overwrite_output_dir \
    --dataloader_num_workers 120 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing True \
    --bf16 True \
    --num_train_epochs 2 \
    --learning_rate 1e-5 \
    --lr_scheduler_type "cosine" \
    --warmup_ratio 0.01 \
    --adam_beta2 0.95 \
    --weight_decay 0.1 \
    --evaluation_strategy epoch \
    --eval_steps 100 \
    --evaluation_strategy "no" \
    --eval_accumulation_steps 1 \
    --bf16_full_eval True \
    --prediction_loss_only True \
    --save_strategy epoch \
    --save_steps 100 \
    --save_total_limit 10 \
    --logging_steps 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --log_level info \
    --max_length 6144 \
    --cutoff_len 6144 \
    --ddp_timeout 180000000 \
    --max_new_tokens 2048
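Two observations on this command. First, --evaluation_strategy is passed twice; with argparse the later value ("no") wins, so evaluation is disabled. Second, the Timeout(ms)=1800000 in the watchdog message is PyTorch's default 30-minute process-group timeout, which suggests the very large --ddp_timeout may not have propagated to the NCCL watchdog (DeepSpeed initializes its own process group). A minimal diagnostic re-run with standard NCCL/PyTorch environment variables — these are an assumption for debugging, not part of the original report, and the trailing "..." stands for the remaining flags above:

export NCCL_DEBUG=INFO              # log NCCL setup and show which collective stalls
export NCCL_ASYNC_ERROR_HANDLING=1  # fail fast on a desynced rank instead of hanging
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 \
    --master_addr localhost --master_port 6005 \
    src/train_bash.py ...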
ds_zero2.json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
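One possible mismatch worth checking (an assumption, not something diagnosed in the report): the launch command passes --bf16 True, but ds_zero2.json only declares an fp16 section. With the HF Trainer's DeepSpeed integration, a bf16 block set to "auto" lets the Trainer resolve the dtype from the command line. A minimal sketch of the addition, placed alongside the existing "fp16" block at the top level of ds_zero2.json:

  "bf16": {
    "enabled": "auto"
  }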
Expected behavior
No response
System Info
transformers 4.37.2
auto-gptq 0.5.1+cu117
bert4torch 0.3.9
deepspeed 0.12.5
torch 2.0.1
Linux
Others