
QLoRA + SFT fine-tuning of qwen1.5-72b-int4 hangs with no progress, then fails with "Watchdog caught collective operation timeout: WorkNCCL(SeqNum=488, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1802602 milliseconds before timing out" #2702

Closed
TestNLP opened this issue Mar 5, 2024 · 2 comments
Labels
solved This problem has been already solved

Comments

TestNLP commented Mar 5, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6005 \
    src/train_bash.py \
    --deepspeed ds_zero2.json \
    --model_name_or_path /work1/models/qwen2/qwen1.5_72b_int4/ \
    --stage sft --template qwen --quantization_bit 4 \
    --finetuning_type lora --lora_target q_proj,v_proj \
    --do_train --flash_attn True \
    --dataset factory_train_final_20240304_6k \
    --output_dir output_1 --overwrite_output_dir \
    --dataloader_num_workers 120 \
    --gradient_accumulation_steps 8 --gradient_checkpointing True --bf16 True \
    --num_train_epochs 2 --learning_rate 1e-5 --lr_scheduler_type "cosine" \
    --warmup_ratio 0.01 --adam_beta2 0.95 --weight_decay 0.1 \
    --evaluation_strategy epoch --eval_steps 100 --evaluation_strategy "no" \
    --eval_accumulation_steps 1 --bf16_full_eval True --prediction_loss_only True \
    --save_strategy epoch --save_steps 100 --save_total_limit 10 \
    --logging_steps 10 \
    --per_device_train_batch_size 1 --per_device_eval_batch_size 1 \
    --log_level info \
    --max_length 6144 --cutoff_len 6144 \
    --ddp_timeout 180000000 \
    --max_new_tokens 2048

ds_zero2.json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}

Expected behavior

No response

System Info

transformers 4.37.2
auto-gptq 0.5.1+cu117
bert4torch 0.3.9
deepspeed 0.12.5
torch 2.0.1

OS: Linux

Others

[screenshot of the error log attached]
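
For context, Timeout(ms)=1800000 in the watchdog message is PyTorch's default 30-minute NCCL collective timeout, meaning at least one rank stopped participating in an all-reduce for half an hour; the watchdog also reports that default even though --ddp_timeout 180000000 was passed, which suggests the larger timeout never reached this process group. A minimal debugging sketch (my own suggestion, not part of the original report) for seeing which rank and which collective is stuck, using environment variables that exist in torch 2.0.x:

# Hypothetical debugging setup: surface per-rank NCCL activity before the watchdog fires.
export NCCL_DEBUG=INFO                  # print NCCL init/collective logs on every rank
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # log collective ordering/shape mismatches per rank
# ...then re-run the unchanged torchrun command shown under Reproduction.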
hiyouga added the "pending" (This problem is yet to be addressed) label on Mar 5, 2024
hiyouga (Owner) commented Mar 5, 2024

Try #1683

@JerryDaHeLian

I solved it by setting:
--preprocessing_num_workers 128
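
For anyone hitting the same hang, a sketch of where that flag goes, abbreviated from the Reproduction command (only the flags relevant to model and data loading are shown; keep the rest of your arguments as before). Presumably this helps because tokenizing the dataset in a single process can take long enough that the other ranks time out waiting at their first collective:

# Sketch of the reported fix: parallelize dataset preprocessing across 128 worker processes.
torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6005 \
    src/train_bash.py \
    --deepspeed ds_zero2.json \
    --model_name_or_path /work1/models/qwen2/qwen1.5_72b_int4/ \
    --stage sft --template qwen --quantization_bit 4 \
    --finetuning_type lora --lora_target q_proj,v_proj \
    --do_train --dataset factory_train_final_20240304_6k \
    --cutoff_len 6144 \
    --output_dir output_1 \
    --preprocessing_num_workers 128    # parallel dataset tokenization (the commenter's fix)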

hiyouga added the "solved" (This problem has been already solved) label and removed the "pending" label on Mar 20, 2024
hiyouga closed this as completed on Mar 20, 2024