
Multi-GPU LoRA training times out #74

Closed
Louis-y-nlp opened this issue Jun 25, 2023 · 26 comments

Labels: solved (This problem has been already solved)

Comments

@Louis-y-nlp

Hello, when training on multiple V100s I always run into a timeout error; both 4-GPU and 2-GPU runs fail. Single-GPU training does not seem to have this problem, but it is slow: fine-tuning on 50k examples takes about 12 hours.

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1805926 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805991 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

Launch script:

accelerate launch src/train_sft.py \
    --model_name_or_path ${model} \
    --do_train \
    --dataset my_dataset \
    --prompt_template alpaca \
    --finetuning_type lora --lora_target W_pack \
    --output_dir ${out_model} \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --plot_loss \
    --fp16 \
    --auto_find_batch_size true --per_device_train_batch_size 16

default_config.yaml

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /home/work/data/codes/LLaMA-Efficient-Tuning/deepspeed_config_stage2.yaml
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
@hiyouga (Owner) commented Jun 25, 2023

Are you running it with nohup?

@Louis-y-nlp (Author)

No, it is not running in the background; I run it directly inside a Docker container.

@hiyouga (Owner) commented Jun 25, 2023

Try huggingface/accelerate#223
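
The linked accelerate issue is about raising the NCCL process-group timeout. Below is a minimal sketch (not from this thread's scripts) of one way to do that when you construct the Accelerator yourself; the 7200-second value is only an example.

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the collective timeout from the default 1800 s to an example 7200 s.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])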

@Louis-y-nlp (Author)

Simply increasing the timeout does not seem to solve it. I tested further and the run hangs at a logging step: the other ranks appear to deadlock while waiting for rank 0 to compute the loss, so for now I have set logging_steps to 1e9. The run log also looks strange, with multiple progress bars. With logging_steps set to 20 the progress bars look like this:

  0%|| 1/1170 [00:18<6:02:19, 18.60s/it]06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
  0%|| 2/1170 [00:34<5:36:33, 17.29s/it][INFO|trainer.py:1779] 2023-06-25 07:35:55,332 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-25 07:35:55,333 >>   Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-25 07:35:55,333 >>   Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-25 07:35:55,333 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-25 07:35:55,333 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-25 07:35:55,334 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-25 07:35:55,334 >>   Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-25 07:35:55,336 >>   Number of trainable parameters = 4,194,304
  0%|| 2/1170 [00:54<8:55:09, 27.49s/it]
  0%|                                                                                                                                                                                   | 0/2343 [00:00<?, ?it/s]
  1%|█▍                                                                                                                                                                      | 20/2343 [05:24<7:37:56, 11.83s/it]

Meanwhile, GPU utilization stays pinned at 100%.

@hiyouga hiyouga added the pending This problem is yet to be addressed label Jun 25, 2023
@hiyouga (Owner) commented Jun 25, 2023

Try turning off DeepSpeed and using a plain accelerate config.

@Louis-y-nlp (Author)

It still hangs at the logging step.
The config yaml is as follows:

compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: ''
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Also, yesterday, after I set logging_steps to infinity, the run saved a checkpoint successfully at a save step and then got stuck; after 7200 s (the timeout I had specified) it failed with the same error.

RuntimeError: NCCL communicator was aborted on rank 0.  Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3056, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7207694 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

@hiyouga (Owner) commented Jun 26, 2023

Try this config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: <your number of GPUs>
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

@Louis-y-nlp (Author)

It still gets stuck at the logging step.

[INFO|trainer.py:1779] 2023-06-26 02:34:58,141 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-26 02:34:58,142 >>   Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-26 02:34:58,142 >>   Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-26 02:34:58,142 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-26 02:34:58,142 >>   Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-26 02:34:58,142 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-26 02:34:58,142 >>   Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-26 02:34:58,144 >>   Number of trainable parameters = 4,194,304
  0%|| 2/1170 [00:54<8:54:55, 27.48s/it]
  0%|                                                                                                                                                                                   | 0/2343 [00:00<?, ?it/s[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7205324 milliseconds before timing out.11:15:08, 17.35s/it]
f07b9fe29941:61323:61360 [1] NCCL INFO [Service thread] Connection closed by localRank 1
f07b9fe29941:61323:61344 [0] NCCL INFO comm 0x4724c640 rank 1 nranks 2 cudaDev 1 busId d0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=7200000) ran for 7206742 milliseconds before timing out.
f07b9fe29941:61322:61361 [0] NCCL INFO [Service thread] Connection closed by localRank 0
f07b9fe29941:61322:61341 [0] NCCL INFO comm 0x48215190 rank 0 nranks 2 cudaDev 0 busId c0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[04:37:26] ERROR    failed (exitcode: -6) local_rank: 0 (pid: 61322) of binary: /root/anaconda3/envs/dolly/bin/python                                                                                  api.py:672
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/anaconda3/envs/dolly/bin/accelerate:8 in <module>                                          │
│                                                                                                  │
│   5 from accelerate.commands.accelerate_cli import main                                          │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(main())                                                                         │
│   9                                                                                              │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py:45  │
│ in main                                                                                          │
│                                                                                                  │
│   42 │   │   exit(1)                                                                             │
│   43 │                                                                                           │
│   44 │   # Run                                                                                   │
│ ❱ 45 │   args.func(args)                                                                         │
│   46                                                                                             │
│   47                                                                                             │
│   48 if __name__ == "__main__":                                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:928 in      │
│ launch_command                                                                                   │
│                                                                                                  │
│   925 │   │   args.deepspeed_fields_from_accelerate_config = ",".join(args.deepspeed_fields_fr   │
│   926 │   │   deepspeed_launcher(args)                                                           │
│   927 │   elif args.use_fsdp and not args.cpu:                                                   │
│ ❱ 928 │   │   multi_gpu_launcher(args)                                                           │
│   929 │   elif args.use_megatron_lm and not args.cpu:                                            │
│   930 │   │   multi_gpu_launcher(args)                                                           │
│   931 │   elif args.multi_gpu and not args.cpu:                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:627 in      │
│ multi_gpu_launcher                                                                               │
│                                                                                                  │
│   624 │   )                                                                                      │
│   625 │   with patch_environment(**current_env):                                                 │
│   626 │   │   try:                                                                               │
│ ❱ 627 │   │   │   distrib_run.run(args)                                                          │
│   628 │   │   except Exception:                                                                  │
│   629 │   │   │   if is_rich_available() and debug:                                              │
│   630 │   │   │   │   console = get_console()                                                    │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/run.py:785 in run       │
│                                                                                                  │
│   782 │   │   )                                                                                  │
│   783 │                                                                                          │
│   784 │   config, cmd, cmd_args = config_from_args(args)                                         │
│ ❱ 785 │   elastic_launch(                                                                        │
│   786 │   │   config=config,                                                                     │
│   787 │   │   entrypoint=cmd,                                                                    │
│   788 │   )(*cmd_args)                                                                           │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:134 in  │
│ __call__                                                                                         │
│                                                                                                  │
│   131 │   │   self._entrypoint = entrypoint                                                      │
│   132 │                                                                                          │
│   133 │   def __call__(self, *args):                                                             │
│ ❱ 134 │   │   return launch_agent(self._config, self._entrypoint, list(args))                    │
│   135                                                                                            │
│   136                                                                                            │
│   137 def _get_entrypoint_name(                                                                  │
│                                                                                                  │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:250 in  │
│ launch_agent                                                                                     │
│                                                                                                  │
│   247 │   │   │   # if the error files for the failed children exist                             │
│   248 │   │   │   # @record will copy the first error (root cause)                               │
│   249 │   │   │   # to the error file of the launcher process.                                   │
│ ❱ 250 │   │   │   raise ChildFailedError(                                                        │
│   251 │   │   │   │   name=entrypoint_name,                                                      │
│   252 │   │   │   │   failures=result.failures,                                                  │
│   253 │   │   │   )                                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ChildFailedError: 
======================================================
src/train_sft.py FAILED
------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-26_04:37:26
  host      : f07b9fe29941
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 61323)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 61323
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-26_04:37:26
  host      : f07b9fe29941
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 61322)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 61322
======================================================

@shaonianyr

Try turning off NCCL synchronization.

@Louis-y-nlp (Author) commented Jun 27, 2023

After adding NCCL_P2P_DISABLE=1 it dies at the very first step @shaonianyr

@wuxiuxiunlp

@Louis-y-nlp Did you get multi-GPU fine-tuning to run?

@Louis-y-nlp (Author)

No, it still keeps hanging inside Docker.

@GitYCC (Contributor) commented Aug 2, 2023

Quoting the earlier comment: "Simply increasing the timeout does not seem to solve it. I tested further and the run hangs at a logging step: the other ranks appear to deadlock while waiting for rank 0 to compute the loss, so for now I have set logging_steps to 1e9."


@Louis-y-nlp
How do you set the timeout value with accelerate launch?

@Louis-y-nlp (Author)

@GitYCC

torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))
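
A minimal sketch of where such a call could sit in a custom training entry point (a hypothetical script, not this repository's code; with the HF Trainer or the LLaMA-Factory CLI you would normally pass --ddp_timeout instead of calling this yourself):

import datetime
import os

import torch
import torch.distributed as dist


def main() -> None:
    # Bind this process to its GPU, then create the NCCL process group
    # with a longer collective timeout (example: 4 hours).
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(seconds=4 * 3600),
    )
    # ... build the dataset, model, and trainer, then train ...


if __name__ == "__main__":
    main()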

@thugbobby

@Louis-y-nlp Did your multi-GPU fine-tuning ever run successfully?

@Louis-y-nlp (Author)

No. Multi-GPU training keeps hanging, and since there is no error message at all I have no idea how to debug it; only single-GPU runs work.

@thugbobby

only single-GPU runs work

Have you found any other workaround? I have tried several and none of them worked.

@Louis-y-nlp (Author)

I pulled the latest version of the code and it runs now.

@hiyouga hiyouga added solved This problem has been already solved and removed pending This problem is yet to be addressed labels Aug 8, 2023
@hiyouga hiyouga closed this as completed Aug 8, 2023
@Louis-y-nlp (Author)

That was incredibly fast, you must be online 24/7.

@TianRuiHe

Are you running it with nohup?

Hello, I ran into the same problem. I did use nohup to keep training running in the background; what could be causing this?
Specifically, I used nohup to run a DeepSpeed training job in the background, and after roughly 1000+ steps it failed with:
Connection closed by localRank -1
and then the job stopped.

@homiec commented Feb 26, 2024

Are you running it with nohup?

I would like to ask: does using nohup cause this problem?

@etoilestar

init_process_group

Where should this be added?

@yawzhe commented Mar 18, 2024

torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))

Where does this call go? I set --ddp_timeout; the dataset loads fine the first time, but during the run data_tokenizer has to be loaded twice and the second load fails.
(attached screenshot: 微信图片_20240318191811)

@JerryDaHeLian

Small datasets are fine; with a large dataset you get a timeout, most likely because it is stuck at the "Running tokenizer on dataset" step. If that is the case, setting --preprocessing_num_workers 128 solves it.
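
For context, --preprocessing_num_workers is typically forwarded to the num_proc argument of datasets.map() during tokenization. A rough sketch of the idea, with a placeholder model path, data file, and column name:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/your-model")  # placeholder path
dataset = load_dataset("json", data_files="my_dataset.json")["train"]  # placeholder file

def tokenize_function(examples):
    # Placeholder column name "text"; adjust to your dataset schema.
    return tokenizer(examples["text"], truncation=True, max_length=2048)

# num_proc plays the same role as --preprocessing_num_workers: it parallelizes
# tokenization so preprocessing finishes before any NCCL collective times out.
tokenized = dataset.map(tokenize_function, batched=True, num_proc=128)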

@CaiJichang212

I am also running into an NCCL timeout. With qwen2-vl-7b, LoRA-only fine-tuning runs normally; if ...
My command is below:

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
# specify which GPUs to use
export CUDA_VISIBLE_DEVICES=0,1
# single-node multi-GPU training
export FORCE_TORCHRUN=1
# If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  
# See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# activate the llamafactory virtual environment
source activate torch2py311cu12lmft # from the command-line base environment

cur_date=$(date +%Y-%m-%d)
mkdir -p /root/autodl-fs/log/$cur_date

# choose the model
# model=qwen2_vl_2b
model=qwen2_vl_7b
# choose the fine-tuning method
# method=full
method=lora
# task stage: supervised fine-tuning
task=sft

# defaults
pdbs=1 # per device batch size
gas=5 # gradient accumulation steps
bs=$((gas*pdbs*2)) # total batch size
steps=10 # logging steps
epoch=10 # number of epochs (default 3; previously 3+2+2=7)
lr=5e-5 # learning rate
max_grad_norm=1.0 # gradient clipping threshold; 1.0 is the common default for open-source LLMs
cutoff_len=2048 # cutoff length; trying 2048 to see whether it errors


cnt=3
lr=9e-5 
gas=8
bs=$((gas*pdbs*2))
echo ${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}
llamafactory-cli train \
    --freeze_vision_tower false \
    --max_grad_norm $max_grad_norm \
    --output_dir /root/autodl-fs/saves/${model}/${method}-${task}/lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm} \
    --logging_steps $steps \
    --save_strategy epoch \
    --per_device_train_batch_size $pdbs \
    --gradient_accumulation_steps $gas \
    --learning_rate $lr \
    --num_train_epochs $epoch \
    --model_name_or_path /root/autodl-fs/huggingface/Qwen2-VL-7B-Instruct \
    --stage $task \
    --do_train true \
    --finetuning_type $method \
    --lora_target all \
    --dataset mire_train_check \
    --template qwen2_vl \
    --cutoff_len $cutoff_len \
    --max_samples 1000 \
    --overwrite_cache true \
    --preprocessing_num_workers 16 \
    --plot_loss true \
    --overwrite_output_dir true \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 true \
    --ddp_timeout 180000000 \
    --flash_attn fa2 \
    --enable_liger_kernel true \
    --deepspeed examples/deepspeed/ds_z2_config.json \
    --eval_dataset mire_train_check \
    --per_device_eval_batch_size $((pdbs*2)) \
    --eval_strategy epoch \
    > /root/autodl-fs/log/${cur_date}/${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}.log 2>&1 && /usr/bin/shutdown

Error output: the run hangs at step 13/620.
{'loss': 0.8541, 'grad_norm': 6.580702781677246, 'learning_rate': 1.4516129032258065e-05, 'epoch': 0.16}

2%|█▋ | 10/620 [00:59<58:46, 5.78s/it]
2%|█▊ | 11/620 [01:05<59:23, 5.85s/it]
2%|█▉ | 12/620 [01:11<58:11, 5.74s/it]
2%|██▏ | 13/620 [01:17<58:26, 5.78s/it][rank1]:[E1123 16:02:10.083090635 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1517, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out.
[rank1]:[E1123 16:02:10.083642868 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1517, last enqueued NCCL work: 1517, last completed NCCL work: 1516.
[rank0]:[E1123 16:02:10.114598162 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1516, OpType=ALLREDUCE, NumelIn=25427968, NumelOut=25427968, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
[rank0]:[E1123 16:02:10.115071453 ProcessGroupNCCL.cpp:1664] [PG 1 Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1516, last enqueued NCCL work: 1516, last completed NCCL work: 1515.
[rank1]:[E1123 16:02:11.535545146 ProcessGroupNCCL.cpp:1709] [PG 1 Rank 1] Timeout at NCCL work: 1517, last enqueued NCCL work: 1517, last completed NCCL work: 1516.
[rank1]:[E1123 16:02:11.535815819 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E1123 16:02:11.535957791 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E1123 16:02:11.538140975 ProcessGroupNCCL.cpp:1515] [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1517, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600033 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f5af1f77f86 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f5aa3f5f8d2 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f5aa3f66313 in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f5aa3f686fc in /root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f5af16c7bf4 in /root/miniconda3/envs/torch2py311cu12lmft/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f5af2dddac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f5af2e6ebf4 in /usr/lib/x86_64-linux-gnu/libc.so.6)

W1123 16:02:11.882000 139878429300544 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1351 closing signal SIGTERM
E1123 16:02:12.398000 139878429300544 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 1352) of binary: /root/miniconda3/envs/torch2py311cu12lmft/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/torch2py311cu12lmft/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/torch2py311cu12lmft/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-11-23_16:02:11
host : autodl-container-b33448a49f-90764891
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 1352)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1352

@CaiJichang212

Increasing --ddp_timeout to 360000000 does not help either.
