Multi-GPU LoRA training timeout #74
Comments
Are you running it with nohup? |
No, it's not in the background; I'm running it directly inside Docker. |
Simply increasing the timeout doesn't seem to solve the problem. From testing, it hangs at the logging step; the other ranks apparently deadlock while waiting for rank 0 to compute the loss, so for now I've set logging_steps to 1e9. The training log also looks odd, with multiple progress bars. With logging_steps set to 20, the progress bar reads:
0%|▏ | 1/1170 [00:18<6:02:19, 18.60s/it]06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
0%|▎ | 2/1170 [00:34<5:36:33, 17.29s/it][INFO|trainer.py:1779] 2023-06-25 07:35:55,332 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-25 07:35:55,333 >> Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-25 07:35:55,333 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-25 07:35:55,333 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-25 07:35:55,333 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-25 07:35:55,334 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-25 07:35:55,334 >> Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-25 07:35:55,336 >> Number of trainable parameters = 4,194,304
0%|▎ | 2/1170 [00:54<8:55:09, 27.49s/it]
0%| | 0/2343 [00:00<?, ?it/s]
1%|█▍ | 20/2343 [05:24<7:37:56, 11.83s/it]
Meanwhile, GPU utilization stays at 100%. |
Try turning off DeepSpeed and using a plain accelerate config. |
Still stuck at the logging step.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_offload_params: true
fsdp_sharding_strategy: 1
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: ''
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Also, yesterday, after setting logging_steps to infinity, it got stuck at a save step: one checkpoint was saved successfully, then it hung, and after 7200 s (the configured timeout) it reported the same error.
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3056, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7207694 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down. |
Try this config:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: <your number of GPUs>
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false |
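For reference, a minimal launch sketch using the config above (the filename accelerate_config.yaml is an assumption; src/train_sft.py is the entry point named in the traceback further down):
# Pass the config file explicitly so the cached default accelerate config is not picked up.
accelerate launch --config_file accelerate_config.yaml src/train_sft.py <your training args>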
It still gets stuck at the logging step.
[INFO|trainer.py:1779] 2023-06-26 02:34:58,141 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-26 02:34:58,142 >> Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-26 02:34:58,142 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-26 02:34:58,142 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-26 02:34:58,142 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-26 02:34:58,142 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-26 02:34:58,142 >> Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-26 02:34:58,144 >> Number of trainable parameters = 4,194,304
0%|▎ | 2/1170 [00:54<8:54:55, 27.48s/it]
0%| | 0/2343 [00:00<?, ?it/s[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7205324 milliseconds before timing out.11:15:08, 17.35s/it]
f07b9fe29941:61323:61360 [1] NCCL INFO [Service thread] Connection closed by localRank 1
f07b9fe29941:61323:61344 [0] NCCL INFO comm 0x4724c640 rank 1 nranks 2 cudaDev 1 busId d0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=7200000) ran for 7206742 milliseconds before timing out.
f07b9fe29941:61322:61361 [0] NCCL INFO [Service thread] Connection closed by localRank 0
f07b9fe29941:61322:61341 [0] NCCL INFO comm 0x48215190 rank 0 nranks 2 cudaDev 0 busId c0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[04:37:26] ERROR failed (exitcode: -6) local_rank: 0 (pid: 61322) of binary: /root/anaconda3/envs/dolly/bin/python api.py:672
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/anaconda3/envs/dolly/bin/accelerate:8 in <module> │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py:45 │
│ in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:928 in │
│ launch_command │
│ │
│ 925 │ │ args.deepspeed_fields_from_accelerate_config = ",".join(args.deepspeed_fields_fr │
│ 926 │ │ deepspeed_launcher(args) │
│ 927 │ elif args.use_fsdp and not args.cpu: │
│ ❱ 928 │ │ multi_gpu_launcher(args) │
│ 929 │ elif args.use_megatron_lm and not args.cpu: │
│ 930 │ │ multi_gpu_launcher(args) │
│ 931 │ elif args.multi_gpu and not args.cpu: │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:627 in │
│ multi_gpu_launcher │
│ │
│ 624 │ ) │
│ 625 │ with patch_environment(**current_env): │
│ 626 │ │ try: │
│ ❱ 627 │ │ │ distrib_run.run(args) │
│ 628 │ │ except Exception: │
│ 629 │ │ │ if is_rich_available() and debug: │
│ 630 │ │ │ │ console = get_console() │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/run.py:785 in run │
│ │
│ 782 │ │ ) │
│ 783 │ │
│ 784 │ config, cmd, cmd_args = config_from_args(args) │
│ ❱ 785 │ elastic_launch( │
│ 786 │ │ config=config, │
│ 787 │ │ entrypoint=cmd, │
│ 788 │ )(*cmd_args) │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:134 in │
│ __call__ │
│ │
│ 131 │ │ self._entrypoint = entrypoint │
│ 132 │ │
│ 133 │ def __call__(self, *args): │
│ ❱ 134 │ │ return launch_agent(self._config, self._entrypoint, list(args)) │
│ 135 │
│ 136 │
│ 137 def _get_entrypoint_name( │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:250 in │
│ launch_agent │
│ │
│ 247 │ │ │ # if the error files for the failed children exist │
│ 248 │ │ │ # @record will copy the first error (root cause) │
│ 249 │ │ │ # to the error file of the launcher process. │
│ ❱ 250 │ │ │ raise ChildFailedError( │
│ 251 │ │ │ │ name=entrypoint_name, │
│ 252 │ │ │ │ failures=result.failures, │
│ 253 │ │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ChildFailedError:
======================================================
src/train_sft.py FAILED
------------------------------------------------------
Failures:
[1]:
time : 2023-06-26_04:37:26
host : f07b9fe29941
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 61323)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 61323
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-26_04:37:26
host : f07b9fe29941
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 61322)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 61322
====================================================== |
Try turning off NCCL sync. |
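The next reply reads this as disabling NCCL peer-to-peer transfers; a hedged sketch of setting these standard NCCL environment switches before launching (same assumed filenames as above):
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1   # optionally disable the InfiniBand transport as well
accelerate launch --config_file accelerate_config.yaml src/train_sft.py <your training args>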
After adding NCCL_P2P_DISABLE=1 it dies at the very first step. @shaonianyr |
@Louis-y-nlp Did you get multi-GPU fine-tuning working? |
No, it keeps hanging inside Docker. |
@Louis-y-nlp |
torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx)) |
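If the training script exposes Hugging Face TrainingArguments (as the LLaMA-Factory example further down does), roughly the same timeout can also be raised from the command line; a sketch with an illustrative value and the assumed filenames from above:
# --ddp_timeout is given in seconds and is forwarded to the process-group timeout.
accelerate launch --config_file accelerate_config.yaml src/train_sft.py \
    --ddp_timeout 18000 <your other training args>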
@Louis-y-nlp Has your multi-GPU fine-tuning worked yet? |
No. Multi-GPU keeps hanging, and since there is no error output at all I don't know how to debug it; only single-GPU runs. |
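For a hang that produces no error output, a hedged debugging sketch (not something tried in this thread): these standard PyTorch/NCCL verbosity switches usually show which collective each rank is blocked on.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
accelerate launch --config_file accelerate_config.yaml src/train_sft.py <your training args>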
Have you found any other solution? I've tried several and none of them work. |
I pulled the latest code and it runs now. |
That was fast, you must be online around the clock. |
Hi, I'm hitting the same problem. I used nohup to run training in the background; what could be the cause? |
Just to ask, does this problem appear whenever nohup is used? |
Where should this be added? |
Small datasets are fine, but large datasets time out, very likely stuck at the "tokenizer on dataset" step. If so, it can be handled by setting: |
I also ran into the NCCL timeout problem: with qwen2-vl-7b, LoRA-only fine-tuning runs fine. If
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
# Specify which GPUs to use
export CUDA_VISIBLE_DEVICES=0,1
# Single-node multi-GPU training
export FORCE_TORCHRUN=1
# If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
# See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Activate the llamafactory virtual environment
source activate torch2py311cu12lmft # run from the base environment on the command line
cur_date=$(date +%Y-%m-%d)
mkdir -p /root/autodl-fs/log/$cur_date
# Choose the model
# model=qwen2_vl_2b
model=qwen2_vl_7b
# Choose the fine-tuning method
# method=full
method=lora
# Task stage: supervised instruction fine-tuning
task=sft
# Default values
pdbs=1 # per device batch size
gas=5 # gradient accumulation steps
bs=$((gas*pdbs*2)) # total batch size
steps=10 # logging steps
epoch=10 # number of epochs; default 3, then +2, +2 = 7
lr=5e-5 # learning rate
max_grad_norm=1.0 # gradient clipping threshold; 1.0 is the default for mainstream open-source LLMs
cutoff_len=2048 # cutoff length; trying 2048 to see whether it errors out
cnt=3
lr=9e-5
gas=8
bs=$((gas*pdbs*2))
echo ${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}
llamafactory-cli train \
--freeze_vision_tower false \
--max_grad_norm $max_grad_norm \
--output_dir /root/autodl-fs/saves/${model}/${method}-${task}/lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm} \
--logging_steps $steps \
--save_strategy epoch \
--per_device_train_batch_size $pdbs \
--gradient_accumulation_steps $gas \
--learning_rate $lr \
--num_train_epochs $epoch \
--model_name_or_path /root/autodl-fs/huggingface/Qwen2-VL-7B-Instruct \
--stage $task \
--do_train true \
--finetuning_type $method \
--lora_target all \
--dataset mire_train_check \
--template qwen2_vl \
--cutoff_len $cutoff_len \
--max_samples 1000 \
--overwrite_cache true \
--preprocessing_num_workers 16 \
--plot_loss true \
--overwrite_output_dir true \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
--bf16 true \
--ddp_timeout 180000000 \
--flash_attn fa2 \
--enable_liger_kernel true \
--deepspeed examples/deepspeed/ds_z2_config.json \
--eval_dataset mire_train_check \
--per_device_eval_batch_size $((pdbs*2)) \
--eval_strategy epoch \
> /root/autodl-fs/log/${cur_date}/${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}.log 2>&1 && /usr/bin/shutdown
Error message: it gets stuck at 13/620.
2%|█▋ | 10/620 [00:59<58:46, 5.78s/it] W1123 16:02:11.882000 139878429300544 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1351 closing signal SIGTERM
|
Increasing --ddp_timeout to 360000000 doesn't help either. |
Original issue description: Hi, multi-GPU training on V100s always hits timeout errors; both 4-GPU and 2-GPU runs fail. A single GPU doesn't seem to have this problem, but it is slow: fine-tuning on 50k samples takes about 12 hours.
Run script
default_config.yaml