Multi-GPU LoRA training timeout #74
Comments
Are you running it with nohup? |
No, it's not in the background; I'm running it directly inside Docker. |
Simply increasing the timeout doesn't seem to solve the problem. From testing, it hangs at the logging step; the other ranks apparently deadlock while waiting for rank 0 to compute the loss, so for now I've set logging_steps to 1e9. The training log also looks odd, with multiple progress bars. With logging_steps set to 20, the progress bar reads:
0%|▏ | 1/1170 [00:18<6:02:19, 18.60s/it]06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
06/25/2023 07:35:18 - INFO - torch.nn.parallel.distributed - Reducer buckets have been rebuilt in this iteration.
0%|▎ | 2/1170 [00:34<5:36:33, 17.29s/it][INFO|trainer.py:1779] 2023-06-25 07:35:55,332 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-25 07:35:55,333 >> Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-25 07:35:55,333 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-25 07:35:55,333 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-25 07:35:55,333 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-25 07:35:55,334 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-25 07:35:55,334 >> Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-25 07:35:55,336 >> Number of trainable parameters = 4,194,304
0%|▎ | 2/1170 [00:54<8:55:09, 27.49s/it]
0%| | 0/2343 [00:00<?, ?it/s]
1%|█▍ | 20/2343 [05:24<7:37:56, 11.83s/it]
Meanwhile, GPU utilization stays at 100%. |
Try turning off DeepSpeed and using a plain accelerate config. |
Still stuck at the logging step.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_offload_params: true
fsdp_sharding_strategy: 1
fsdp_state_dict_type: FULL_STATE_DICT
fsdp_transformer_layer_cls_to_wrap: ''
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Also, yesterday, after setting logging_steps to infinity, it got stuck at a save step: one checkpoint was saved successfully, then it hung, and after 7200 s (the configured timeout) it reported the same error.
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3056, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7207694 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down. |
Try this config:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: <your number of GPUs>
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false |
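For reference, a minimal launch sketch using the config above (the filename accelerate_config.yaml is an assumption; src/train_sft.py is the entry point named in the traceback further down):
# Pass the config file explicitly so the cached default accelerate config is not picked up.
accelerate launch --config_file accelerate_config.yaml src/train_sft.py <your training args>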
It still gets stuck at the logging step.
[INFO|trainer.py:1779] 2023-06-26 02:34:58,141 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-26 02:34:58,142 >> Num examples = 50,000
[INFO|trainer.py:1781] 2023-06-26 02:34:58,142 >> Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-26 02:34:58,142 >> Instantaneous batch size per device = 16
[INFO|trainer.py:1783] 2023-06-26 02:34:58,142 >> Total train batch size (w. parallel, distributed & accumulation) = 128
[INFO|trainer.py:1784] 2023-06-26 02:34:58,142 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1785] 2023-06-26 02:34:58,142 >> Total optimization steps = 2,343
[INFO|trainer.py:1786] 2023-06-26 02:34:58,144 >> Number of trainable parameters = 4,194,304
0%|▎ | 2/1170 [00:54<8:54:55, 27.48s/it]
0%| | 0/2343 [00:00<?, ?it/s[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=ALLGATHER, Timeout(ms)=7200000) ran for 7205324 milliseconds before timing out.11:15:08, 17.35s/it]
f07b9fe29941:61323:61360 [1] NCCL INFO [Service thread] Connection closed by localRank 1
f07b9fe29941:61323:61344 [0] NCCL INFO comm 0x4724c640 rank 1 nranks 2 cudaDev 1 busId d0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=80, OpType=BROADCAST, Timeout(ms)=7200000) ran for 7206742 milliseconds before timing out.
f07b9fe29941:61322:61361 [0] NCCL INFO [Service thread] Connection closed by localRank 0
f07b9fe29941:61322:61341 [0] NCCL INFO comm 0x48215190 rank 0 nranks 2 cudaDev 0 busId c0 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[04:37:26] ERROR failed (exitcode: -6) local_rank: 0 (pid: 61322) of binary: /root/anaconda3/envs/dolly/bin/python api.py:672
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /root/anaconda3/envs/dolly/bin/accelerate:8 in <module> │
│ │
│ 5 from accelerate.commands.accelerate_cli import main │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ ❱ 8 │ sys.exit(main()) │
│ 9 │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py:45 │
│ in main │
│ │
│ 42 │ │ exit(1) │
│ 43 │ │
│ 44 │ # Run │
│ ❱ 45 │ args.func(args) │
│ 46 │
│ 47 │
│ 48 if __name__ == "__main__": │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:928 in │
│ launch_command │
│ │
│ 925 │ │ args.deepspeed_fields_from_accelerate_config = ",".join(args.deepspeed_fields_fr │
│ 926 │ │ deepspeed_launcher(args) │
│ 927 │ elif args.use_fsdp and not args.cpu: │
│ ❱ 928 │ │ multi_gpu_launcher(args) │
│ 929 │ elif args.use_megatron_lm and not args.cpu: │
│ 930 │ │ multi_gpu_launcher(args) │
│ 931 │ elif args.multi_gpu and not args.cpu: │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/accelerate/commands/launch.py:627 in │
│ multi_gpu_launcher │
│ │
│ 624 │ ) │
│ 625 │ with patch_environment(**current_env): │
│ 626 │ │ try: │
│ ❱ 627 │ │ │ distrib_run.run(args) │
│ 628 │ │ except Exception: │
│ 629 │ │ │ if is_rich_available() and debug: │
│ 630 │ │ │ │ console = get_console() │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/run.py:785 in run │
│ │
│ 782 │ │ ) │
│ 783 │ │
│ 784 │ config, cmd, cmd_args = config_from_args(args) │
│ ❱ 785 │ elastic_launch( │
│ 786 │ │ config=config, │
│ 787 │ │ entrypoint=cmd, │
│ 788 │ )(*cmd_args) │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:134 in │
│ __call__ │
│ │
│ 131 │ │ self._entrypoint = entrypoint │
│ 132 │ │
│ 133 │ def __call__(self, *args): │
│ ❱ 134 │ │ return launch_agent(self._config, self._entrypoint, list(args)) │
│ 135 │
│ 136 │
│ 137 def _get_entrypoint_name( │
│ │
│ /root/anaconda3/envs/dolly/lib/python3.9/site-packages/torch/distributed/launcher/api.py:250 in │
│ launch_agent │
│ │
│ 247 │ │ │ # if the error files for the failed children exist │
│ 248 │ │ │ # @record will copy the first error (root cause) │
│ 249 │ │ │ # to the error file of the launcher process. │
│ ❱ 250 │ │ │ raise ChildFailedError( │
│ 251 │ │ │ │ name=entrypoint_name, │
│ 252 │ │ │ │ failures=result.failures, │
│ 253 │ │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ChildFailedError:
======================================================
src/train_sft.py FAILED
------------------------------------------------------
Failures:
[1]:
time : 2023-06-26_04:37:26
host : f07b9fe29941
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 61323)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 61323
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-06-26_04:37:26
host : f07b9fe29941
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 61322)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 61322
====================================================== |
Try turning off NCCL sync. |
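The next reply reads this as disabling NCCL peer-to-peer transfers; a hedged sketch of setting these standard NCCL environment switches before launching (same assumed filenames as above):
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1   # optionally disable the InfiniBand transport as well
accelerate launch --config_file accelerate_config.yaml src/train_sft.py <your training args>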
After adding NCCL_P2P_DISABLE=1 it dies at the very first step. @shaonianyr |
@Louis-y-nlp Did you get multi-GPU fine-tuning working? |
No, it keeps hanging inside Docker. |
@Louis-y-nlp |
torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx)) |
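If the training script exposes Hugging Face TrainingArguments (as the LLaMA-Factory example further down does), roughly the same timeout can also be raised from the command line; a sketch with an illustrative value and the assumed filenames from above:
# --ddp_timeout is given in seconds and is forwarded to the process-group timeout.
accelerate launch --config_file accelerate_config.yaml src/train_sft.py \
    --ddp_timeout 18000 <your other training args>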
@Louis-y-nlp Has your multi-GPU fine-tuning worked yet? |
No. Multi-GPU keeps hanging, and since there is no error output at all I don't know how to debug it; only single-GPU runs. |
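For a hang that produces no error output, a hedged debugging sketch (not something tried in this thread): these standard PyTorch/NCCL verbosity switches usually show which collective each rank is blocked on.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
accelerate launch --config_file accelerate_config.yaml src/train_sft.py <your training args>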
Have you found any other solution? I've tried several and none of them work. |
I pulled the latest code and it runs now. |
That was fast, you must be online around the clock. |
Hi, I'm hitting the same problem. I used nohup to run training in the background; what could be the cause? |
Just to ask, does this problem appear whenever nohup is used? |
Where should this be added? |
Small datasets are fine, but large datasets time out, very likely stuck at the "tokenizer on dataset" step. If so, it can be handled by setting: |
I also ran into the NCCL timeout problem: with qwen2-vl-7b, LoRA-only fine-tuning runs fine. If
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
# Specify which GPUs to use
export CUDA_VISIBLE_DEVICES=0,1
# Single-node multi-GPU training
export FORCE_TORCHRUN=1
# If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
# See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Activate the llamafactory virtual environment
source activate torch2py311cu12lmft # run from the base environment on the command line
cur_date=$(date +%Y-%m-%d)
mkdir -p /root/autodl-fs/log/$cur_date
# Choose the model
# model=qwen2_vl_2b
model=qwen2_vl_7b
# Choose the fine-tuning method
# method=full
method=lora
# Task stage: supervised instruction fine-tuning
task=sft
# Default values
pdbs=1 # per device batch size
gas=5 # gradient accumulation steps
bs=$((gas*pdbs*2)) # total batch size
steps=10 # logging steps
epoch=10 # number of epochs; default 3, then +2, +2 = 7
lr=5e-5 # learning rate
max_grad_norm=1.0 # gradient clipping threshold; 1.0 is the default for mainstream open-source LLMs
cutoff_len=2048 # cutoff length; trying 2048 to see whether it errors out
cnt=3
lr=9e-5
gas=8
bs=$((gas*pdbs*2))
echo ${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}
llamafactory-cli train \
--freeze_vision_tower false \
--max_grad_norm $max_grad_norm \
--output_dir /root/autodl-fs/saves/${model}/${method}-${task}/lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm} \
--logging_steps $steps \
--save_strategy epoch \
--per_device_train_batch_size $pdbs \
--gradient_accumulation_steps $gas \
--learning_rate $lr \
--num_train_epochs $epoch \
--model_name_or_path /root/autodl-fs/huggingface/Qwen2-VL-7B-Instruct \
--stage $task \
--do_train true \
--finetuning_type $method \
--lora_target all \
--dataset mire_train_check \
--template qwen2_vl \
--cutoff_len $cutoff_len \
--max_samples 1000 \
--overwrite_cache true \
--preprocessing_num_workers 16 \
--plot_loss true \
--overwrite_output_dir true \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
--bf16 true \
--ddp_timeout 180000000 \
--flash_attn fa2 \
--enable_liger_kernel true \
--deepspeed examples/deepspeed/ds_z2_config.json \
--eval_dataset mire_train_check \
--per_device_eval_batch_size $((pdbs*2)) \
--eval_strategy epoch \
> /root/autodl-fs/log/${cur_date}/${cnt}-${model}-${method}-${task}-lr${lr}-bs${bs}-epoch${epoch}-cutoff${cutoff_len}-grad_norm${max_grad_norm}.log 2>&1 && /usr/bin/shutdown
Error message: it gets stuck at 13/620.
2%|█▋ | 10/620 [00:59<58:46, 5.78s/it] W1123 16:02:11.882000 139878429300544 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1351 closing signal SIGTERM
|
Increasing --ddp_timeout to 360000000 doesn't help either. |
Original issue description: Hi, multi-GPU training on V100s always hits timeout errors; both 4-GPU and 2-GPU runs fail. A single GPU doesn't seem to have this problem, but it is slow: fine-tuning on 50k samples takes about 12 hours.
Run script
default_config.yaml