[BUG] The finetuning script failed just after saving the lora model. #328

Closed · 2 tasks done
milk-bottle-liyu opened this issue Mar 13, 2024 · 2 comments

milk-bottle-liyu commented Mar 13, 2024

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

I am trying to fine-tune the Qwen-VL-Chat model with LoRA on my own dataset using finetune_lora_ds.sh from the repository. My devices are V100-32G GPUs. I modified the script as follows.

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

GPUS_PER_NODE=3
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="/nas_data/pink/tesla/pretrained_models/qwen_vl_chat"  #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL"  Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="../data/v1_0_train_nus_qwen.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --ddp_timeout 7200 \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --fp16 True \
    --fix_vit True \
    --output_dir /nas_data/pink/tesla/intermediate/drive_lm/qwen_vl \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 2048 \
    --lazy_preprocess True \
    --use_lora \
    --deepspeed finetune/ds_config_zero2.json \
    --gradient_checkpointing &>> qwen_vl.log &

At first, the training process runs normally. But when the model is saved, that is, just after the 200th step, the process crashes with an ambiguous log.

{'loss': 0.4006, 'learning_rate': 1e-05, 'epoch': 0.65}
{'loss': 0.3654, 'learning_rate': 1e-05, 'epoch': 0.65}
{'loss': 0.4261, 'learning_rate': 1e-05, 'epoch': 0.65}
 13%|█▎        | 200/1530 [4:51:45<31:34:20, 85.46s/it]/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800195 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800195 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800563 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800563 milliseconds before timing out.
[2024-03-13 04:05:15,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1989172 closing signal SIGTERM
[2024-03-13 04:05:45,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1989172 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-03-13 05:17:52,533] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 1989173) of binary: /data/pink/anaconda3/envs/qwen/bin/python
Traceback (most recent call last):
  File "/data/pink/anaconda3/envs/qwen/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
finetune.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-13_04:05:15
  host      : amax
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 1989174)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1989174
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-13_04:05:15
  host      : amax
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1989173)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1989173
========================================================

I have tried the suggestion from hiyouga/LLaMA-Factory#1683 (comment) and added export NCCL_P2P_LEVEL=NVL before running the script, but it crashed in the same way.
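For reference, this is roughly what I would add at the top of the script the next time I retry, to get more detail than the bare watchdog timeout (a sketch only; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL variables, and the values here are just illustrative):

# Extra diagnostics, placed before the torchrun command in finetune_lora_ds.sh.
export NCCL_P2P_LEVEL=NVL          # the workaround suggested in hiyouga/LLaMA-Factory#1683
export NCCL_DEBUG=INFO             # verbose NCCL init/collective logging
export NCCL_DEBUG_SUBSYS=INIT,COLL # limit the verbose output to init and collectives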

The output directory contains the saved parameters and looks like a normal LoRA directory.
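For completeness, this is roughly what the checkpoint directory looks like (the checkpoint-200 name and the file list below are just the typical output of a PEFT LoRA save, not an exact listing of mine):

# Quick sanity check of the saved adapter (illustrative listing).
ls /nas_data/pink/tesla/intermediate/drive_lm/qwen_vl/checkpoint-200
# adapter_config.json  adapter_model.bin  trainer_state.json  ...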

Expected Behavior

I expect the fine-tuning script to finish without failing.

Steps To Reproduce

  1. Run finetune_lora_ds.sh with the modifications shown above.

Environment

- OS: Ubuntu 22.04
- Python: 3.8
- Transformers: 4.32.0
- PyTorch: 2.1.2 (py3.8_cuda12.1_cudnn8.9.2_0)
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
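The versions above were collected roughly like this (a sketch; the exact commands are not important):

# Commands used (approximately) to gather the environment information.
python -c 'import torch; print(torch.__version__, torch.version.cuda)'
python -c 'import transformers; print(transformers.__version__)'
nvidia-smi --query-gpu=name,memory.total --format=csv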

Anything else?

No response

yawzhe commented Mar 19, 2024

Has this been resolved?

milk-bottle-liyu (Author) commented

> Has this been resolved?

I reinstalled the packages with the versions pinned in requirements.txt; after that the error log actually appeared, and it turned out to be an OOM (out of memory) error.
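In case it helps anyone hitting the same OOM: the adjustment I would try first is lowering the per-device batch size and compensating with gradient accumulation so the effective batch size stays the same (a sketch against the script above; the numbers are only an example):

# Effective batch size stays the same: 3 GPUs x 2 x 16 = 96 instead of 3 x 4 x 8 = 96,
# but each GPU holds fewer samples at once (example values only).
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \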
