[BUG] The finetuning script failed just after saving the lora model. #328

Closed · 2 tasks done
milk-bottle-liyu opened this issue Mar 13, 2024 · 2 comments

milk-bottle-liyu commented Mar 13, 2024

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

I am trying to fine-tune the Qwen-VL-Chat model with LoRA on my own dataset using finetune_lora_ds.sh from the repository. My devices are V100-32G GPUs. I modified the script as follows.

#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`

GPUS_PER_NODE=3
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="/nas_data/pink/tesla/pretrained_models/qwen_vl_chat"  #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL"  Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="../data/v1_0_train_nus_qwen.json"

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS finetune.py \
    --ddp_timeout 7200 \
    --model_name_or_path $MODEL \
    --data_path $DATA \
    --fp16 True \
    --fix_vit True \
    --output_dir /nas_data/pink/tesla/intermediate/drive_lm/qwen_vl \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 10 \
    --learning_rate 1e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 2048 \
    --lazy_preprocess True \
    --use_lora \
    --deepspeed finetune/ds_config_zero2.json \
    --gradient_checkpointing &>> qwen_vl.log &

At first, the training process runs normally. But when the model is saved, that is, just after the 200th step, the process crashes with an ambiguous log.

{'loss': 0.4006, 'learning_rate': 1e-05, 'epoch': 0.65}
{'loss': 0.3654, 'learning_rate': 1e-05, 'epoch': 0.65}
{'loss': 0.4261, 'learning_rate': 1e-05, 'epoch': 0.65}
 13%|█▎        | 200/1530 [4:51:45<31:34:20, 85.46s/it]/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800195 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800195 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800563 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800563 milliseconds before timing out.
[2024-03-13 04:05:15,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1989172 closing signal SIGTERM
[2024-03-13 04:05:45,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1989172 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-03-13 05:17:52,533] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 1989173) of binary: /data/pink/anaconda3/envs/qwen/bin/python
Traceback (most recent call last):
  File "/data/pink/anaconda3/envs/qwen/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
finetune.py FAILED
--------------------------------------------------------
Failures:
[1]:
  time      : 2024-03-13_04:05:15
  host      : amax
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 1989174)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1989174
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-13_04:05:15
  host      : amax
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1989173)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1989173
========================================================

I have tried the suggestion from hiyouga/LLaMA-Factory#1683 (comment) and added export NCCL_P2P_LEVEL=NVL before running the script, but it crashed in the same way.
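For reference, this is roughly what I would add at the top of the script the next time I retry, to get more detail than the bare watchdog timeout (a sketch only; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL variables, and the values here are just illustrative):

# Extra diagnostics, placed before the torchrun command in finetune_lora_ds.sh.
export NCCL_P2P_LEVEL=NVL          # the workaround suggested in hiyouga/LLaMA-Factory#1683
export NCCL_DEBUG=INFO             # verbose NCCL init/collective logging
export NCCL_DEBUG_SUBSYS=INIT,COLL # limit the verbose output to init and collectives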

The output directory contains the saved parameters and looks like a normal LoRA directory.
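For completeness, this is roughly what the checkpoint directory looks like (the checkpoint-200 name and the file list below are just the typical output of a PEFT LoRA save, not an exact listing of mine):

# Quick sanity check of the saved adapter (illustrative listing).
ls /nas_data/pink/tesla/intermediate/drive_lm/qwen_vl/checkpoint-200
# adapter_config.json  adapter_model.bin  trainer_state.json  ...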

Expected Behavior

I expect the fine-tuning script to finish without failing.

Steps To Reproduce

  1. Run finetune_lora_ds.sh with the modifications shown above.

Environment

- OS: Ubuntu 22.04
- Python: 3.8
- Transformers: 4.32.0
- PyTorch: 2.1.2 (py3.8_cuda12.1_cudnn8.9.2_0)
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 12.1
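The versions above were collected roughly like this (a sketch; the exact commands are not important):

# Commands used (approximately) to gather the environment information.
python -c 'import torch; print(torch.__version__, torch.version.cuda)'
python -c 'import transformers; print(transformers.__version__)'
nvidia-smi --query-gpu=name,memory.total --format=csv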

Anything else?

No response

yawzhe commented Mar 19, 2024

Has this been resolved?

milk-bottle-liyu (Author) commented

> Has this been resolved?

I reinstalled the packages with the versions pinned in requirements.txt; after that the error log actually appeared, and it turned out to be an OOM (out of memory) error.
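In case it helps anyone hitting the same OOM: the adjustment I would try first is lowering the per-device batch size and compensating with gradient accumulation so the effective batch size stays the same (a sketch against the script above; the numbers are only an example):

# Effective batch size stays the same: 3 GPUs x 2 x 16 = 96 instead of 3 x 4 x 8 = 96,
# but each GPU holds fewer samples at once (example values only).
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \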
