是否已有关于该错误的issue或讨论? | Is there an existing issue / discussion for this?
我已经搜索过已有的issues和讨论 | I have searched the existing issues / discussions
该问题是否在FAQ中有解答? | Is there an existing answer for this in FAQ?
我已经搜索过FAQ | I have searched FAQ
当前行为 | Current Behavior
I am trying to fine-tune the Qwen-VL-Chat model with LoRA on my own dataset using the finetune_lora_ds.sh script from the repository. My devices are V100-32G GPUs. I modified the script as follows.
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=`pwd`
GPUS_PER_NODE=3
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
MODEL="/nas_data/pink/tesla/pretrained_models/qwen_vl_chat" #"Qwen/Qwen-VL-Chat"/"Qwen/Qwen-VL" Set the path if you do not want to load from huggingface directly
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="../data/v1_0_train_nus_qwen.json"
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS finetune.py \
--ddp_timeout 7200 \
--model_name_or_path $MODEL \
--data_path $DATA \
--fp16 True \
--fix_vit True \
--output_dir /nas_data/pink/tesla/intermediate/drive_lm/qwen_vl \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 200 \
--save_total_limit 10 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--adam_beta2 0.95 \
--warmup_ratio 0.01 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "none" \
--model_max_length 2048 \
--lazy_preprocess True \
--use_lora \
--deepspeed finetune/ds_config_zero2.json \
--gradient_checkpointing &>> qwen_vl.log &
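For reference, the effective global batch size implied by these settings (my own arithmetic, not something the script prints):
# per_device_train_batch_size x gradient_accumulation_steps x GPUS_PER_NODE
#   4 x 8 x 3 = 96 samples per optimizer step
# Over 5 epochs this corresponds to the 1530 total optimizer steps shown in the
# progress bar below, i.e. roughly 306 steps and ~29k training samples per epoch.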
At first, training proceeds normally. But when the model is saved, i.e. right after step 200, the process crashes with a rather uninformative log.
{'loss': 0.4006, 'learning_rate': 1e-05, 'epoch': 0.65}
{'loss': 0.3654, 'learning_rate': 1e-05, 'epoch': 0.65}
{'loss': 0.4261, 'learning_rate': 1e-05, 'epoch': 0.65}
13%|█▎ | 200/1530 [4:51:45<31:34:20, 85.46s/it]/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/nn/modules/module.py:1879: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
warnings.warn(
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800195 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800195 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800563 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3310, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800563 milliseconds before timing out.
[2024-03-13 04:05:15,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1989172 closing signal SIGTERM
[2024-03-13 04:05:45,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1989172 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-03-13 05:17:52,533] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 1989173) of binary: /data/pink/anaconda3/envs/qwen/bin/python
Traceback (most recent call last):
File "/data/pink/anaconda3/envs/qwen/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.1.2', 'console_scripts', 'torchrun')())
File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/pink/anaconda3/envs/qwen/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
finetune.py FAILED
--------------------------------------------------------
Failures:
[1]:
time : 2024-03-13_04:05:15
host : amax
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 1989174)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1989174
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-13_04:05:15
host : amax
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 1989173)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1989173
========================================================
I have tried the suggestion from hiyouga/LLaMA-Factory#1683 (comment) and added export NCCL_P2P_LEVEL=NVL before running the script, but it crashed in the same way.
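Concretely, the workaround was applied roughly like this (shown as if added at the top of the launcher script; exporting it in the shell before invoking the script is equivalent):
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_P2P_LEVEL=NVL   # suggestion from hiyouga/LLaMA-Factory#1683
# ... rest of the script unchanged ...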
The output dir contains the saved parameters and looks like a normal LoRA directory.
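A quick sanity check of that directory, assuming the standard PEFT adapter layout and the HF Trainer's default checkpoint naming (checkpoint-200 for the step-200 save), would be:
ls /nas_data/pink/tesla/intermediate/drive_lm/qwen_vl/checkpoint-200
# a normal PEFT/LoRA save typically contains adapter_config.json and
# adapter_model.bin (or adapter_model.safetensors) alongside the trainer state files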
期望行为 | Expected Behavior
I expect the fine-tuning script to complete without failing.
复现方法 | Steps To Reproduce
finetune_lora_ds.sh
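In other words, reproduction is just running the modified script above on a single node with 3x V100-32G, e.g.:
# finetune_lora_ds.sh here refers to the repo script with the modifications shown under "Current Behavior"
bash finetune_lora_ds.sh
# training runs normally until the first checkpoint save at step 200, then hits the NCCL allreduce timeout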
运行环境 | Environment
备注 | Anything else?
No response