
Multi-GPU fine-tuning of an LLM with accelerate and deepspeed hangs #1683

Closed

Aitejiu opened this issue Nov 30, 2023 · 6 comments
Labels
good first issue (Good for newcomers) · solved (This problem has been already solved)

Comments

Aitejiu commented Nov 30, 2023

Reminder

  • I have read the README and searched the existing issues.

Reproduction

DeepSpeed launch command

deepspeed --num_gpus 2 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    --model_name_or_path /home/zhmao/model/Baichuan-13B-chat \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template baichuan \
    --finetuning_type lora \
    --lora_rank 32 \
    --lora_target all \
    --output_dir /home/zhmao/model/Baichuan-13B-QLoRA \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 16 \
    --cutoff_len 1024 \
    --optim paged_adamw_32bit \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --warmup_steps 100 \
    --learning_rate 3e-5 \
    --max_grad_norm 0.5 \
    --num_train_epochs 2.0 \
    --quantization_bit 4 \
    --plot_loss \
    --fp16

DeepSpeed config (ds_config.json)

{
    "train_micro_batch_size_per_gpu": "auto",
    "zero_allow_untested_optimizer": true,
    "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "initial_scale_power": 16, 
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1
    },  
    "zero_optimization": {
      "stage": 2,
      "allgather_partitions": true,
      "allgather_bucket_size": 5e8,
      "overlap_comm": false,
      "reduce_scatter": true,
      "reduce_bucket_size": 5e8,
      "contiguous_gradients" : true
    }
  }

accelerate launch command

accelerate launch /home/zhmao/.cache/huggingface/accelerate/default_config.yaml \
    src/train_bash.py \
    --stage sft \
    --model_name_or_path /home/zhmao/model/Baichuan-13B-chat \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --template baichuan \
    --finetuning_type lora \
    --lora_rank 32 \
    --lora_target all \
    --output_dir /home/zhmao/model/Baichuan-13B-QLoRA \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --preprocessing_num_workers 16 \
    --cutoff_len 1024 \
    --optim paged_adamw_32bit \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --warmup_steps 100 \
    --learning_rate 3e-5 \
    --max_grad_norm 0.5 \
    --num_train_epochs 2.0 \
    --quantization_bit 4 \
    --plot_loss \
    --fp16
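
A note on the accelerate invocation above: in the accelerate releases I am familiar with, the first positional argument of accelerate launch is treated as the script to run, and the config file is normally supplied via the --config_file option. A minimal sketch of that form, reusing the same paths and keeping the remaining training arguments unchanged (abbreviated here):

accelerate launch \
    --config_file /home/zhmao/.cache/huggingface/accelerate/default_config.yaml \
    src/train_bash.py \
    --stage sft \
    ...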

The example above uses Baichuan-13B-chat, but I also tried ChatGLM2-6B and it hangs in the same way.
Issues I have already checked:
#74
#1651

Expected behavior

Multi-GPU training should start correctly and accelerate fine-tuning.

System Info

  • transformers version: 4.33.2
  • Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.27
  • Python version: 3.9.0
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.4.0
  • Accelerate version: 0.24.1
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 2
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

GPU: 2 × A6000 (40 GB)

Others

Full terminal output

[2023-11-30 11:24:07,852] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-30 11:24:09,226] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-11-30 11:24:09,226] [INFO] [runner.py:570:main] cmd = /home/zhmao/anaconda3/envs/LLaMa-factory/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=9901 --enable_each_rank_log=None src/train_bash.py --deepspeed ds_config.json --stage sft --model_name_or_path /home/zhmao/model/Baichuan-13B-chat --do_train --dataset alpaca_gpt4_zh --template baichuan --finetuning_type lora --lora_rank 32 --lora_target all --output_dir /home/zhmao/model/Baichuan-13B-QLoRA --per_device_train_batch_size 4 --gradient_accumulation_steps 8 --preprocessing_num_workers 16 --cutoff_len 1024 --optim paged_adamw_32bit --lr_scheduler_type cosine --logging_steps 10 --save_steps 100 --eval_steps 100 --warmup_steps 100 --learning_rate 3e-5 --max_grad_norm 0.5 --num_train_epochs 2.0 --quantization_bit 4 --plot_loss --fp16
[2023-11-30 11:24:11,215] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-30 11:24:12,559] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-11-30 11:24:12,559] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-11-30 11:24:12,559] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-11-30 11:24:12,559] [INFO] [launch.py:163:main] dist_world_size=2
[2023-11-30 11:24:12,559] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-11-30 11:24:16,369] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-11-30 11:24:16,393] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/zhmao/anaconda3/envs/LLaMa-factory/lib/python3.9/site-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
/home/zhmao/anaconda3/envs/LLaMa-factory/lib/python3.9/site-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
[2023-11-30 11:24:18,123] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-30 11:24:18,315] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-11-30 11:24:18,315] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
11/30/2023 11:24:19 - WARNING - llmtuner.model.parser - We recommend enable `upcast_layernorm` in quantized training.
11/30/2023 11:24:19 - WARNING - llmtuner.model.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1332] 2023-11-30 11:24:19,136 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1764] 2023-11-30 11:24:19,136 >> PyTorch: setting up devices
/home/zhmao/anaconda3/envs/LLaMa-factory/lib/python3.9/site-packages/transformers/training_args.py:1677: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
11/30/2023 11:24:19 - INFO - llmtuner.model.parser - Process rank: 0, device: cuda:0, n_gpu: 1
  distributed training: True, compute dtype: torch.float16
11/30/2023 11:24:19 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=ds_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=100.0,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=8,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=3e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/zhmao/model/Baichuan-13B-QLoRA/runs/Nov30_11-24-18_qlb-AS-4124GS-TNR,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=0.5,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2.0,
optim=paged_adamw_32bit,
optim_args=None,
output_dir=/home/zhmao/model/Baichuan-13B-QLoRA,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/home/zhmao/model/Baichuan-13B-QLoRA,
save_on_each_node=False,
save_safetensors=False,
save_steps=100,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=100,
weight_decay=0.0,
)
11/30/2023 11:24:19 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
11/30/2023 11:24:19 - WARNING - llmtuner.model.parser - We recommend enable `upcast_layernorm` in quantized training.
11/30/2023 11:24:19 - WARNING - llmtuner.model.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
/home/zhmao/anaconda3/envs/LLaMa-factory/lib/python3.9/site-packages/transformers/training_args.py:1677: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
11/30/2023 11:24:19 - INFO - llmtuner.model.parser - Process rank: 1, device: cuda:1, n_gpu: 1
  distributed training: True, compute dtype: torch.float16
11/30/2023 11:24:19 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=ds_config.json,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=100.0,
evaluation_strategy=no,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=8,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=3e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=/home/zhmao/model/Baichuan-13B-QLoRA/runs/Nov30_11-24-18_qlb-AS-4124GS-TNR,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=cosine,
max_grad_norm=0.5,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=2.0,
optim=paged_adamw_32bit,
optim_args=None,
output_dir=/home/zhmao/model/Baichuan-13B-QLoRA,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
predict_with_generate=False,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/home/zhmao/model/Baichuan-13B-QLoRA,
save_on_each_node=False,
save_safetensors=False,
save_steps=100,
save_strategy=steps,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=100,
weight_decay=0.0,
)
11/30/2023 11:24:19 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
Using custom data configuration default-4f195a63697fe826
Loading Dataset Infos from /home/zhmao/anaconda3/envs/LLaMa-factory/lib/python3.9/site-packages/datasets/packaged_modules/json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (/home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
[INFO|tokenization_utils_base.py:1850] 2023-11-30 11:24:20,331 >> loading file tokenizer.model
[INFO|tokenization_utils_base.py:1850] 2023-11-30 11:24:20,331 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1850] 2023-11-30 11:24:20,331 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1850] 2023-11-30 11:24:20,331 >> loading file tokenizer_config.json
[INFO|configuration_utils.py:713] 2023-11-30 11:24:20,356 >> loading configuration file /home/zhmao/model/Baichuan-13B-chat/config.json
[INFO|configuration_utils.py:713] 2023-11-30 11:24:20,357 >> loading configuration file /home/zhmao/model/Baichuan-13B-chat/config.json
[INFO|configuration_utils.py:775] 2023-11-30 11:24:20,358 >> Model config BaichuanConfig {
  "_from_model_config": true,
  "_name_or_path": "/home/zhmao/model/Baichuan-13B-chat",
  "architectures": [
    "BaichuanForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_baichuan.BaichuanConfig",
    "AutoModelForCausalLM": "modeling_baichuan.BaichuanForCausalLM"
  },
  "bos_token_id": 1,
  "eos_token_id": 2,
  "gradient_checkpointing": [
    false
  ],
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 13696,
  "model_max_length": 4096,
  "model_type": "baichuan",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "pad_token_id": 0,
  "rms_norm_eps": 1e-06,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.33.2",
  "use_cache": true,
  "vocab_size": 64000
}

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
11/30/2023 11:24:20 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
[INFO|modeling_utils.py:2866] 2023-11-30 11:24:20,382 >> loading weights file /home/zhmao/model/Baichuan-13B-chat/pytorch_model.bin.index.json
[INFO|modeling_utils.py:1200] 2023-11-30 11:24:20,382 >> Instantiating BaichuanForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:768] 2023-11-30 11:24:20,382 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.33.2"
}

[INFO|modeling_utils.py:2983] 2023-11-30 11:24:20,415 >> Detected 4-bit loading: activating 4-bit loading for this model
11/30/2023 11:24:20 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
Loading checkpoint shards: 100%|█████████████████████████████| 3/3 [00:14<00:00,  4.77s/it]
11/30/2023 11:24:35 - INFO - llmtuner.model.utils - Gradient checkpointing enabled.
11/30/2023 11:24:35 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
11/30/2023 11:24:35 - INFO - llmtuner.model.utils - Found linear modules: down_proj,W_pack,up_proj,o_proj,gate_proj
11/30/2023 11:24:36 - INFO - llmtuner.model.loader - trainable params: 111575040 || all params: 13376476160 || trainable%: 0.8341
Loading checkpoint shards: 100%|█████████████████████████████| 3/3 [00:15<00:00,  5.33s/it]
[INFO|modeling_utils.py:3655] 2023-11-30 11:24:36,546 >> All model checkpoint weights were used when initializing BaichuanForCausalLM.

[INFO|modeling_utils.py:3663] 2023-11-30 11:24:36,546 >> All the weights of BaichuanForCausalLM were initialized from the model checkpoint at /home/zhmao/model/Baichuan-13B-chat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BaichuanForCausalLM for predictions without further training.
[INFO|configuration_utils.py:728] 2023-11-30 11:24:36,550 >> loading configuration file /home/zhmao/model/Baichuan-13B-chat/generation_config.json
[INFO|configuration_utils.py:768] 2023-11-30 11:24:36,550 >> Generate config GenerationConfig {
  "assistant_token_id": 196,
  "bos_token_id": 1,
  "do_sample": true,
  "eos_token_id": 2,
  "max_new_tokens": 2048,
  "pad_token_id": 0,
  "repetition_penalty": 1.1,
  "temperature": 0.3,
  "top_k": 5,
  "top_p": 0.85,
  "transformers_version": "4.33.2",
  "user_token_id": 195
}

11/30/2023 11:24:36 - INFO - llmtuner.model.utils - Gradient checkpointing enabled.
11/30/2023 11:24:36 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
11/30/2023 11:24:36 - INFO - llmtuner.model.utils - Found linear modules: down_proj,up_proj,o_proj,gate_proj,W_pack
11/30/2023 11:24:37 - INFO - llmtuner.model.loader - trainable params: 111575040 || all params: 13376476160 || trainable%: 0.8341
[INFO|tokenization_utils_base.py:926] 2023-11-30 11:24:37,912 >> Assigning [] to the additional_special_tokens key of the tokenizer
Process #0 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00000_of_00016.arrow
Process #1 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00001_of_00016.arrow
Process #2 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00002_of_00016.arrow
Process #3 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00003_of_00016.arrow
Process #4 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00004_of_00016.arrow
Process #5 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00005_of_00016.arrow
Process #6 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00006_of_00016.arrow
Process #7 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00007_of_00016.arrow
Process #8 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00008_of_00016.arrow
Process #9 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00009_of_00016.arrow
Process #10 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00010_of_00016.arrow
Process #11 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00011_of_00016.arrow
Process #12 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00012_of_00016.arrow
Process #13 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00013_of_00016.arrow
Process #14 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00014_of_00016.arrow
Process #15 will write at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_00015_of_00016.arrow
Loading cached processed dataset at /home/zhmao/.cache/huggingface/datasets/json/default-4f195a63697fe826/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-87dc526b91464992_*_of_00016.arrow
Concatenating 16 shards
input_ids:
[195, 31106, 4550, 19463, 7841, 7868, 73, 196, 31106, 4567, 31161, 4550, 19463, 7841, 7868, 77, 5, 5, 53, 79, 31106, 4550, 3606, 2148, 73, 3526, 31345, 11886, 31135, 3606, 4467, 72, 31248, 32188, 31583, 76, 21332, 31399, 17268, 72, 31196, 6520, 28165, 2337, 72, 7552, 12421, 6029, 72, 31404, 20387, 5972, 16573, 73, 5, 5, 54, 79, 31106, 24691, 9945, 73, 3526, 11164, 12420, 31135, 11748, 76, 11603, 76, 31233, 32570, 31368, 31188, 12019, 13443, 31664, 31135, 18085, 6768, 72, 6076, 31229, 32242, 76, 31229, 12019, 31188, 10523, 6186, 72, 31187, 4550, 19463, 9945, 6269, 73, 5, 5, 55, 79, 31106, 11923, 15932, 73, 11923, 31209, 7776, 2337, 31475, 31262, 2462, 72, 17951, 3526, 31363, 6196, 31106, 59, 31136, 60, 31106, 4237, 31135, 11923, 73, 9636, 11923, 20387, 17832, 6550, 72, 6520, 3606, 6691, 72, 31404, 3806, 3300, 22645, 9684, 31258, 73, 2]
inputs:
 <reserved_102> 保持健康的三个提示。<reserved_103> 以下是保持健康的三个提示:

1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。

3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。</s>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, 31106, 4567, 31161, 4550, 19463, 7841, 7868, 77, 5, 5, 53, 79, 31106, 4550, 3606, 2148, 73, 3526, 31345, 11886, 31135, 3606, 4467, 72, 31248, 32188, 31583, 76, 21332, 31399, 17268, 72, 31196, 6520, 28165, 2337, 72, 7552, 12421, 6029, 72, 31404, 20387, 5972, 16573, 73, 5, 5, 54, 79, 31106, 24691, 9945, 73, 3526, 11164, 12420, 31135, 11748, 76, 11603, 76, 31233, 32570, 31368, 31188, 12019, 13443, 31664, 31135, 18085, 6768, 72, 6076, 31229, 32242, 76, 31229, 12019, 31188, 10523, 6186, 72, 31187, 4550, 19463, 9945, 6269, 73, 5, 5, 55, 79, 31106, 11923, 15932, 73, 11923, 31209, 7776, 2337, 31475, 31262, 2462, 72, 17951, 3526, 31363, 6196, 31106, 59, 31136, 60, 31106, 4237, 31135, 11923, 73, 9636, 11923, 20387, 17832, 6550, 72, 6520, 3606, 6691, 72, 31404, 3806, 3300, 22645, 9684, 31258, 73, 2]
labels:
 以下是保持健康的三个提示:

1. 保持身体活动。每天做适当的身体运动,如散步、跑步或游泳,能促进心血管健康,增强肌肉力量,并有助于减少体重。

2. 均衡饮食。每天食用新鲜的蔬菜、水果、全谷物和脂肪含量低的蛋白质食物,避免高糖、高脂肪和加工食品,以保持健康的饮食习惯。

3. 睡眠充足。睡眠对人体健康至关重要,成年人每天应保证 7-8 小时的睡眠。良好的睡眠有助于减轻压力,促进身体恢复,并提高注意力和记忆力。</s>

hiyouga (Owner) commented Nov 30, 2023

#1651 (comment)

hiyouga added the pending (This problem is yet to be addressed) label on Nov 30, 2023

Aitejiu (Author) commented Dec 1, 2023

#1651 (comment)

I tried that, and I also tried the DeepSpeed suggestions; what finally worked was running one extra command before fine-tuning:
export NCCL_P2P_LEVEL=NVL
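
For reference, a minimal sketch of how this combines with the DeepSpeed launch from the reproduction above (same arguments as before, abbreviated here):

export NCCL_P2P_LEVEL=NVL
deepspeed --num_gpus 2 --master_port=9901 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage sft \
    ...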

Aitejiu closed this as completed on Dec 1, 2023
hiyouga added the good first issue (Good for newcomers) and solved (This problem has been already solved) labels and removed the pending (This problem is yet to be addressed) label on Dec 1, 2023

Aitejiu (Author) commented Dec 2, 2023

#1651 (comment)

I tried that, and I also tried the DeepSpeed suggestions; what finally worked was running one extra command before fine-tuning: export NCCL_P2P_LEVEL=NVL

The command above raises the transfer bandwidth level used between the GPUs.
My guess at the root cause is that the inter-GPU transfer level was too low, so the processes kept hanging.
For a more thorough fix, you can try:
huggingface/accelerate#934
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs
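
For anyone who wants to confirm whether the interconnect is the culprit before changing NCCL settings, two generic checks can help (standard NVIDIA/NCCL tooling, not specific to this repository; shown as a rough sketch):

nvidia-smi topo -m        # prints how the two GPUs are connected (NV#, PIX, PHB, SYS, ...)
NCCL_DEBUG=INFO deepspeed --num_gpus 2 src/train_bash.py ...   # makes NCCL log how it sets up peer-to-peer transport at startup

If peer-to-peer over PCIe is what hangs, NCCL_P2P_DISABLE=1 is another commonly used workaround that disables P2P entirely, at the cost of some communication bandwidth.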

HUAFOR commented Dec 10, 2023

For export NCCL_P2P_LEVEL=NVL, is it enough to just run that command before launching training?

Len-Li commented Dec 23, 2023

For export NCCL_P2P_LEVEL=NVL, is it enough to just run that command before launching training?

Adding that line fixed the hang I was seeing with NCCL; once it was set, multi-GPU fine-tuning worked.
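
Since it is just an environment variable, it can also be scoped to a single run by prefixing the launch command instead of exporting it for the whole shell (plain shell behaviour, nothing specific to this project):

NCCL_P2P_LEVEL=NVL deepspeed --num_gpus 2 --master_port=9901 src/train_bash.py --deepspeed ds_config.json ...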

lonelydancer commented

Without NVLink, will training simply hang once the data volume gets large?
