torch.distributed.elastic.multiprocessing.errors.ChildFailedError #7

Closed
WangRongsheng opened this issue Jun 4, 2023 · 3 comments
Labels: solved (This problem has been already solved)

@WangRongsheng

Training command:

accelerate launch src/train_sft.py \
    --model_name_or_path llama-hf/llama-13b-hf \
    --do_train \
    --dataset ChangChunTeng \
    --finetuning_type lora \
    --output_dir CCT/sft \
    --overwrite_cache \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --resume_lora_training False \
    --plot_loss \
    --fp16
@WangRongsheng (Author)

Running tokenizer on dataset:   0%|                                                                | 0/226042 [00:00<?, ? examples/s]
06/04/2023 11:06:14 - INFO - datasets.arrow_dataset - Caching processed dataset at /root/.cache/huggingface/datasets/wangrongsheng___json/wangrongsheng--ChangChunTeng-220k-d576ed39544bf546/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-7900d18e31cbb541.arrow
Traceback (most recent call last):
  File "/tmp/CCT/src/train_sft.py", line 97, in <module>
    main()
  File "/tmp/CCT/src/train_sft.py", line 27, in main
    dataset = preprocess_data(dataset, tokenizer, data_args, training_args, stage="sft")
  File "/tmp/CCT/src/utils/common.py", line 475, in preprocess_data
    print_supervised_dataset_example(dataset[0])
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2778, in __getitem__
    return self._getitem(key)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 2762, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 578, in query_table
    _check_valid_index_key(key, size)
  File "/root/miniconda3/envs/xray/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 521, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 0 is out of bounds for size 0
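
What the traceback shows: preprocessing produced an empty dataset (size 0), so indexing example 0 fails. A minimal sketch reproducing the same error with the datasets library (the column names are illustrative):

from datasets import Dataset

# An empty dataset, as you get when every example is dropped during preprocessing.
ds = Dataset.from_dict({"instruction": [], "input": [], "output": []})
print(len(ds))  # 0
ds[0]           # IndexError: Invalid key: 0 is out of bounds for size 0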

@WangRongsheng (Author)

Please follow the data format below, and note that the prompt must not be empty:

"数据集名称": {
    "hf_hub_url": "HuggingFace上的项目地址(若指定,则忽略下列三个参数)",
    "script_url": "包含数据加载脚本的本地文件夹名称(若指定,则忽略下列两个参数)",
    "file_name": "该目录下数据集文件的名称(若上述参数未指定,则此项必需)",
    "file_sha1": "数据集文件的SHA-1哈希值(可选)",
    "columns": {
        "prompt": "数据集代表提示词的表头名称(默认:instruction)",
        "query": "数据集代表请求的表头名称(默认:input)",
        "response": "数据集代表回答的表头名称(默认:output)",
        "history": "数据集代表历史对话的表头名称(默认:None)"
    }
}
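
For example, a concrete entry for the dataset used in the command above might look like the following (the file name and column mapping are assumptions for illustration, not the issue author's actual files):

"ChangChunTeng": {
    "file_name": "changchunteng.json",
    "columns": {
        "prompt": "instruction",
        "query": "input",
        "response": "output"
    }
}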

@hiyouga added the solved label Jun 4, 2023
@hiyouga (Owner) commented Jun 4, 2023

This kind of error usually means the columns definition in dataset_info.json is wrong; you need to check the dataset definition.
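
As a quick sanity check (a hypothetical snippet, not part of this repo), you can load the raw file directly and confirm that it is non-empty and that the mapped columns actually exist. The path, the JSON-array layout, and the column names below are assumptions:

import json

# Assumptions: the dataset is a local JSON array of records at this path,
# and dataset_info.json maps prompt/response to these column names.
with open("data/changchunteng.json", encoding="utf-8") as f:
    examples = json.load(f)

assert len(examples) > 0, "dataset file is empty"
missing = {"instruction", "output"} - set(examples[0])
assert not missing, f"first example is missing columns: {missing}"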
