
preprocess_dataset dist.barrier crashed with NCCL communicator Socket Timeout #360

Closed

songkq opened this issue Aug 4, 2023 · 6 comments

Labels: solved (This problem has been already solved)

Comments

songkq commented Aug 4, 2023

While training Qwen-7B (SFT) on commit 2780792, preprocess_dataset crashed with an NCCL communicator Socket Timeout. Could you please give some advice on this issue?

Running tokenizer on dataset:  88%|████████▊ | 383700/433643 [29:55<03:25, 2432.70 examples/s]
Traceback (most recent call last):
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/llmtuner/tuner/tune.py", line 21, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/llmtuner/tuner/sft/workflow.py", line 28, in run_sft
    dataset = preprocess_dataset(dataset, tokenizer, data_args, training_args, stage="sft")
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/llmtuner/dsets/preprocess.py", line 160, in preprocess_dataset
    with training_args.main_process_first(desc="dataset map pre-processing"):
  File "/workspace/local/anaconda3/envs/llama_etuning_py310/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/workspace/local/anaconda3/envs/llama_etuning_py310/lib/python3.10/site-packages/transformers/training_args.py", line 1978, in main_process_first
    dist.barrier()
  File "/workspace/local/anaconda3/envs/llama_etuning_py310/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f63c8f214d7 in /worklspace/local/anaconda3/envs/llama_etuning_py310/lib/python3.10/site-packages/torch/lib/libc10.so)
...

hiyouga added the pending (This problem is yet to be addressed) label Aug 5, 2023

hiyouga (Owner) commented Aug 5, 2023

Increase the NCCL timeout threshold or use dataset streaming.
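
For reference, a minimal sketch of the timeout option, assuming the standard transformers `ddp_timeout` training argument (it is forwarded to `torch.distributed.init_process_group` as the process-group timeout, in seconds); verify the argument exists in your transformers version:

```python
# Minimal sketch: raise the distributed timeout so rank 0 can finish the
# "Running tokenizer on dataset" map inside main_process_first() before the
# other ranks time out at dist.barrier().
from datetime import timedelta

import torch.distributed as dist
from transformers import TrainingArguments

# Option A: let the HF Trainer create the process group with a longer timeout
# (default is 1800 s, i.e. 30 min; the tokenization above was already near that at 88%).
args = TrainingArguments(output_dir="output", ddp_timeout=18000)

# Option B: if the process group is initialized manually, pass the timeout directly.
# dist.init_process_group(backend="nccl", timeout=timedelta(seconds=18000))
```

Since the launcher parses TrainingArguments from the command line, this should also be expressible as an extra flag such as `--ddp_timeout 18000`; the streaming alternative avoids the long upfront map entirely (see the sketch at the end of the thread).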

songkq (Author) commented Aug 10, 2023

Thanks.

songkq closed this as completed Aug 10, 2023
hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label Aug 10, 2023

fengcai24 commented

How can I increase the NCCL timeout threshold?

hiyouga (Owner) commented Aug 17, 2023

@fengcai24 #74 (comment)

fengcai24 commented

Thank you!

yawzhe commented Mar 19, 2024

> Increase the NCCL timeout threshold or use dataset streaming

How do I use that, and what should I do about the NCCL timeout? I set ddp_time, and when the data is loaded there are two run_tokenizre_data progress bars: the first one gets through a large amount of data fine, but the second one does not.
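
For context, the streaming option refers to the Hugging Face datasets streaming mode: examples are read and tokenized on the fly instead of being mapped up front, so the non-main ranks no longer sit at dist.barrier() while rank 0 tokenizes everything. A minimal sketch of the underlying API (the file path is a made-up example; in this repo the behaviour is exposed through the data arguments, typically a --streaming flag, which may differ by version):

```python
# Sketch: with streaming=True, load_dataset returns an IterableDataset, so no
# full "Running tokenizer on dataset" pass has to finish before training starts.
from datasets import load_dataset

stream = load_dataset(
    "json",
    data_files="data/train.json",  # hypothetical path, replace with your dataset
    split="train",
    streaming=True,
)
for example in stream:  # examples are yielded lazily, one at a time
    print(example)
    break
```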
