
preprocess_dataset dist.barrier crashed with NCCL communicator Socket Timeout #360

Closed

songkq opened this issue Aug 4, 2023 · 6 comments

Labels: solved (This problem has been already solved)

Comments

songkq commented Aug 4, 2023

While training Qwen-7B (SFT) on commit 2780792, preprocess_dataset crashed with an NCCL communicator Socket Timeout. Could you please give some advice on this issue?

Running tokenizer on dataset:  88%|████████▊ | 383700/433643 [29:55<03:25, 2432.70 examples/s]
Traceback (most recent call last):
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/llmtuner/tuner/tune.py", line 21, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/llmtuner/tuner/sft/workflow.py", line 28, in run_sft
    dataset = preprocess_dataset(dataset, tokenizer, data_args, training_args, stage="sft")
  File "/workspace/LLaMA-Efficient-Tuning-dev/src/llmtuner/dsets/preprocess.py", line 160, in preprocess_dataset
    with training_args.main_process_first(desc="dataset map pre-processing"):
  File "/workspace/local/anaconda3/envs/llama_etuning_py310/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/workspace/local/anaconda3/envs/llama_etuning_py310/lib/python3.10/site-packages/transformers/training_args.py", line 1978, in main_process_first
    dist.barrier()
  File "/workspace/local/anaconda3/envs/llama_etuning_py310/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: [6] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f63c8f214d7 in /worklspace/local/anaconda3/envs/llama_etuning_py310/lib/python3.10/site-packages/torch/lib/libc10.so)
...

hiyouga added the pending (This problem is yet to be addressed) label Aug 5, 2023

hiyouga (Owner) commented Aug 5, 2023

Increase the NCCL timeout threshold or use dataset streaming.
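
For reference, a minimal sketch of the timeout option, assuming the standard transformers `ddp_timeout` training argument (it is forwarded to `torch.distributed.init_process_group` as the process-group timeout, in seconds); verify the argument exists in your transformers version:

```python
# Minimal sketch: raise the distributed timeout so rank 0 can finish the
# "Running tokenizer on dataset" map inside main_process_first() before the
# other ranks time out at dist.barrier().
from datetime import timedelta

import torch.distributed as dist
from transformers import TrainingArguments

# Option A: let the HF Trainer create the process group with a longer timeout
# (default is 1800 s, i.e. 30 min; the tokenization above was already near that at 88%).
args = TrainingArguments(output_dir="output", ddp_timeout=18000)

# Option B: if the process group is initialized manually, pass the timeout directly.
# dist.init_process_group(backend="nccl", timeout=timedelta(seconds=18000))
```

Since the launcher parses TrainingArguments from the command line, this should also be expressible as an extra flag such as `--ddp_timeout 18000`; the streaming alternative avoids the long upfront map entirely (see the sketch at the end of the thread).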

songkq (Author) commented Aug 10, 2023

Thanks.

songkq closed this as completed Aug 10, 2023
hiyouga added the solved (This problem has been already solved) label and removed the pending (This problem is yet to be addressed) label Aug 10, 2023

fengcai24 commented

How can I increase the NCCL timeout threshold?

hiyouga (Owner) commented Aug 17, 2023

@fengcai24 #74 (comment)

fengcai24 commented

Thank you!

yawzhe commented Mar 19, 2024

> Increase the NCCL timeout threshold or use dataset streaming

How do I use that, and what should I do about the NCCL timeout? I set ddp_time, and when the data is loaded there are two run_tokenizre_data progress bars: the first one gets through a large amount of data fine, but the second one does not.
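
For context, the streaming option refers to the Hugging Face datasets streaming mode: examples are read and tokenized on the fly instead of being mapped up front, so the non-main ranks no longer sit at dist.barrier() while rank 0 tokenizes everything. A minimal sketch of the underlying API (the file path is a made-up example; in this repo the behaviour is exposed through the data arguments, typically a --streaming flag, which may differ by version):

```python
# Sketch: with streaming=True, load_dataset returns an IterableDataset, so no
# full "Running tokenizer on dataset" pass has to finish before training starts.
from datasets import load_dataset

stream = load_dataset(
    "json",
    data_files="data/train.json",  # hypothetical path, replace with your dataset
    split="train",
    streaming=True,
)
for example in stream:  # examples are yielded lazily, one at a time
    print(example)
    break
```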
