
DeepSpeed multi-node, multi-GPU training hangs at the first batch and then reports a Socket Timeout #1630

Closed
1 task done
HaimianYu opened this issue Nov 24, 2023 · 2 comments
Labels
solved This problem has been already solved

Comments


HaimianYu commented Nov 24, 2023

Reminder

  • I have read the README and searched the existing issues.

Reproduction

With two GPUs, whether using LoRA or full-parameter fine-tuning, training hangs at the first batch and then reports a timeout. The GPU on the master node has its memory allocated and sits at 100% utilization; the GPU on the second node also has memory allocated but stays at 0% utilization.

The launch script is as follows:
export NCCL_DEBUG=INFO
#export NCCL_IB_DISABLE=1
#export CUDA_LAUNCH_BLOCKING=1

deepspeed --hostfile hostfile --master_addr 10.197.226.129 --master_port 9993 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage pt \
    --model_name_or_path ../models/chatglm3-6b/ \
    --do_train \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir ../ckpts/Baichuan-7B/ \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 30000.0 \
    --plot_loss \
    --fp16 \
    --cache_path ../datasets/cache/wiki
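
The hostfile passed via --hostfile is not shown in this report. As a rough sketch (the hostnames and slot counts below are assumptions, not taken from the issue), a DeepSpeed hostfile for two nodes with one GPU each would look like:

# hostfile -- hypothetical contents, one "hostname slots=<num_gpus>" entry per node
10.197.226.129 slots=1
worker02 slots=1

Only a single --master_addr is passed because it names the rendezvous address of the rank-0 node; DeepSpeed reads the remaining nodes from the hostfile and launches on them over SSH.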

The DeepSpeed configuration is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
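
As pasted, the configuration lost its quotation marks; on disk it must be valid JSON, with the "auto" placeholders stored as strings so that the Hugging Face Trainer / DeepSpeed integration can substitute concrete values from the training arguments. A minimal sketch (assuming the file is named ds_config.json, as in the launch command) to check this before launching:

import json

# Load the DeepSpeed config and list which fields are left for the HF Trainer
# integration to resolve. json.load() raises JSONDecodeError if the file is
# malformed (e.g. unquoted keys, as in the paste above).
with open("ds_config.json") as f:
    cfg = json.load(f)

def find_auto(node, path=""):
    """Yield dotted paths of every value equal to the string "auto"."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_auto(value, f"{path}.{key}" if path else key)
    elif node == "auto":
        yield path

print("fields resolved automatically:", list(find_auto(cfg)))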

Expected behavior

The error output is as follows:

0%|          | 0/2550000 [00:00<?, ?it/s]
worker02: Traceback (most recent call last):
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/train_bash.py", line 14, in <module>
worker02:     main()
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/train_bash.py", line 5, in main
worker02:     run_exp()
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/llmtuner/train/tuner.py", line 25, in run_exp
worker02:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 41, in run_pt
worker02:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
worker02:   File "/usr/local/lib/python3.9/site-packages/transformers/trainer.py", line 1553, in train
worker02:     return inner_training_loop(
worker02:   File "/usr/local/lib/python3.9/site-packages/transformers/trainer.py", line 1813, in _inner_training_loop
worker02:     for step, inputs in enumerate(epoch_iterator):
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/data_loader.py", line 379, in __iter__
worker02:     synchronize_rng_states(self.rng_types, self.synchronized_generator)
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/utils/random.py", line 111, in synchronize_rng_states
worker02:     synchronize_rng_state(RNGType(rng_type), generator=generator)
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/utils/random.py", line 89, in synchronize_rng_state
worker02:     torch.distributed.broadcast(rng_state, 0)
worker02:   File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
worker02:     return func(*args, **kwargs)
worker02:   File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1566, in broadcast
worker02:     work = default_pg.broadcast([tensor], opts)
worker02: RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
worker02: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first):
worker02: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efde17064d7 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
worker02: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7efde16d0434 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
worker02: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7efe0cc72a78 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7efe0cc73722 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7efe0cc737a9 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efe0cc32c71 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efe0cc32c71 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efe0cc32c71 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7efde2695c4f in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
worker02: frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x201 (0x7efde2699901 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
worker02: frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x40a (0x7efde26a5a5a in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
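
The traceback shows a worker rank timing out while it waits in the c10d TCPStore for rank 0's ncclUniqueId, i.e. the hang happens during NCCL communicator setup for the very first collective (the RNG-state broadcast), before any training step runs. A minimal standalone sketch to exercise just that rendezvous and broadcast outside LLaMA-Factory (the script name, torchrun launch line, and device handling below are assumptions):

import os
import torch
import torch.distributed as dist

# Hypothetical connectivity check, e.g. saved as nccl_check.py and launched on
# each node with:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0|1> \
#            --master_addr=10.197.226.129 --master_port=9993 nccl_check.py
def main():
    dist.init_process_group(backend="nccl")  # rendezvous via MASTER_ADDR/MASTER_PORT
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))
    t = torch.tensor([float(rank)], device="cuda")
    dist.broadcast(t, src=0)  # same collective that times out in the log above
    print(f"rank {rank}: broadcast ok, value = {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this sketch also hangs, the cause is usually network-level (the master port blocked by a firewall, or NCCL/Gloo binding to the wrong interface, often worked around by setting NCCL_SOCKET_IFNAME / GLOO_SOCKET_IFNAME) rather than anything in the training code.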

System Info

No response

Others

No response

hiyouga added and then removed the "wontfix (This will not be worked on)" label on Dec 1, 2023
hiyouga (Owner) commented Dec 1, 2023

#1683

hiyouga added the "solved (This problem has been already solved)" label on Dec 1, 2023
hiyouga closed this as completed on Dec 1, 2023
bravelyi commented

Hello, may I ask whether deepspeed --hostfile hostfile --master_addr 10.197.226.129 --master_port 9993 src/train_bash.py is your multi-node, multi-GPU setup? Could you share the contents of your hostfile? One more question about multi-node runs: why is only a single master_addr specified here?
