#1683
Hello, may I ask whether this is a multi-node, multi-GPU setup (deepspeed --hostfile hostfile --master_addr 10.197.226.129 --master_port 9993 src/train_bash.py)? Could you share the contents of your hostfile? One more question about the multi-node case: why is only a single master_addr specified here?
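For reference (not the actual file from this cluster): a DeepSpeed hostfile simply lists every node with its GPU slot count, one node per line, and --master_addr only names the rank-0 node, which is why it appears once even in multi-node runs. A placeholder sketch for a two-node setup, with hypothetical hostnames and slot counts:
worker01 slots=1
worker02 slots=1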
Reminder
Reproduction
With two GPUs, training hangs at the first batch with either LoRA or full fine-tuning and then fails with a timeout. The GPU on the master node has its memory allocated and shows 100% utilization, while the GPU on the second node has memory allocated but stays at 0% utilization.
The launch script is as follows:
export NCCL_DEBUG=INFO
#export NCCL_IB_DISABLE=1
#export CUDA_LAUNCH_BLOCKING=1

deepspeed --hostfile hostfile --master_addr 10.197.226.129 --master_port 9993 src/train_bash.py \
    --deepspeed ds_config.json \
    --stage pt \
    --model_name_or_path ../models/chatglm3-6b/ \
    --do_train \
    --dataset wiki_demo \
    --finetuning_type lora \
    --lora_target query_key_value \
    --output_dir ../ckpts/Baichuan-7B/ \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 30000.0 \
    --plot_loss \
    --fp16 \
    --cache_path ../datasets/cache/wiki
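Since NCCL_DEBUG=INFO is already set, the NCCL log printed at startup should show which network interface each rank bootstraps over; if the two nodes pick different or non-routable interfaces, the first collective hangs exactly like this. A hedged mitigation sketch that could be added before the deepspeed command (eth0 is a placeholder for whichever NIC actually connects the two machines):
# Pin NCCL and Gloo bootstrap traffic to the NIC that routes between the nodes (placeholder name).
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
# If there is no working InfiniBand path between the nodes, fall back to plain TCP.
export NCCL_IB_DISABLE=1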
The DeepSpeed config is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "overlap_comm": false,
    "contiguous_gradients": true
  }
}
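The quotes around the keys and string values were lost in pasting; assuming the file on disk is the ds_config.json referenced by the launch command, a quick way to confirm it still parses as valid JSON:
python -m json.tool ds_config.json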
Expected behavior
The error is as follows:
  0%|          | 0/2550000 [00:00<?, ?it/s]Traceback (most recent call last):
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/train_bash.py", line 14, in <module>
worker02: Traceback (most recent call last):
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/train_bash.py", line 14, in <module>
worker02:     main()
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/train_bash.py", line 5, in main
worker02:     main()
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/train_bash.py", line 5, in main
worker02:     run_exp()
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/llmtuner/train/tuner.py", line 25, in run_exp
worker02:     run_exp()
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/llmtuner/train/tuner.py", line 25, in run_exp
worker02:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 41, in run_pt
worker02:     run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
worker02:   File "/imagecenter_new/workspace/wy/workspace/LLM/LLaMA-Factory/src/llmtuner/train/pt/workflow.py", line 41, in run_pt
worker02:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
worker02:   File "/usr/local/lib/python3.9/site-packages/transformers/trainer.py", line 1553, in train
worker02:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
worker02:   File "/usr/local/lib/python3.9/site-packages/transformers/trainer.py", line 1553, in train
worker02:     return inner_training_loop(
worker02:   File "/usr/local/lib/python3.9/site-packages/transformers/trainer.py", line 1813, in _inner_training_loop
worker02:     return inner_training_loop(
worker02:   File "/usr/local/lib/python3.9/site-packages/transformers/trainer.py", line 1813, in _inner_training_loop
worker02:     for step, inputs in enumerate(epoch_iterator):
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/data_loader.py", line 379, in __iter__
worker02:     for step, inputs in enumerate(epoch_iterator):
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/data_loader.py", line 379, in __iter__
worker02:     synchronize_rng_states(self.rng_types, self.synchronized_generator)
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/utils/random.py", line 111, in synchronize_rng_states
worker02:     synchronize_rng_states(self.rng_types, self.synchronized_generator)
worker02:     synchronize_rng_state(RNGType(rng_type), generator=generator)
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/data_loader.py", line 379, in __iter__
worker02:     for step, inputs in enumerate(epoch_iterator):
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/data_loader.py", line 379, in __iter__
worker02:     synchronize_rng_states(self.rng_types, self.synchronized_generator)
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/utils/random.py", line 111, in synchronize_rng_states
worker02:     synchronize_rng_states(self.rng_types, self.synchronized_generator)
worker02:     synchronize_rng_state(RNGType(rng_type), generator=generator)
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/utils/random.py", line 111, in synchronize_rng_states
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/utils/random.py", line 89, in synchronize_rng_state
worker02:     torch.distributed.broadcast(rng_state, 0)
worker02:   File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
worker02:     synchronize_rng_state(RNGType(rng_type), generator=generator)
worker02:   File "/usr/local/lib/python3.9/site-packages/accelerate/utils/random.py", line 89, in synchronize_rng_state
worker02:     torch.distributed.broadcast(rng_state, 0)
worker02:   File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
worker02:     return func(*args, **kwargs)
worker02:   File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1566, in broadcast
worker02:     return func(*args, **kwargs)
worker02:   File "/usr/local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1566, in broadcast
worker02:     work = default_pg.broadcast([tensor], opts)
worker02: RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
worker02: Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first):
worker02: frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efde17064d7 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
worker02: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7efde16d0434 in /usr/local/lib/python3.9/site-packages/torch/lib/libc10.so)
worker02: frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7efe0cc72a78 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7efe0cc73722 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7efe0cc737a9 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efe0cc32c71 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efe0cc32c71 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7efe0cc32c71 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so)
worker02: frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7efde2695c4f in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
worker02: frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x201 (0x7efde2699901 in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
worker02: frame #10: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x40a (0x7efde26a5a5a in /usr/local/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
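The key line is the RuntimeError: rank 3 on worker02 timed out in store->get('0') while waiting for the ncclUniqueId that rank 0 publishes through the c10d TCP store. That usually means the worker node cannot reach the rendezvous port on the master node, or NCCL is bootstrapping over an interface that does not route between the machines. A hedged sanity check to run from worker02 while the job is starting on the master, reusing the master_addr and master_port from the launch command (nc is assumed to be installed):
# Basic reachability of the master node.
ping -c 3 10.197.226.129
# Is the rendezvous port open and not blocked by a firewall?
nc -zv 10.197.226.129 9993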
System Info
No response
Others
No response