Single-node, multi-GPU A100 full-parameter fine-tuning: CUDA error: an illegal memory access was encountered #267
Comments
Has this been resolved? I hit a similar error on two 8-GPU nodes when running a 2-node, 4-GPU experiment (Qwen, full fine-tuning) by specifying devices and modifying deepspeed.json.
Hi, has this been resolved? I'm running into the same problem.
Any solution to this?
Same here: I get a similar error on a single node with 2 GPUs (and with 3 GPUs as well).
I solved it: it was simply insufficient GPU memory. With 160 GB and ZeRO-3 the error no longer appears.
OK, thanks. I suspected insufficient GPU memory but wasn't sure.
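Following up on the resolution above (insufficient memory, fixed by moving to ZeRO stage 3): a minimal DeepSpeed config sketch with ZeRO-3 enabled might look like the one below. The field values are illustrative assumptions, not taken from this issue; merge the zero_optimization block into your existing deepspeed.json rather than replacing the whole file.

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

The "auto" values are only resolved when the config is passed through the HF Trainer integration (the --deepspeed argument); if you call deepspeed.initialize directly, fill in concrete numbers instead.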
Environment
python 3.9
torch 2.0+cu117
deepspeed 0.9.5
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x56 (0x7eff2a3fb946 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x78 (0x7eff2a3f7af8 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x354 (0x7eff2a6901b4 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x87 (0x7eff2ba52767 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7eff2ba54500 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x132 (0x7eff2ba540f2 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xceed0 (0x7eff83801ed0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x7ea7 (0x7effa1fe0ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x3f (0x7effa1d63a2f in /lib/x86_64-linux-gnu/libc.so.6)
Traceback (most recent call last):
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/train_bash.py", line 22, in
main()
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/train_bash.py", line 9, in main
run_sft(model_args, data_args, training_args, finetuning_args)
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/llmtuner/tuner/sft/workflow.py", line 61, in run_sft
train_result = trainer.train()
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1198, in prepare
result = self._prepare_deepspeed(*args)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 309, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1184, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1419, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 484, in init
self.initialize_optimizer_states()
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 619, in initialize_optimizer_states
self.optimizer.step()
File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py", line 33, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 171, in step
adamw(
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 321, in adamw
func(
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 502, in _multi_tensor_adamw
torch._foreach_mul_(device_exp_avg_sqs, beta2)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2417 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2415) of binary: /usr/bin/python3
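For what it's worth, the traceback above fails inside DeepSpeedZeroOptimizer.initialize_optimizer_states(), i.e. while the ZeRO-1/2 optimizer is allocating and stepping its fp32 state partition, which is consistent with the out-of-memory explanation given in the comments. The back-of-envelope sketch below uses the usual ZeRO accounting (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 master weights plus Adam moments per parameter); the ~7B model size and 2-GPU count are assumptions for illustration, not values stated in this issue.

```python
# Rough per-GPU model-state memory under different ZeRO stages.
# Assumptions (not from the issue): ~7B parameters, 2 GPUs, Adam, fp16 training.

def per_gpu_gib(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate model-state memory per GPU in GiB (activations excluded)."""
    params_b = 2 * num_params   # fp16 parameters
    grads_b = 2 * num_params    # fp16 gradients
    optim_b = 12 * num_params   # fp32 master weights + Adam moment estimates

    if zero_stage >= 1:
        optim_b /= num_gpus     # stage 1+: partition optimizer states
    if zero_stage >= 2:
        grads_b /= num_gpus     # stage 2+: also partition gradients
    if zero_stage >= 3:
        params_b /= num_gpus    # stage 3: also partition parameters

    return (params_b + grads_b + optim_b) / 1024**3

if __name__ == "__main__":
    n, gpus = 7e9, 2            # illustrative values only
    for stage in (2, 3):
        print(f"ZeRO-{stage}: ~{per_gpu_gib(n, gpus, stage):.0f} GiB per GPU")
```

With only 2 GPUs the stage-2 to stage-3 saving is just the parameter shard, and the gap widens with more ranks; in both cases the dominant 12-byte fp32 optimizer state is exactly what the failing initialize_optimizer_states() call is trying to materialize, so anything that trims per-GPU model state (more GPUs, ZeRO-3, or CPU offload) reduces the pressure at that point.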