Single-node, multi-GPU A100 full fine-tuning: CUDA error: an illegal memory access was encountered #267

Closed
DBtxy opened this issue Jul 27, 2023 · 6 comments
Labels
invalid This doesn't seem right

Comments

@DBtxy

DBtxy commented Jul 27, 2023

Environment:
python 3.9
torch 2.0 (cu117)
deepspeed 0.9.5

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x56 (0x7eff2a3fb946 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x78 (0x7eff2a3f7af8 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x354 (0x7eff2a6901b4 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x87 (0x7eff2ba52767 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7eff2ba54500 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x132 (0x7eff2ba540f2 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xceed0 (0x7eff83801ed0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x7ea7 (0x7effa1fe0ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x3f (0x7effa1d63a2f in /lib/x86_64-linux-gnu/libc.so.6)

Traceback (most recent call last):
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/train_bash.py", line 22, in
main()
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/train_bash.py", line 9, in main
run_sft(model_args, data_args, training_args, finetuning_args)
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/llmtuner/tuner/sft/workflow.py", line 61, in run_sft
train_result = trainer.train()
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1198, in prepare
result = self._prepare_deepspeed(*args)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 309, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1184, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1419, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 484, in init
self.initialize_optimizer_states()
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 619, in initialize_optimizer_states
self.optimizer.step()
File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py", line 33, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 171, in step
adamw(
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 321, in adamw
func(
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 502, in _multi_tensor_adamw
torch._foreach_mul_(device_exp_avg_sqs, beta2)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2417 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2415) of binary: /usr/bin/python3
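
Because the illegal access is reported asynchronously, the Python frame shown above (the AdamW step inside DeepSpeed's ZeRO stage-1/2 optimizer initialization) is not necessarily where the fault occurred; the later tracebacks in this thread suggest rerunning with CUDA_LAUNCH_BLOCKING=1 so the error surfaces at the offending call. A minimal sketch, assuming the variable can be set at the top of the entry script before torch initializes CUDA:

# Sketch only: force synchronous CUDA kernel launches so the illegal-access
# error is raised at the real call site rather than at a later API call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # import torch only after the variable is in place

Equivalently, CUDA_LAUNCH_BLOCKING=1 can be exported in the shell before launching the deepspeed/torchrun command so every rank inherits it.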

@hiyouga hiyouga added the pending This problem is yet to be addressed label Jul 27, 2023
@hiyouga hiyouga added invalid This doesn't seem right and removed pending This problem is yet to be addressed labels Aug 11, 2023
@hiyouga hiyouga closed this as completed Aug 11, 2023
@YananSunn

Has this been resolved? On two 8-GPU nodes, I ran a 2-node, 4-GPU experiment (Qwen, full fine-tuning) by specifying the devices and modifying deepspeed.json, and hit a similar error.
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1847, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/deepspeed.py", line 176, in backward
self.engine.step()
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 2087, in step
self._take_model_step(lr_kwargs)
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 1994, in _take_model_step
self.optimizer.step()
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1735, in step
self._optimizer_step(i)
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1643, in _optimizer_step
self.optimizer.step()
File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/optim/optimizer.py", line 270, in step
torch.cuda.synchronize()
File "/usr/local/lib/python3.9/dist-packages/torch/cuda/init.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff968140457 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7ff96810a3ec in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7ff9931aec64 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e0dc (0x7ff9931860dc in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7ff993189054 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d8193 (0x7ff9be075193 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7ff9681209e0 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7ff968120af9 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #8: + 0x736508 (0x7ff9be2d3508 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a5 (0x7ff9be2d37f5 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x624300]
frame #11: /usr/bin/python() [0x56d848]
frame #12: /usr/bin/python() [0x56d8a6]
frame #13: /usr/bin/python() [0x56d8a6]
frame #14: /usr/bin/python() [0x56d8a6]
frame #15: /usr/bin/python() [0x56d8a6]
frame #16: /usr/bin/python() [0x56d8a6]
frame #17: PyDict_SetItemString + 0x531 (0x6039a1 in /usr/bin/python)
frame #18: /usr/bin/python() [0x6c0e2e]
frame #19: Py_FinalizeEx + 0x183 (0x6bab43 in /usr/bin/python)
frame #20: Py_RunMain + 0x16e (0x6f4c3e in /usr/bin/python)
frame #21: Py_BytesMain + 0x2d (0x6f506d in /usr/bin/python)
frame #22: __libc_start_main + 0xf3 (0x7ff9dfca0083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x2e (0x630f0e in /usr/bin/python)

@liangxiaonasummer

Hi, has this issue been resolved? I ran into the same problem.

@jiaohuix

Has this been resolved?

@plutoda588

plutoda588 commented Oct 23, 2023

Same here. I also get a similar error on a single node with 2 GPUs (3 GPUs give a similar error):
Traceback (most recent call last):
File "/T106/model_main/LLaMA-Factory-main/LLaMA-Factory-main/src/train_bash.py", line 14, in
main()
File "/T106/model_main/LLaMA-Factory-main/LLaMA-Factory-main/src/train_bash.py", line 5, in main
run_exp()
File "/T106/model_main/LLaMA-Factory-main/LLaMA-Factory-main/src/llmtuner/tuner/tune.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/T106/model_main/LLaMA-Factory-main/LLaMA-Factory-main/src/llmtuner/tuner/sft/workflow.py", line 67, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/opt/conda/envs/wt/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in train
return inner_training_loop(
File "/opt/conda/envs/wt/lib/python3.10/site-packages/transformers/trainer.py", line 1772, in _inner_training_loop
tr_loss = torch.tensor(0.0).to(args.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
However, it runs fine with four GPUs.
@hiyouga could you please advise?

@jiaohuix

I solved it: it was simply insufficient GPU memory. With 160 GB of GPU memory and ZeRO stage 3, this error no longer appears.
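
For reference, a ZeRO stage-3 config can be written out as in the sketch below; the file name and the "auto" placeholders (resolved by the HF Trainer integration) are illustrative assumptions, not the exact settings used in this thread.

# Sketch: emit a minimal ZeRO stage-3 DeepSpeed config; ds_z3_config.json is a
# hypothetical file name and the values are illustrative, not the poster's config.
import json

ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,  # shard optimizer states, gradients, and parameters across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_z3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

Unlike the stage-1/2 optimizer seen in the tracebacks above (stage_1_and_2.py), stage 3 also shards the model parameters, which reduces per-GPU memory pressure and, per this comment, avoids the error when total memory is tight.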

@plutoda588

OK, thanks. I also suspected insufficient GPU memory, but wasn't sure.
