Single-node, multi-GPU A100 full fine-tuning: CUDA error: an illegal memory access was encountered #267

Closed
DBtxy opened this issue Jul 27, 2023 · 6 comments
Labels
invalid This doesn't seem right

Comments

@DBtxy

DBtxy commented Jul 27, 2023

Environment:
python 3.9
torch 2.0 (cu117)
deepspeed 0.9.5

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x56 (0x7eff2a3fb946 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x78 (0x7eff2a3f7af8 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x354 (0x7eff2a6901b4 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x87 (0x7eff2ba52767 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x50 (0x7eff2ba54500 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x132 (0x7eff2ba540f2 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xceed0 (0x7eff83801ed0 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x7ea7 (0x7effa1fe0ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #8: clone + 0x3f (0x7effa1d63a2f in /lib/x86_64-linux-gnu/libc.so.6)

Traceback (most recent call last):
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/train_bash.py", line 22, in
main()
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/train_bash.py", line 9, in main
run_sft(model_args, data_args, training_args, finetuning_args)
File "/opt/tiger/bigdecoder/LLaMA-Efficient-Tuning/src/llmtuner/tuner/sft/workflow.py", line 61, in run_sft
train_result = trainer.train()
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/home/tiger/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1198, in prepare
result = self._prepare_deepspeed(*args)
File "/home/tiger/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/init.py", line 165, in initialize
engine = DeepSpeedEngine(args=args,
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 309, in init
self._configure_optimizer(optimizer, model_parameters)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1184, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1419, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 484, in init
self.initialize_optimizer_states()
File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 619, in initialize_optimizer_states
self.optimizer.step()
File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py", line 33, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 171, in step
adamw(
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 321, in adamw
func(
File "/usr/local/lib/python3.9/dist-packages/torch/optim/adamw.py", line 502, in _multi_tensor_adamw
torch._foreach_mul_(device_exp_avg_sqs, beta2)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2417 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2415) of binary: /usr/bin/python3
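
Because the illegal access is reported asynchronously, the Python frame shown above (the AdamW step inside DeepSpeed's ZeRO stage-1/2 optimizer initialization) is not necessarily where the fault occurred; the later tracebacks in this thread suggest rerunning with CUDA_LAUNCH_BLOCKING=1 so the error surfaces at the offending call. A minimal sketch, assuming the variable can be set at the top of the entry script before torch initializes CUDA:

# Sketch only: force synchronous CUDA kernel launches so the illegal-access
# error is raised at the real call site rather than at a later API call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # import torch only after the variable is in place

Equivalently, CUDA_LAUNCH_BLOCKING=1 can be exported in the shell before launching the deepspeed/torchrun command so every rank inherits it.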

@hiyouga hiyouga added the pending This problem is yet to be addressed label Jul 27, 2023
@hiyouga hiyouga added invalid This doesn't seem right and removed pending This problem is yet to be addressed labels Aug 11, 2023
@hiyouga hiyouga closed this as completed Aug 11, 2023
@YananSunn

Has this been resolved? On two 8-GPU nodes, I ran a 2-node, 4-GPU experiment (Qwen, full fine-tuning) by specifying the devices and modifying deepspeed.json, and hit a similar error.
File "/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py", line 1847, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/accelerate/utils/deepspeed.py", line 176, in backward
self.engine.step()
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 2087, in step
self._take_model_step(lr_kwargs)
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 1994, in _take_model_step
self.optimizer.step()
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1735, in step
self._optimizer_step(i)
File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1643, in _optimizer_step
self.optimizer.step()
File "/usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/bitsandbytes/optim/optimizer.py", line 270, in step
torch.cuda.synchronize()
File "/usr/local/lib/python3.9/dist-packages/torch/cuda/init.py", line 566, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff968140457 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7ff96810a3ec in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7ff9931aec64 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e0dc (0x7ff9931860dc in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7ff993189054 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d8193 (0x7ff9be075193 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7ff9681209e0 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7ff968120af9 in /usr/local/lib/python3.9/dist-packages/torch/lib/libc10.so)
frame #8: + 0x736508 (0x7ff9be2d3508 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a5 (0x7ff9be2d37f5 in /usr/local/lib/python3.9/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x624300]
frame #11: /usr/bin/python() [0x56d848]
frame #12: /usr/bin/python() [0x56d8a6]
frame #13: /usr/bin/python() [0x56d8a6]
frame #14: /usr/bin/python() [0x56d8a6]
frame #15: /usr/bin/python() [0x56d8a6]
frame #16: /usr/bin/python() [0x56d8a6]
frame #17: PyDict_SetItemString + 0x531 (0x6039a1 in /usr/bin/python)
frame #18: /usr/bin/python() [0x6c0e2e]
frame #19: Py_FinalizeEx + 0x183 (0x6bab43 in /usr/bin/python)
frame #20: Py_RunMain + 0x16e (0x6f4c3e in /usr/bin/python)
frame #21: Py_BytesMain + 0x2d (0x6f506d in /usr/bin/python)
frame #22: __libc_start_main + 0xf3 (0x7ff9dfca0083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #23: _start + 0x2e (0x630f0e in /usr/bin/python)

@liangxiaonasummer

Hi, has this issue been resolved? I ran into the same problem.

@jiaohuix

Has this been resolved?

@plutoda588

plutoda588 commented Oct 23, 2023

Same here. I also get a similar error on a single node with 2 GPUs (3 GPUs give a similar error):
Traceback (most recent call last):
File "/T106/model_main/LLaMA-Factory-main/LLaMA-Factory-main/src/train_bash.py", line 14, in
main()
File "/T106/model_main/LLaMA-Factory-main/LLaMA-Factory-main/src/train_bash.py", line 5, in main
run_exp()
File "/T106/model_main/LLaMA-Factory-main/LLaMA-Factory-main/src/llmtuner/tuner/tune.py", line 26, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/T106/model_main/LLaMA-Factory-main/LLaMA-Factory-main/src/llmtuner/tuner/sft/workflow.py", line 67, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/opt/conda/envs/wt/lib/python3.10/site-packages/transformers/trainer.py", line 1553, in train
return inner_training_loop(
File "/opt/conda/envs/wt/lib/python3.10/site-packages/transformers/trainer.py", line 1772, in _inner_training_loop
tr_loss = torch.tensor(0.0).to(args.device)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
However, it runs fine with four GPUs.
@hiyouga could you please advise?

@jiaohuix

I solved it: it was simply insufficient GPU memory. With 160 GB of GPU memory and ZeRO stage 3, this error no longer appears.
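
For reference, a ZeRO stage-3 config can be written out as in the sketch below; the file name and the "auto" placeholders (resolved by the HF Trainer integration) are illustrative assumptions, not the exact settings used in this thread.

# Sketch: emit a minimal ZeRO stage-3 DeepSpeed config; ds_z3_config.json is a
# hypothetical file name and the values are illustrative, not the poster's config.
import json

ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,  # shard optimizer states, gradients, and parameters across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_z3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

Unlike the stage-1/2 optimizer seen in the tracebacks above (stage_1_and_2.py), stage 3 also shards the model parameters, which reduces per-GPU memory pressure and, per this comment, avoids the error when total memory is tight.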

@plutoda588

OK, thanks. I also suspected insufficient GPU memory, but wasn't sure.
