RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101) #319

Open · nlp520 opened this issue May 20, 2019 · 32 comments

nlp520 commented May 20, 2019

File "../ptx/fit_extension.py", line 386, in _train_epoch scaled_loss.backward() File "/home/suiguobin/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__ next(self.gen) File "../../apex/apex/amp/handle.py", line 125, in scale_loss optimizer._post_amp_backward(loss_scaler) File "../../apex/apex/amp/_process_optimizer.py", line 123, in post_backward_with_master_weights models_are_masters=False) File "../../apex/apex/amp/scaler.py", line 113, in unscale 1./scale) File "../../apex/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__ *args) RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101) frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f17e2ce2021 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so) frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f17e2ce18ea in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so) frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x1805 (0x7f17db4c3a75 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x15a8 (0x7f17db4b8748 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #4: <unknown function> + 0x1784f (0x7f17db4b684f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #5: <unknown function> + 0x14e4f (0x7f17db4b3e4f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) <omitting python frames> frame #54: __libc_start_main + 0xf5 (0x7f1824cc3b45 in /lib/x86_64-linux-gnu/libc.so.6)

I ran Amp on a single card and it produced the above error.
However, when I train with more than one card, it doesn't produce any error.

@mcarilli (Contributor)

Do you have a minimal code sample that reproduces the error? Also, what is your environment (which pytorch version, which cuda version)?

nlp520 commented May 21, 2019

compile:
torch.__version__ = 1.1.0
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
from /usr/local/cuda/bin
Pytorch binaries were compiled with Cuda 10.0.130

nlp520 commented May 21, 2019

I use Apex to train BERT and it produces the error in:

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

@mcarilli (Contributor)

What optimizer are you using? Also, how are you initializing Amp?

nlp520 commented May 24, 2019

I use the BertAdam optimizer and initialize Amp with:
self.model, self.optimizer = amp.initialize(self.model, self.optimizer, opt_level=opt_level)

mcarilli commented May 24, 2019

Are you using BertAdam from here? Also what value are you using for opt_level?

We've actually got some people right now working on optimizing BERT specifically. I'll let you know if we encounter anything similar.

@mcarilli mcarilli added the BERT label May 24, 2019
@tatsuhiko-inoue

I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.
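
For reference, a minimal sketch of that per-process pinning (assuming one process per GPU launched with something like torch.distributed.launch, which passes a --local_rank argument; the names here are illustrative, not Apex's own code):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Pin this process to its GPU before building the model, the optimizer,
# or calling amp.initialize, so every CUDA tensor defaults to that device.
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)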

svedi commented Jun 2, 2019

I haven't used Apex/Amp before, so maybe there is some user error here. That said, I also seem to get an error when using a device other than the default device. The code at the end gives me:

RuntimeError: CUDA error: an illegal memory access was encountered

for opt_levels O1 and O2. In particular, I do not seem to get an error for opt_level O3.

Version information:

  • Apex commit: 8be5b6bedead620db636516d064db39f82052e01 (latest commit when I installed it)
  • torch.version.git_version = '20607a99a31ec5405ca6aa92bc7e7bf768b7bc43' (just installed latest stable using official instructions this morning)
  • Nvidia driver: 430.14
  • Running this in docker container based on: nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 (e25e57dde9ade23a377536df339be4d8410a7a7bcddb1e96b0e2db63ac088ed4)
import torch
import torchvision

from apex import amp

device = "cuda:1"
wantIllegalAccessException = True

if __name__ == '__main__':
  if not wantIllegalAccessException:
    torch.cuda.set_device(device)

  model = torchvision.models.resnet34().to(device)
  optimizer = torch.optim.Adam(model.parameters(), 1e-3)
  criterion = torch.nn.CrossEntropyLoss().to(device)

  model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

  input = torch.randn(2, 3, 224, 224, device=device)
  target = torch.randint(0, 999, [input.shape[0]], device=device)

  output = model(input)
  loss = criterion(output, target)

  optimizer.zero_grad()
  with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
  optimizer.step()

@ReactiveCJ

In scaler.py there is a line of code, self._overflow_buf = torch.cuda.IntTensor([0]), which initializes the variable on the default CUDA device. If the model is on another device, we then encounter the error "CUDA error: an illegal memory access was encountered".
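
To make the mismatch concrete, here is a small hedged illustration (not Apex's actual code path, just the device behaviour of that kind of allocation; it assumes two GPUs are visible):

import torch

# Stand-in for gradients that live on a non-default device.
grads_on_cuda1 = torch.zeros(10, dtype=torch.float16, device='cuda:1')

# A bare torch.cuda.IntTensor([0]) is allocated on the *current* device,
# which is cuda:0 unless torch.cuda.set_device was called earlier.
overflow_buf = torch.cuda.IntTensor([0])
print(grads_on_cuda1.device, overflow_buf.device)  # cuda:1 vs cuda:0

# When a fused kernel such as multi_tensor_apply later uses this buffer
# together with gradients on cuda:1, the cross-device access can surface
# as "CUDA error: an illegal memory access was encountered".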

@mcarilli (Contributor)

@ReactiveCJ is probably right about the source of the error. However, in general, when using multiple GPUs or manually trying to use a GPU other than the default, it's definitely best practice to call torch.cuda.set_device before you construct your model or call amp.initialize. Calling .to manually on your model is error-prone and might not catch everything (even if you aren't using Amp).
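
Applied to the repro script above, that ordering would look roughly like this (a sketch, not the only valid arrangement):

import torch
import torchvision
from apex import amp

device = torch.device('cuda:1')
# Make cuda:1 the current device *before* constructing the model,
# the optimizer, or calling amp.initialize.
torch.cuda.set_device(device)

model = torchvision.models.resnet34().to(device)
optimizer = torch.optim.Adam(model.parameters(), 1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')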

jzazo commented Oct 9, 2019

I encountered this problem myself as well, where device = torch.device('cuda:0') works, but device = torch.device('cuda:1') does not.

@hadypranoto

The error occurs randomly, not at a specific epoch:

THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_104508\conda\conda-bld\pytorch_1572950778684\work\aten\src\THC/generic/THCStorage.cpp line=39
error=700 : an illegal memory access was encountered
Traceback (most recent call last):
File "c:/Users/hadypranoto/Latihan/CycleGAN-master/CycleGAN_trainV2.py", line 244, in
D_A_loss.backward()
File "C:\Users\hadypranoto\Anaconda3\envs\tensenv\lib\site-packages\torch\tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\hadypranoto\Anaconda3\envs\tensenv\lib\site-packages\torch\autograd_init_.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at C:\w\1\s\tmp_conda_3.7_104508\conda\conda-bld\pytorch_1572950778684\work\aten\src\THC/generic/THCStorage.cpp:39

Sometimes I get an error like the following, also occurring randomly:

Traceback (most recent call last):
File "c:/Users/hadypranoto/Latihan/CycleGAN-master/CycleGAN_trainV2.py", line 225, in
G_B_loss = MSE_loss(D_A_fake_decision, Variable(torch.ones(D_A_fake_decision.size()).cuda(0)))
RuntimeError: CUDA error: an illegal memory access was encountered

I am not sure whether this is a PyTorch bug or a bug in my code.

DuaneNielsen commented Dec 23, 2019

Yep, same problem.

device = torch.device('cuda:0') works OK

device = torch.device('cuda:1') fails when calling scaled_loss.backward()

Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))

I'm guessing somewhere in your code, there are 2 references being kept to different devices.

Can also be fixed by running opt-level O0, so I guess that means it's likely not my code.

@middle-chunjie

Memory might be swapped in the CPU or other GPUs; resetting CUDA or rebooting the computer might solve the problem.

Aria-K-Alethia commented Feb 5, 2020

I also encountered this error.
I think it may be because I used multiple GPUs: one module of my model is placed on another GPU, and I transfer my data to that GPU manually with code like p = p.to('cuda:1').
When I delete the Amp code, the problem goes away. It seems Apex does not support such a setting well.

@ZhangMingHui123

I also encountered this error.
I think it may be because I used multiple GPUs: one module of my model is placed on another GPU, and I transfer my data to that GPU manually with code like p = p.to('cuda:1').
When I delete the Amp code, the problem goes away. It seems Apex does not support such a setting well.

@mcarilli @nlp520
I have the same problem. Does Apex not support PyTorch's model parallelism (https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html), as opposed to DataParallel?

@hadypranoto

Yep, same problem.

device = torch.device('cuda:0') works OK

device = torch.device('cuda:1') fails when calling scaled_loss.backward()

Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))

I'm guessing somewhere in your code, there are 2 references being kept to different devices.

Can also be fixed by running opt-level O0, so I guess that means it's likely not my code.

I was running someone else's code. The error always appears when I create a local variable such as

t = torch.zeros(sizeoftensor).cuda()

Is it about insufficient memory? It happens after a certain number of iterations, not at the beginning.

@tripzero

Seeing this also while running pix2pixHD on two GPUs (with the --fp16 argument).

@MittalShruti

setting torch.backends.cudnn.benchmark = False resolves the error for me

@tripzero

setting torch.backends.cudnn.benchmark = False resolves the error for me

Well, pix2pixHD doesn't crash anymore with this added... but it just locks up one of the GPUs at 100% doing something other than training.

dekura commented Mar 24, 2020

setting torch.backends.cudnn.benchmark = False resolves the error for me

Well, pix2pixHD doesn't crash anymore with this added... but it just locks up one of the GPUs at 100% doing something other than training.

@tripzero Same problem here. Have you found any other solution? Thanks!

@tripzero

@dekura no dice. Tried 1 GPU and 2 GPUs. Tried changing optimization level to O2. :(. I can't even reproduce the 100% GPU result I was seeing earlier. Just Illegal Memory Access errors.

matlabninja commented May 18, 2020

I encountered this issue myself. Did not see error on opt_level 'O0' but did see on opt_level 'O1'. Per the suggestion of @tatsuhiko-inoue, I can use O1 on GPU 1 with the following:
torch.cuda.set_device(1)
device = torch.device('cuda:1')
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
optLv = 'O1'
net.to(device)
net, optimizer = amp.initialize(net, optimizer, opt_level=optLv)

Then train as usual, replacing loss.backward with
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

@JianYang93

@hadypranoto I encountered the same problem. Have you figured out why and how to solve it? Thanks!

ll0ruc commented Jun 8, 2020

@JianYang93 @matlabninja @tripzero
Traceback (most recent call last):
File "train.py", line 104, in
train(model, train_iter, optimizer, criterion)
File "train.py", line 28, in train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (embedding_dense_backward_cuda at /pytorch/aten/src/ATen/native/cuda/Embedding.cu:267)
I also came across this problem and hope to get help.

@JianYang93

@ll0iecas Sorry, I am in no way an expert on this, and I encountered this error not in this particular package. FYI, my problem was caused by too large a batch size.

BramVanroy commented Jun 8, 2020

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

ll0ruc commented Jun 10, 2020

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

I did, but nothing worked

LeMei commented Aug 20, 2020

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

I did, but nothing worked

Hello, I also got this error and I have no idea how to fix it. I explicitly set the device but it doesn't work.

@cherepashkin0

I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.

What do you mean by specifying the GPU for each process? Do you call torch.cuda.set_device() after each new variable is created?

@ZhangJT0127

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

I did, but nothing worked

Hello, I also get this error. Did you solve it?

@jeffdaily

For anyone here encountering a fault, are any of your input tensors to the multi tensor apply 0-sized, i.e. numel() == 0?
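
If anyone wants to check that quickly, here is a rough diagnostic sketch (assuming model is whatever you passed to amp.initialize); run it around the backward pass:

# List any empty parameters or gradients that would be handed to multi_tensor_apply.
for name, p in model.named_parameters():
    if p.numel() == 0:
        print('zero-sized parameter:', name)
    if p.grad is not None and p.grad.numel() == 0:
        print('zero-sized gradient:', name)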
