RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101) #319

Open · nlp520 opened this issue May 20, 2019 · 32 comments

nlp520 commented May 20, 2019

File "../ptx/fit_extension.py", line 386, in _train_epoch scaled_loss.backward() File "/home/suiguobin/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__ next(self.gen) File "../../apex/apex/amp/handle.py", line 125, in scale_loss optimizer._post_amp_backward(loss_scaler) File "../../apex/apex/amp/_process_optimizer.py", line 123, in post_backward_with_master_weights models_are_masters=False) File "../../apex/apex/amp/scaler.py", line 113, in unscale 1./scale) File "../../apex/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__ *args) RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101) frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f17e2ce2021 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so) frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f17e2ce18ea in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so) frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x1805 (0x7f17db4c3a75 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x15a8 (0x7f17db4b8748 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #4: <unknown function> + 0x1784f (0x7f17db4b684f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #5: <unknown function> + 0x14e4f (0x7f17db4b3e4f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) <omitting python frames> frame #54: __libc_start_main + 0xf5 (0x7f1824cc3b45 in /lib/x86_64-linux-gnu/libc.so.6)

I ran Amp on a single card and it produced the above error.
However, when I train with more than one card, it doesn't produce any error.

@mcarilli (Contributor)

Do you have a minimal code sample that reproduces the error? Also, what is your environment (which pytorch version, which cuda version)?

nlp520 commented May 21, 2019

compile:
torch.__version__ = 1.1.0
Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
from /usr/local/cuda/bin
Pytorch binaries were compiled with Cuda 10.0.130

nlp520 commented May 21, 2019

I use Apex to train BERT and it produces the error in:

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

@mcarilli (Contributor)

What optimizer are you using? Also, how are you initializing Amp?

nlp520 commented May 24, 2019

I use the BertAdam optimizer and initialize Amp with:
self.model, self.optimizer = amp.initialize(self.model, self.optimizer, opt_level=opt_level)

mcarilli commented May 24, 2019

Are you using BertAdam from here? Also what value are you using for opt_level?

We've actually got some people right now working on optimizing BERT specifically. I'll let you know if we encounter anything similar.

@mcarilli mcarilli added the BERT label May 24, 2019
@tatsuhiko-inoue

I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.
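
For reference, a minimal sketch of that per-process pinning (assuming one process per GPU launched with something like torch.distributed.launch, which passes a --local_rank argument; the names here are illustrative, not Apex's own code):

import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

# Pin this process to its GPU before building the model, the optimizer,
# or calling amp.initialize, so every CUDA tensor defaults to that device.
torch.cuda.set_device(args.local_rank)
device = torch.device('cuda', args.local_rank)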

svedi commented Jun 2, 2019

I haven't used Apex/Amp before, so maybe there is some user error here. That said, I also seem to get an error when using a device other than the default device. The code at the end gives me:

RuntimeError: CUDA error: an illegal memory access was encountered

for opt_levels O1 and O2. In particular, I do not seem to get an error for opt_level O3.

Version information:

  • Apex commit: 8be5b6bedead620db636516d064db39f82052e01 (latest commit when I installed it)
  • torch.version.git_version = '20607a99a31ec5405ca6aa92bc7e7bf768b7bc43' (just installed latest stable using official instructions this morning)
  • Nvidia driver: 430.14
  • Running this in docker container based on: nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04 (e25e57dde9ade23a377536df339be4d8410a7a7bcddb1e96b0e2db63ac088ed4)
import torch
import torchvision

from apex import amp

device = "cuda:1"
wantIllegalAccessException = True

if __name__ == '__main__':
  if not wantIllegalAccessException:
    torch.cuda.set_device(device)

  model = torchvision.models.resnet34().to(device)
  optimizer = torch.optim.Adam(model.parameters(), 1e-3)
  criterion = torch.nn.CrossEntropyLoss().to(device)

  model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

  input = torch.randn(2, 3, 224, 224, device=device)
  target = torch.randint(0, 999, [input.shape[0]], device=device)

  output = model(input)
  loss = criterion(output, target)

  optimizer.zero_grad()
  with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
  optimizer.step()

@ReactiveCJ

In scaler.py there is a line of code, self._overflow_buf = torch.cuda.IntTensor([0]), which initializes the variable on the default CUDA device. If the model is on another device, we then encounter the error "CUDA error: an illegal memory access was encountered".
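
To make the mismatch concrete, here is a small hedged illustration (not Apex's actual code path, just the device behaviour of that kind of allocation; it assumes two GPUs are visible):

import torch

# Stand-in for gradients that live on a non-default device.
grads_on_cuda1 = torch.zeros(10, dtype=torch.float16, device='cuda:1')

# A bare torch.cuda.IntTensor([0]) is allocated on the *current* device,
# which is cuda:0 unless torch.cuda.set_device was called earlier.
overflow_buf = torch.cuda.IntTensor([0])
print(grads_on_cuda1.device, overflow_buf.device)  # cuda:1 vs cuda:0

# When a fused kernel such as multi_tensor_apply later uses this buffer
# together with gradients on cuda:1, the cross-device access can surface
# as "CUDA error: an illegal memory access was encountered".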

@mcarilli (Contributor)

@ReactiveCJ is probably right about the source of the error. However, in general, when using multiple GPUs or manually trying to use a GPU other than the default, it's definitely best practice to call torch.cuda.set_device before you construct your model or call amp.initialize. Calling .to manually on your model is error-prone and might not catch everything (even if you aren't using Amp).
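
Applied to the repro script above, that ordering would look roughly like this (a sketch, not the only valid arrangement):

import torch
import torchvision
from apex import amp

device = torch.device('cuda:1')
# Make cuda:1 the current device *before* constructing the model,
# the optimizer, or calling amp.initialize.
torch.cuda.set_device(device)

model = torchvision.models.resnet34().to(device)
optimizer = torch.optim.Adam(model.parameters(), 1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')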

jzazo commented Oct 9, 2019

I encountered this problem myself as well, where device = torch.device('cuda:0') works, but device = torch.device('cuda:1') does not.

@hadypranoto

The error occurs randomly, not at a specific epoch:

THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_104508\conda\conda-bld\pytorch_1572950778684\work\aten\src\THC/generic/THCStorage.cpp line=39
error=700 : an illegal memory access was encountered
Traceback (most recent call last):
File "c:/Users/hadypranoto/Latihan/CycleGAN-master/CycleGAN_trainV2.py", line 244, in
D_A_loss.backward()
File "C:\Users\hadypranoto\Anaconda3\envs\tensenv\lib\site-packages\torch\tensor.py", line 166, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\Users\hadypranoto\Anaconda3\envs\tensenv\lib\site-packages\torch\autograd_init_.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuda runtime error (700) : an illegal memory access was encountered at C:\w\1\s\tmp_conda_3.7_104508\conda\conda-bld\pytorch_1572950778684\work\aten\src\THC/generic/THCStorage.cpp:39

Sometimes I get an error like the following, also occurring randomly:

Traceback (most recent call last):
File "c:/Users/hadypranoto/Latihan/CycleGAN-master/CycleGAN_trainV2.py", line 225, in
G_B_loss = MSE_loss(D_A_fake_decision, Variable(torch.ones(D_A_fake_decision.size()).cuda(0)))
RuntimeError: CUDA error: an illegal memory access was encountered

I am not sure whether this is a PyTorch bug or a bug in my code.

DuaneNielsen commented Dec 23, 2019

Yep, same problem.

device = torch.device('cuda:0') works OK

device = torch.device('cuda:1') fails when calling scaled_loss.backward()

Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))

I'm guessing somewhere in your code, there are 2 references being kept to different devices.

Can also be fixed by running opt-level O0, so I guess that means it's likely not my code.

@middle-chunjie

Memory might be swapped in the CPU or other GPUs; resetting CUDA or rebooting the computer might solve the problem.

Aria-K-Alethia commented Feb 5, 2020

I also encountered this error.
I think it may be because I used multiple GPUs: one module of my model is placed on another GPU, and I transfer my data to that GPU manually with code like p = p.to('cuda:1').
When I delete the Amp code, the problem goes away. It seems Apex does not support such a setting well.

@ZhangMingHui123

I also encountered this error.
I think it may be because I used multiple GPUs: one module of my model is placed on another GPU, and I transfer my data to that GPU manually with code like p = p.to('cuda:1').
When I delete the Amp code, the problem goes away. It seems Apex does not support such a setting well.

@mcarilli @nlp520
I have the same problem. Does Apex not support PyTorch's model parallelism (https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html), as opposed to DataParallel?

@hadypranoto

Yep, same problem.

device = torch.device('cuda:0') works OK

device = torch.device('cuda:1') fails when calling scaled_loss.backward()

Fixed by a call to torch.cuda.set_device(torch.device('cuda:1'))

I'm guessing somewhere in your code, there are 2 references being kept to different devices.

Can also be fixed by running opt-level O0, so I guess that means it's likely not my code.

I was running someone else's code. The error always appears when I create a local variable such as

t = torch.zeros(sizeoftensor).cuda()

Is it about insufficient memory? It happens after a certain number of iterations, not at the beginning.

@tripzero

Seeing this also while running pix2pixHD on two GPUs (with the --fp16 argument).

@MittalShruti

setting torch.backends.cudnn.benchmark = False resolves the error for me

@tripzero

setting torch.backends.cudnn.benchmark = False resolves the error for me

Well, pix2pixHD doesn't crash anymore with this added... but it just locks up one of the GPUs at 100% doing something other than training.

dekura commented Mar 24, 2020

setting torch.backends.cudnn.benchmark = False resolves the error for me

Well, pix2pixHD doesn't crash anymore with this added... but it just locks up one of the GPUs at 100% doing something other than training.

@tripzero Same problem here. Have you found any other solution? Thanks!

@tripzero

@dekura no dice. Tried 1 GPU and 2 GPUs. Tried changing optimization level to O2. :(. I can't even reproduce the 100% GPU result I was seeing earlier. Just Illegal Memory Access errors.

matlabninja commented May 18, 2020

I encountered this issue myself. Did not see error on opt_level 'O0' but did see on opt_level 'O1'. Per the suggestion of @tatsuhiko-inoue, I can use O1 on GPU 1 with the following:
torch.cuda.set_device(1)
device = torch.device('cuda:1')
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
optLv = 'O1'
net.to(device)
net, optimizer = amp.initialize(net, optimizer, opt_level=optLv)

Then train as usual, replacing loss.backward with
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

@JianYang93

@hadypranoto I encountered the same problem. Have you figured out why and how to solve it? Thanks!

ll0ruc commented Jun 8, 2020

@JianYang93 @matlabninja @tripzero
Traceback (most recent call last):
File "train.py", line 104, in
train(model, train_iter, optimizer, criterion)
File "train.py", line 28, in train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/init.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (embedding_dense_backward_cuda at /pytorch/aten/src/ATen/native/cuda/Embedding.cu:267)
I also came across this problem and hope to get help.

@JianYang93

@ll0iecas Sorry, I am in no way an expert on this, and I encountered this error not in this particular package. FYI, my problem was caused by too large a batch size.

BramVanroy commented Jun 8, 2020

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

ll0ruc commented Jun 10, 2020

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

I did, but nothing worked

LeMei commented Aug 20, 2020

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

I did, but nothing worked

Hello, I also got this error and I have no idea how to fix it. I explicitly set the device but it doesn't work.

@cherepashkin0

I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.

What do you mean by specifying the GPU for each process? Do you call torch.cuda.set_device() after each new variable is created?

@ZhangJT0127

@ll0iecas Did you explicitly set your device?

torch.cuda.set_device(device)

I did, but nothing worked

Hello, I also get this error. Did you solve it?

@jeffdaily

For anyone here encountering a fault, are any of your input tensors to the multi tensor apply 0-sized, i.e. numel() == 0?
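
If anyone wants to check that quickly, here is a rough diagnostic sketch (assuming model is whatever you passed to amp.initialize); run it around the backward pass:

# List any empty parameters or gradients that would be handed to multi_tensor_apply.
for name, p in model.named_parameters():
    if p.numel() == 0:
        print('zero-sized parameter:', name)
    if p.grad is not None and p.grad.numel() == 0:
        print('zero-sized gradient:', name)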
