RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101) #319
Comments
Do you have a minimal code sample that reproduces the error? Also, what is your environment (which PyTorch version, which CUDA version)?
I use Apex to train BERT, and it produces an error in compile.
What optimizer are you using? Also, how are you initializing Amp?
I use the BertAdam optimizer and initialize Amp.
Are you using BertAdam from here? Also, what value are you using for opt_level? We've actually got some people right now working on optimizing BERT specifically. I'll let you know if we encounter anything similar.
I also encountered a similar error. I specified the default GPU for each process with torch.cuda.set_device(), and I was able to avoid this error.
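A minimal sketch of what that per-process device selection can look like, assuming a multi-process launcher that passes the rank in a LOCAL_RANK environment variable (the launcher and variable name are assumptions here, not from the original comment):

```python
import os

import torch

# Choose this process's GPU before any CUDA tensors are created, so that
# implicitly allocated tensors (e.g. Amp's internal buffers) land on the
# same device as the model.
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
```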
I haven't used Apex/AMP before, so maybe there is some user error here. That said, I also seem to get an error when using a device other than the default device. The code at the end gives me the error for the opt_levels I tried.
In scaler.py there is a line, self._overflow_buf = torch.cuda.IntTensor([0]), which initializes that variable on the default CUDA device. If the model is on another device, we then encounter the error "CUDA error: an illegal memory access was encountered".
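A small illustration of that point: torch.cuda.IntTensor([0]) is allocated on the current default CUDA device, so it only ends up next to the model if the default device is changed first. This sketch needs at least two GPUs to run as shown; the device indices are just for illustration:

```python
import torch

buf = torch.cuda.IntTensor([0])
print(buf.device)           # cuda:0 -- whatever the current default device is

torch.cuda.set_device(1)    # make cuda:1 the default device (needs >= 2 GPUs)
buf = torch.cuda.IntTensor([0])
print(buf.device)           # cuda:1 -- now matches a model placed on cuda:1
```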
@ReactiveCJ is probably right about the source of the error. However, in general, when using multiple GPUs or manually trying to use a GPU other than the default, it's definitely best practice to call torch.cuda.set_device before you construct your model or call amp.initialize. Calling .to manually on your model is error-prone and might not catch everything (even if you aren't using Amp).
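A hedged sketch of that recommended ordering; MyModel, the optimizer choice, and the opt_level are placeholders, not anything prescribed in the thread:

```python
import torch
from apex import amp

torch.cuda.set_device(1)                  # pick the GPU before building anything
device = torch.device("cuda", 1)

model = MyModel().to(device)              # MyModel is a hypothetical nn.Module
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# amp.initialize runs after set_device, so its internal buffers (such as the
# overflow buffer mentioned above) are created on the same device as the model.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
```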
I encountered this problem myself as well.
The error occurs randomly, not at a particular epoch: THCudaCheck FAIL file=C:\w\1\s\tmp_conda_3.7_104508\conda\conda-bld\pytorch_1572950778684\work\aten\src\THC/generic/THCStorage.cpp line=39. Sometimes I get an error like this, occurring randomly, with a "Traceback (most recent call last):". I'm not sure whether this is a PyTorch bug or a bug in my code.
Yep, same problem. device = torch.device('cuda:0') works OK; device = torch.device('cuda:1') fails when calling scaled_loss.backward(). Fixed by a call to torch.cuda.set_device(torch.device('cuda:1')). I'm guessing somewhere in your code there are two references being kept to different devices. It can also be fixed by running opt-level O0, so I guess that means it's likely not my code.
You might be swapping memory on the CPU or other GPUs; rebooting CUDA or the computer might solve the problem.
I also encountered this error.
I was running someone else's code. The error always appears when I create a local variable such as t = torch.zeros(sizeoftensor).cuda(). Is it about insufficient memory? It happens after a certain number of iterations, not at the beginning.
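One way to reduce the chance of a device mismatch when training on a non-default GPU is to create such temporaries with an explicit device instead of a bare .cuda() call. A small sketch; the device index and sizeoftensor shape are hypothetical:

```python
import torch

device = torch.device("cuda", 1)    # hypothetical: whichever GPU the model lives on
sizeoftensor = (1024,)              # hypothetical shape

# A bare .cuda() uses the current default device; passing device= pins the
# tensor to the same GPU as the model and avoids cross-device accesses.
t = torch.zeros(sizeoftensor, device=device)
```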
Seeing this also while running pix2pixHD on two GPUs (with the --fp16 argument).
With that setting added, pix2pixHD doesn't crash anymore... but it just locks up one of the GPUs at 100% doing something other than training.
@tripzero Same problem here. Have you found any other solution? Thanks!
@dekura No dice. Tried 1 GPU and 2 GPUs. Tried changing the optimization level to O2. :( I can't even reproduce the 100% GPU result I was seeing earlier, just illegal memory access errors.
I encountered this issue myself. I did not see the error on opt_level 'O0' but did on opt_level 'O1'. Per the suggestion of @tatsuhiko-inoue, I can use O1 on GPU 1, then train as usual, replacing loss.backward with the scaled-loss backward (see the sketch below).
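For reference, the usual Apex pattern that replaces a plain loss.backward() is the amp.scale_loss context manager. This is a sketch rather than the commenter's exact code; model, optimizer, and loss are placeholders, and amp.initialize is assumed to have been called already:

```python
from apex import amp

# assumed to have run earlier:
# model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# scale the loss so fp16 gradients don't underflow, then backpropagate
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```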
@hadypranoto I encountered the same problem. Have you figured out why and how to solve it? Thanks!
@JianYang93 @matlabninja @tripzero
@ll0iecas Sorry, I am in no way an expert on this, and I encountered this error somewhere other than this particular package. FYI, my problem was caused by too large a batch size.
@ll0iecas Did you explicitly set your device? torch.cuda.set_device(device)
I did, but nothing worked.
Hello, I also got this error, and I have no idea how to fix it. I explicitly set the device but it doesn't work.
What do you mean by specifying the GPU for each process? Do you call torch.cuda.set_device() after each new variable is created?
Hello, I also met this error. Did you solve it?
For anyone here encountering a fault, are any of your input tensors to the multi-tensor apply 0-sized, i.e. numel() == 0?
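A quick diagnostic sketch for answering that question, checking whether any parameter or gradient is zero-sized; model here is a hypothetical, already-constructed module:

```python
# Run this right before the backward pass / unscale step to spot empty tensors.
for name, p in model.named_parameters():
    if p.numel() == 0:
        print(f"zero-sized parameter: {name}")
    if p.grad is not None and p.grad.numel() == 0:
        print(f"zero-sized gradient: {name}")
```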
File "../ptx/fit_extension.py", line 386, in _train_epoch scaled_loss.backward() File "/home/suiguobin/anaconda3/lib/python3.6/contextlib.py", line 88, in __exit__ next(self.gen) File "../../apex/apex/amp/handle.py", line 125, in scale_loss optimizer._post_amp_backward(loss_scaler) File "../../apex/apex/amp/_process_optimizer.py", line 123, in post_backward_with_master_weights models_are_masters=False) File "../../apex/apex/amp/scaler.py", line 113, in unscale 1./scale) File "../../apex/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__ *args) RuntimeError: CUDA error: an illegal memory access was encountered (multi_tensor_apply at csrc/multi_tensor_apply.cuh:101) frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f17e2ce2021 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so) frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f17e2ce18ea in /home/suiguobin/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so) frame #2: void multi_tensor_apply<2, ScaleFunctor<c10::Half, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<c10::Half, float>, float) + 0x1805 (0x7f17db4c3a75 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #3: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x15a8 (0x7f17db4b8748 in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #4: <unknown function> + 0x1784f (0x7f17db4b684f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) frame #5: <unknown function> + 0x14e4f (0x7f17db4b3e4f in /home/suiguobin/anaconda3/lib/python3.6/site-packages/apex-0.1-py3.6-linux-x86_64.egg/amp_C.cpython-36m-x86_64-linux-gnu.so) <omitting python frames> frame #54: __libc_start_main + 0xf5 (0x7f1824cc3b45 in /lib/x86_64-linux-gnu/libc.so.6)
I use a single card to run Amp, and it produced the above error.
However, when I use more than one card to train, it doesn't produce any error.