Illegal memory access (cudaErrorIllegalAddress) #5002
Reducing the batch size further makes this error go away, but then a lot of GPU memory is left unused. If memory demand were the problem, an out-of-memory error should be raised instead.
I've seen this error, and think it happens right before
I don't think it's an Apex issue either, because I ran my code without fp16 integration earlier; it's most likely a PyTorch issue. I'm not sure how memory usage can spike in such a short time: initially 10 GiB of GPU memory is free, and then suddenly this error pops up. Halving the batch size helped, and there are no signs of a memory leak. Not really sure what's happening.
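The comments above wonder how memory usage can spike so quickly right before the failure. A minimal sketch of a helper that logs the CUDA allocator state around each step, so the jump can be seen directly (this is an illustrative assumption, not code from the issue; it uses PyTorch's `torch.cuda` API and is a no-op on CPU-only machines):

```python
import torch

def log_cuda_memory(tag: str) -> None:
    # Hypothetical helper: print allocator stats so a sudden jump in usage
    # right before the crash becomes visible. Does nothing without CUDA.
    if not torch.cuda.is_available():
        return
    mib = 2 ** 20
    print(f"{tag}: allocated={torch.cuda.memory_allocated() / mib:.0f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / mib:.0f} MiB, "
          f"peak={torch.cuda.max_memory_allocated() / mib:.0f} MiB")
```

Calling this before and after the gradient computation on each step would show whether allocated memory actually approaches the device limit before the error is raised.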
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🐛 Bug
Information
This bug has already been discussed on PyTorch/Apex, and also here (a bot marked that issue as stale).
I'm using ALBERT on GLUE, although the issue is model- and dataset-agnostic.
I've made slight modifications to my train loop (compared to `train()` in `Trainer()`). The main change, and the one that throws this error, is where I compute the gradients, `loss` being simply `model(**inputs)[0]`.
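The issue doesn't include the exact snippet, so here is a minimal sketch of what such a gradient step typically looks like; the `loss.backward()` call, the optimizer, and the function name are assumptions, not the reporter's actual code:

```python
import torch

def train_step(model, inputs, optimizer):
    # Hypothetical reconstruction of the modified loop step. The illegal-address
    # error reportedly surfaces when gradients are computed; note that CUDA
    # errors are raised asynchronously, so the failing kernel may actually be
    # an earlier op, not backward() itself.
    optimizer.zero_grad()
    loss = model(**inputs)[0]  # first element of the model output, as in the issue
    loss.backward()            # assumed gradient call where the error appears
    optimizer.step()
    return loss.item()
```

Because of the asynchronous error reporting noted in the comments, the line blamed in the traceback is not necessarily the line that corrupted memory.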
I'm using PyTorch 1.5.0+cu101 and transformers 2.11 on one GPU, no multi-GPU (the instance has 2 GPUs, but I restrict to one via `CUDA_VISIBLE_DEVICES=0`). I also tried `torch.cuda.set_device()`. Can you suggest a workaround?