CUDA out of memory error (only after first epoch) #80
As a follow-up, the full … and the output after: …

Incidentally, I also noticed that it throws the out of memory error on GPU 3, where there is also an …

UPDATE: It may actually be a problem! I re-ran the training algo but with only 3 GPUs used (ignoring the one that has …)
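For reference, restricting a run to three of the four GPUs is typically done by hiding one device before CUDA is initialized, e.g. via CUDA_VISIBLE_DEVICES; the device IDs below are an illustrative assumption, not the ones actually used in this run:

```python
import os

# Hide one GPU from the process; this must be set before torch touches CUDA.
# "0,1,2" is an example - keep whichever three devices are actually healthy.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"

import torch

print(torch.cuda.device_count())  # now reports 3 visible devices
```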
Following up from #79: it no longer gets stuck on evaluation (yay), but it now reports a CUDA out of memory error after running the first epoch:
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 10.73 GiB total capacity; 6.34 GiB already allocated; 25.62 MiB free; 6.44 GiB reserved in total by PyTorch)
Interestingly, this only happens after the first epoch, and after I load from the checkpoint of epoch 0.
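One detail that can matter when resuming (an assumption about a common pattern, not something confirmed from this project's code) is that torch.load of a checkpoint saved from GPU tensors will first materialize those tensors on the device they were saved from, temporarily holding a second copy next to the already-initialized model. Loading onto the CPU first avoids that; a minimal sketch with placeholder names (the file name and state-dict keys are illustrative):

```python
import torch
import torch.nn as nn

# Placeholder model/optimizer standing in for the project's real ones.
model = nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# map_location="cpu" keeps torch.load from re-allocating the saved tensors
# on the GPU they were originally serialized from.
checkpoint = torch.load("checkpoint_epoch0.pth", map_location="cpu")
model.load_state_dict(checkpoint["model"])          # key names are assumed
optimizer.load_state_dict(checkpoint["optimizer"])  # key names are assumed

model.cuda()  # move the weights to the GPU once, after loading
```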
Things already done:

- Made sure no other processes are running (checked with nvidia-smi), so that when the program starts, all 4 x 11 GB of GPU memory is free.
- Called torch.cuda.empty_cache() before training, which doesn't seem to help (see the sketch below).
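For concreteness, the cache-clearing attempt looks roughly like the sketch below; this is a minimal stand-in with a placeholder model and loop, not the project's actual training code:

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer; the real training loop lives in the project.
model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):
    # Release cached, unused blocks back to the driver before each epoch.
    torch.cuda.empty_cache()

    for _ in range(10):
        x = torch.randn(64, 512, device="cuda")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # empty_cache() only frees cached, unreferenced blocks; tensors that are
    # still alive (activations, loaded checkpoints, etc.) stay allocated,
    # which is why it often does not help with a genuine OOM.
    print(f"epoch {epoch}: "
          f"allocated={torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
          f"reserved={torch.cuda.memory_reserved() / 2**20:.1f} MiB")
```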
Thank you! If it helps, the full stack trace is below: