Removed torch.cuda.empty_cache from train loop. #31530
Conversation
Thanks again for your investigation @FoamoftheSea! LGTM!
Overall I agree with it. If users decide this isn't enough, the next step IMO would be a toggle-able "after n steps, do an empty_cache()" of some sort, to at least delay it and give users control.
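A minimal, self-contained sketch of what such a toggle could look like in a plain training loop (the name `empty_cache_steps` and the toy model are illustrative, not an actual Trainer argument):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy model/optimizer so the sketch runs on its own; the point is the
# cache-clearing cadence, not the training itself.
model = nn.Linear(16, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Hypothetical toggle: flush the CUDA allocator cache only every
# `empty_cache_steps` steps instead of after every step.
empty_cache_steps = 100

for step in range(1_000):
    x = torch.randn(8, 16, device=device)
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Delay the (slow) empty_cache call so its cost amortizes over many steps.
    if device == "cuda" and (step + 1) % empty_cache_steps == 0:
        torch.cuda.empty_cache()
```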
Thanks for the detailed PR description @FoamoftheSea and the fix ❤️
Agreed - let's remove it for now, and if we find that users need it we can think about smarter ways to reintroduce it.
Does it make sense if it's added as a tip? Something like: "This may also cause some behaviour changes where hyperparams/models that worked previously now OOM after the change."
@aliencaocao I'm not sure we necessarily want to actively monitor the memory and trigger a tip (I suspect this is more fiddly and flaky than expected, as you have to balance catching it in time vs. not spamming, making sure the values are correct, etc.). A flag which we can configure for clearing after every n steps seems reasonable. Would you like to open a PR with a proposal and we can iterate from there?
Sure, I can do it.
What does this PR do?
Removes the addition of `torch.cuda.empty_cache()` from the training loop (introduced in #28769).
This line caused the training slowdowns observed in issue #31372.
While this thread in the PyTorch forums recommends against using this function because it is slow, many commenters there still find it necessary to avoid OOMs in their training jobs, so it might be nice to have the option; in the meantime, users can add the call on their own if they're in a jam, as in the sketch below.
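One way a user could add it back themselves is via a `TrainerCallback`; this is a sketch, and the class name `EmptyCacheCallback` and the `every_n_steps` value are illustrative, not part of the library:

```python
import torch
from transformers import TrainerCallback

class EmptyCacheCallback(TrainerCallback):
    """Flush the CUDA allocator cache every `every_n_steps` optimizer steps."""

    def __init__(self, every_n_steps: int = 100):
        self.every_n_steps = every_n_steps

    def on_step_end(self, args, state, control, **kwargs):
        # state.global_step counts completed optimizer steps.
        if torch.cuda.is_available() and state.global_step % self.every_n_steps == 0:
            torch.cuda.empty_cache()

# Usage: pass the callback when constructing the Trainer, e.g.
# trainer = Trainer(model=model, args=training_args, train_dataset=dataset,
#                   callbacks=[EmptyCacheCallback(every_n_steps=100)])
```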
Fixes #31372
@muellerz @SunMarc