In HF/DS tests I use a notebook-like environment for some of the tests, which means I don't fork a new process for each deepspeed run, but instead emulate the distributed env in-process and run deepspeed repeatedly. This makes it much easier/faster to test the values of the weights.
There must be either some global variable that holds onto memory, or a circular reference, since memory doesn't get released after deepspeed has done its work, even though I explicitly deleted the engine/scheduler/optimizer variables.
I tested that if I remove deepspeed from the equation there is no leak.
Now I have a few dozen of those tests, and I get some 10GB of extra RAM used per deepspeed invocation, so it quickly grows to 100GB+.
This is with a test model of just 2 one-value weights, y = wx+b - i.e. the model's memory footprint is ~0.
Any idea where the objects might be held and not destroyed? If I repeat the same re-creation of the trainer in a loop, the old one should get its memory freed up.
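For reference, the growth is easy to watch by printing the process RSS between runs - a sketch using `psutil` (not necessarily what the tests themselves use):

```python
import os
import psutil

def rss_gb():
    """Resident set size of the current process, in GB."""
    return psutil.Process(os.getpid()).memory_info().rss / 2**30

print(f"RSS before: {rss_gb():.2f} GB")
# ... run one deepspeed/Trainer cycle here ...
print(f"RSS after:  {rss_gb():.2f} GB")
```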
So this is an example of a cell in a jupyter notebook:
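(The actual test cell isn't reproduced here - below is a minimal sketch of the pattern, with a stand-in model/dataset and a hypothetical `ds_config.json`; it assumes the emulated dist env shown further down.)

```python
import gc
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments

class TinyModel(torch.nn.Module):
    """Stand-in for the 2-weight test model: y = w*x + b."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1, 1)

    def forward(self, x, labels=None):
        y = self.linear(x)
        out = {"logits": y}
        if labels is not None:
            out["loss"] = torch.nn.functional.mse_loss(y, labels)
        return out

class TinyDataset(Dataset):
    """A handful of (x, y) samples for y = 2x + 1."""
    def __len__(self):
        return 16

    def __getitem__(self, i):
        x = torch.tensor([float(i)])
        return {"x": x, "labels": 2 * x + 1}

args = TrainingArguments(
    output_dir="output_dir",
    deepspeed="ds_config.json",  # hypothetical ZeRO config path
    per_device_train_batch_size=4,
    num_train_epochs=1,
)
trainer = Trainer(model=TinyModel(), args=args, train_dataset=TinyDataset())
trainer.train()
# ... assert on the trained w/b values here ...

# explicitly drop everything that might hold onto the deepspeed engine -
# this shouldn't even be necessary, yet RSS still grows ~10GB per run
trainer.deepspeed = None  # engine reference (attribute name may vary by version)
trainer.model = None
trainer.optimizer = None
trainer.lr_scheduler = None
del trainer
gc.collect()
```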
Of course, the explicit `None`s shouldn't be needed, since they should get overwritten by the new trainer - I was just doing a sanity check. So there must be a circular reference where 2 or more internal variables refer to each other and thus the memory doesn't get freed.
I emulate a dist env with just:
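Something along these lines (the port/values are placeholders; the point is a single-process rank-0, world-size-1 setup):

```python
import os

# pretend to be rank 0 of a world of size 1 so deepspeed/torch.distributed
# can initialize inside the current process without a launcher
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "10999"  # any free port
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
```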
I'd be happy to try to investigate this on my own if you could help me with pointers at potential suspects.
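One way to start hunting for the lingering references, using just the stdlib `gc` module (a sketch; `objgraph` gives a nicer picture if installed):

```python
import gc

# after deleting the trainer, collect and see which deepspeed objects survive,
# then walk their referrers to find what keeps them pinned
gc.collect()
suspects = [
    o for o in gc.get_objects()
    if (getattr(type(o), "__module__", "") or "").startswith("deepspeed")
]
print(f"{len(suspects)} deepspeed objects still alive")
for obj in suspects[:5]:
    # note: the `suspects` list and the current frame show up as referrers too
    referrers = [type(r).__name__ for r in gc.get_referrers(obj)]
    print(type(obj).__name__, "<-", referrers[:5])
```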
Thank you!
@jeffra, @samyam