
memory leak in an environment like notebook #879

Closed
stas00 opened this issue Mar 19, 2021 · 0 comments · Fixed by #896
stas00 commented Mar 19, 2021

In the HF/DS tests I use a notebook-like environment for some of the tests: I don't fork a new process for each deepspeed run, but instead emulate the distributed env inside the current process and run deepspeed repeatedly. This makes it much easier and faster to test the values of the weights.

There must be either some global variable that holds onto memory or a circular reference somewhere, since memory doesn't get released after deepspeed has done its work, even when I explicitly delete the engine/scheduler/optimizer variables.

I tested that if I remove deepspeed from the equation there is no leak.

Now, I have a few dozen of these tests, and each deepspeed invocation uses some 10GB of extra RAM, so it quickly grows to 100GB+.
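To quantify the per-invocation growth described above without any external tools, a stdlib-only sketch using tracemalloc can diff Python-level allocations before and after each run (the run_once function below is a stand-in for one trainer/deepspeed run, with a deliberate "leak" into a global; note this only catches Python objects, not CUDA/C++ allocations):

```python
import tracemalloc

tracemalloc.start()

def run_once():
    # stand-in for one deepspeed/trainer run: allocates a big object
    return [0.0] * 100_000

leaked = []  # a global that keeps each run's result alive, simulating the leak

before = tracemalloc.take_snapshot()
leaked.append(run_once())
after = tracemalloc.take_snapshot()

# the top entries show which source lines grew the most between snapshots
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```

Running the real test suite this way, the snapshot diff should point at the source lines where the retained objects are allocated.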

This is with a test model with two single-value weights (y = wx + b), i.e. the model's memory footprint is ~0.

Any idea where the objects might be held and not destroyed? If I re-create the trainer in a loop, the old one should get its memory freed up.

So this is an example of a cell in a jupyter notebook:

if trainer.deepspeed:
    print("reloading")
    trainer.deepspeed = None
    trainer.optimizer = None
    trainer.lr_scheduler = None
trainer = get_regression_trainer(output_dir=output_dir, deepspeed=ds_config_dict, skip_memory_metrics=True)
trainer.train()

Of course, the explicit Nones shouldn't be needed, as they would get overwritten by the new trainer; I was just doing a sanity check.

So there must be a circular reference where two or more internal variables refer to each other, and thus the memory doesn't get freed.
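For illustration, here is a hypothetical minimal cycle (these are not the actual DeepSpeed classes): each object holds a reference to the other, so plain refcounting can never free them; only the cyclic garbage collector can. A quick diagnostic in the notebook is to call gc.collect() after deleting the trainer: if memory drops, it was a cycle; if not, something (e.g. a module-level global) still holds a live reference, and gc.get_referrers() can show who.

```python
import gc

class Engine:
    def __init__(self):
        self.optimizer = Optimizer(self)

class Optimizer:
    def __init__(self, engine):
        self.engine = engine  # back-reference closes the cycle

e = Engine()
del e                  # refcount inside the cycle never hits zero
freed = gc.collect()   # but the cyclic collector does reclaim it
print("unreachable objects collected:", freed)
```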

I emulate a dist env with just:

dist_env_1_gpu = dict(
    MASTER_ADDR="localhost", MASTER_PORT="10999", RANK="0", LOCAL_RANK="0", WORLD_SIZE="1"
)
for k, v in dist_env_1_gpu.items():
    os.environ[k] = v
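Since the env vars above are set globally and persist across test cells, a small helper of my own (a sketch, not an API of transformers or deepspeed) wraps the same emulation in a context manager that restores the previous environment afterwards, so one test's fake dist env can't bleed into the next:

```python
import os
from contextlib import contextmanager

@contextmanager
def dist_env_1_gpu(port="10999"):
    # same variables as the snippet above, applied temporarily
    env = dict(MASTER_ADDR="localhost", MASTER_PORT=port,
               RANK="0", LOCAL_RANK="0", WORLD_SIZE="1")
    saved = {k: os.environ.get(k) for k in env}
    os.environ.update(env)
    try:
        yield
    finally:
        # restore (or remove) each variable to its pre-test state
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v
```

Usage: `with dist_env_1_gpu(): trainer.train()`.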

I'd be happy to try to investigate this on my own if you could help me with pointers at potential suspects.

Thank you!

@jeffra, @samyam
