
memory leak in an environment like notebook #879

Closed
stas00 opened this issue Mar 19, 2021 · 0 comments · Fixed by #896
stas00 commented Mar 19, 2021

In the HF/DS tests I use a notebook-like environment for some of the tests: I don't fork a new process for each deepspeed run, but instead emulate the distributed env inside the current process and run deepspeed repeatedly. This makes it much easier and faster to test the values of the weights.

There must be either some global variable that holds onto memory or a circular reference somewhere, since memory doesn't get released after deepspeed has done its work, even when I explicitly delete the engine/scheduler/optimizer variables.

I tested that if I remove deepspeed from the equation there is no leak.

Now, I have a few dozen of these tests, and each deepspeed invocation uses some 10GB of extra RAM, so it quickly grows to 100GB+.
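To quantify the per-invocation growth described above without any external tools, a stdlib-only sketch using tracemalloc can diff Python-level allocations before and after each run (the run_once function below is a stand-in for one trainer/deepspeed run, with a deliberate "leak" into a global; note this only catches Python objects, not CUDA/C++ allocations):

```python
import tracemalloc

tracemalloc.start()

def run_once():
    # stand-in for one deepspeed/trainer run: allocates a big object
    return [0.0] * 100_000

leaked = []  # a global that keeps each run's result alive, simulating the leak

before = tracemalloc.take_snapshot()
leaked.append(run_once())
after = tracemalloc.take_snapshot()

# the top entries show which source lines grew the most between snapshots
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```

Running the real test suite this way, the snapshot diff should point at the source lines where the retained objects are allocated.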

This is with a test model with two single-value weights (y = wx + b), i.e. the model's memory footprint is ~0.

Any idea where the objects might be held and not destroyed? If I re-create the trainer in a loop, the old one should get its memory freed up.

So this is an example of a cell in a jupyter notebook:

if trainer.deepspeed:
    print("reloading")
    trainer.deepspeed = None
    trainer.optimizer = None
    trainer.lr_scheduler = None
trainer = get_regression_trainer(output_dir=output_dir, deepspeed=ds_config_dict, skip_memory_metrics=True)
trainer.train()

Of course, the explicit Nones shouldn't be needed, as they would get overwritten by the new trainer; I was just doing a sanity check.

So there must be a circular reference where two or more internal variables refer to each other, and thus the memory doesn't get freed.
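For illustration, here is a hypothetical minimal cycle (these are not the actual DeepSpeed classes): each object holds a reference to the other, so plain refcounting can never free them; only the cyclic garbage collector can. A quick diagnostic in the notebook is to call gc.collect() after deleting the trainer: if memory drops, it was a cycle; if not, something (e.g. a module-level global) still holds a live reference, and gc.get_referrers() can show who.

```python
import gc

class Engine:
    def __init__(self):
        self.optimizer = Optimizer(self)

class Optimizer:
    def __init__(self, engine):
        self.engine = engine  # back-reference closes the cycle

e = Engine()
del e                  # refcount inside the cycle never hits zero
freed = gc.collect()   # but the cyclic collector does reclaim it
print("unreachable objects collected:", freed)
```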

I emulate a dist env with just:

dist_env_1_gpu = dict(
    MASTER_ADDR="localhost", MASTER_PORT="10999", RANK="0", LOCAL_RANK="0", WORLD_SIZE="1"
)
for k, v in dist_env_1_gpu.items():
    os.environ[k] = v
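Since the env vars above are set globally and persist across test cells, a small helper of my own (a sketch, not an API of transformers or deepspeed) wraps the same emulation in a context manager that restores the previous environment afterwards, so one test's fake dist env can't bleed into the next:

```python
import os
from contextlib import contextmanager

@contextmanager
def dist_env_1_gpu(port="10999"):
    # same variables as the snippet above, applied temporarily
    env = dict(MASTER_ADDR="localhost", MASTER_PORT=port,
               RANK="0", LOCAL_RANK="0", WORLD_SIZE="1")
    saved = {k: os.environ.get(k) for k in env}
    os.environ.update(env)
    try:
        yield
    finally:
        # restore (or remove) each variable to its pre-test state
        for k, v in saved.items():
            if v is None:
                os.environ.pop(k, None)
            else:
                os.environ[k] = v
```

Usage: `with dist_env_1_gpu(): trainer.train()`.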

I'd be happy to try to investigate this on my own if you could help me with pointers at potential suspects.

Thank you!

@jeffra, @samyam
