Pytorch Profiler causes memory leak #10717

Closed
nils-werner opened this issue Nov 23, 2021 · 7 comments · Fixed by #10837
Labels: bug (Something isn't working), priority: 0 (High priority task), profiler

Comments

@nils-werner

nils-werner commented Nov 23, 2021

🐛 Bug

It seems that choosing the PyTorch profiler causes an ever-growing amount of RAM to be allocated. The growth even continues after training has finished, probably while the profiler data is being processed.

After a certain number of epochs, this leads to an out-of-memory (OOM) condition and the kernel kills the process.

To Reproduce

To reproduce, simply enable the profiler on one of the provided examples:

cd pl_examples/basic_examples/mnist_examples
python image_classifier_5_lightning_datamodule.py --trainer.profiler=pytorch --trainer.gpus=1

On my machine, the process runs out of memory and gets killed partway through epoch 3.
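
One way to confirm the growth is to log the peak resident memory of the process after every epoch. The following is a minimal sketch, assuming Linux (where `ru_maxrss` is reported in KiB). `peak_rss_mib` and `log_peak_rss` are hypothetical helper names, and `run_epoch` stands in for the real training loop:

```python
import resource


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (assumes Linux, KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_peak_rss(run_epoch, epochs):
    """Call run_epoch(epoch) for each epoch and record peak RSS afterwards."""
    readings = []
    for epoch in range(epochs):
        run_epoch(epoch)
        readings.append(peak_rss_mib())
        print(f"epoch {epoch}: peak RSS {readings[-1]:.1f} MiB")
    return readings


# Stand-in workload (an assumption); replace with the real train/test calls.
readings = log_peak_rss(lambda epoch: sum(range(100_000)), epochs=3)
```

If the readings keep climbing from epoch to epoch with the profiler enabled but level off without it, that points at the profiler rather than the model.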

Expected behavior

The memory leak does not occur.

Environment

* CUDA:
        - GPU:
                - NVIDIA GeForce GTX 1060 6GB
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.21.4
        - pyTorch_debug:     False
        - pyTorch_version:   1.10.0+cu102
        - pytorch-lightning: 1.6.0dev
        - tqdm:              4.62.3
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         
        - python:            3.8.12
        - version:           #1 ZEN SMP PREEMPT Thu, 18 Nov 2021 22:23:53 +0000

Additional context

I am aware that this might be caused by PyTorch rather than Lightning, and I am currently trying to reproduce the issue in plain PyTorch. If I can reproduce it there, this issue can of course be triaged to them.

cc @tchaton @carmocca @kaushikb11 @ninginthecloud

@nils-werner nils-werner added the bug Something isn't working label Nov 23, 2021
@nils-werner
Author

nils-werner commented Nov 24, 2021

I have noticed the same issue in plain PyTorch when using torch.autograd.profiler.profile() outside of nn.Modules, i.e. when the profiled region also contains data loading:

for epoch in range(epochs):
    with torch.autograd.profiler.profile(use_cuda=True) as prof:
        train()
        test()

but it is OK if you use it inside your nn.Module, i.e. when you are only profiling the math ops:

class Net(nn.Module):
    # ...
    def forward(self, x):
        with torch.autograd.profiler.profile(use_cuda=True) as prof:
            # ...

@nils-werner
Author

nils-werner commented Nov 24, 2021

I can reproduce a memory leak in long-running profiling tasks using torch.profiler.profile(), too:

for epoch in range(epochs):
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        train()
        test()

which can be prevented by using a schedule:

for epoch in range(epochs):
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        record_shapes=True,
        schedule=torch.profiler.schedule(
            wait=1,
            warmup=1,
            active=2
        ),
    ) as prof:
        train()
        test()
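
To make the schedule semantics concrete, here is a simplified pure-Python model of how a `wait`/`warmup`/`active` schedule maps step numbers to profiler actions. It is a sketch of my understanding, not the torch implementation: `schedule_action` is a hypothetical name, and the action names are plain strings rather than `ProfilerAction` members.

```python
def schedule_action(step, wait, warmup, active):
    """Toy model: map a step index to the action taken at that step."""
    cycle = wait + warmup + active
    pos = step % cycle  # the cycle repeats indefinitely in this model
    if pos < wait:
        return "NONE"  # idle, nothing is collected
    if pos < wait + warmup:
        return "WARMUP"  # tracing is on, but results are discarded
    if pos < cycle - 1:
        return "RECORD"
    return "RECORD_AND_SAVE"  # last active step of the cycle


actions = [schedule_action(s, wait=1, warmup=1, active=2) for s in range(8)]
print(actions)
```

As far as I understand, the profiler only advances through such a schedule when `prof.step()` is called. The snippet above never calls it, so under this model it would stay at step 0, i.e. in the `wait` phase where nothing is recorded, which may be why the memory growth disappears.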

@nils-werner
Author

nils-werner commented Nov 24, 2021

I am just stabbing at the source code a bit here: if I remove the block at profiler/pytorch.py:415-421

# the default schedule requires a minimum of 5 steps to properly work: `wait=1, warmup=1, active=3`.
# otherwise, this will raise a `segmentation fault`.
if self._should_override_schedule():
    warning_cache.warn(
        "The PyTorch Profiler default schedule will be overridden as there is not enough "
        "steps to properly record traces."
    )
    self._schedule = None
    self.profiler.schedule = torch.profiler.profiler._default_schedule_fn

then the MNIST example immediately consumes 4.6GB of RAM, but does not seem to leak it.

Note that with the profiler disabled entirely, training only uses 380MB of RAM.

@tchaton tchaton added the priority: 0 High priority task label Nov 24, 2021
@nils-werner
Author

nils-werner commented Nov 24, 2021

In general I find it a little bit strange that self._schedule is changed in PyTorchProfiler.stop(). start() and stop() are called repeatedly during training (once per batch?), which means the schedule changes after the first batch.

If I put a print(self._schedule) directly at the beginning of stop(), I see the following output:

Training: 0it [00:00, ?it/s]
<pytorch_lightning.profiler.pytorch.ScheduleWrapper object at 0x7ff737d75310>
Epoch 0:   0%|                               | 0/1875 [00:00<?, ?it/s]
<pytorch_lightning.profiler.pytorch.ScheduleWrapper object at 0x7ff737d75310>
<pytorch_lightning.profiler.pytorch.ScheduleWrapper object at 0x7ff737d75310>
<pytorch_lightning.profiler.pytorch.ScheduleWrapper object at 0x7ff737d75310>
# ...
/home/nils/Arbeit/repro/pytorch-lightning/pytorch_lightning/profiler/pytorch.py:417: UserWarning: The PyTorch Profiler default schedule will be overridden as there is not enough steps to properly record traces.
  warning_cache.warn(
None
Epoch 0:   0%|                               | 1/1875 [00:00<01:53, 16.53it/s, loss=2.33, v_num=31]
None
None
# ...

@nils-werner
Author

nils-werner commented Nov 24, 2021

OK, and if I move the entire block

# the default schedule requires a minimum of 5 steps to properly work: `wait=1, warmup=1, active=3`.
# otherwise, this will raise a `segmentation fault`.
if self._should_override_schedule():
    warning_cache.warn(
        "The PyTorch Profiler default schedule will be overridden as there is not enough "
        "steps to properly record traces."
    )
    self._schedule = None
    self.profiler.schedule = torch.profiler.profiler._default_schedule_fn

out of stop() and to the end of _init_kineto(), the schedule remains constant during training and the leak is gone. Note that I am still just poking at the source code here and am not sure whether _init_kineto() is indeed the correct place for this block.
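
The difference between the two placements can be illustrated with a toy class. This is a hypothetical stand-in, not Lightning's `PyTorchProfiler`: the `total_steps < 5` condition is an assumption loosely based on the "minimum of 5 steps" comment in the quoted block, since the body of `_should_override_schedule()` is not shown in this thread.

```python
class ToyProfiler:
    """Hypothetical stand-in used only to show where the override happens."""

    def __init__(self, total_steps, decide_in_init):
        self.total_steps = total_steps
        self.decide_in_init = decide_in_init
        self.schedule = "custom"
        self.seen = []  # schedule in effect at each stop() call
        if decide_in_init:
            self._maybe_override()  # decided once, before training starts

    def _maybe_override(self):
        # Assumed condition: too few steps for the default schedule.
        if self.total_steps < 5:
            self.schedule = "fallback"

    def stop(self):
        self.seen.append(self.schedule)
        if not self.decide_in_init:
            self._maybe_override()  # mutates state in the middle of training


late = ToyProfiler(total_steps=3, decide_in_init=False)
early = ToyProfiler(total_steps=3, decide_in_init=True)
for _ in range(3):
    late.stop()
    early.stop()

print(late.seen)
print(early.seen)
```

With the override inside `stop()`, the first call still sees the original schedule and later calls see a different one. With the decision made once at initialization, every call sees the same schedule.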

@rohitgr7
Contributor

@nils-werner thanks for raising this issue and for the pointers. 😃
Can you try installing the PR branch and check whether the issue still exists?

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@fix/pt_prof_leak

@nils-werner
Author

Yes, this PR fixes the issue at my end.
