
GPU ram memory increase until overflow when using PSNR and SSIM #2597

Open
ouioui199 opened this issue Jun 14, 2024 · 2 comments
Labels
bug / fix Something isn't working question Further information is requested v1.3.x

Comments

ouioui199 commented Jun 14, 2024

🐛 Bug

Hello all,

I'm implementing CycleGAN with Lightning and use PSNR and SSIM from torchmetrics for evaluation.
During training, GPU RAM usage increases non-stop until it overflows and the whole training shuts down.
This might be similar to #2481.

To Reproduce

Add this to the __init__ method of the model class:

self.train_metrics = MetricCollection({"PSNR": PeakSignalNoiseRatio(), "SSIM": StructuralSimilarityIndexMeasure()})
self.valid_metrics = self.train_metrics.clone(prefix='val_')

In the training_step method:
train_metrics = self.train_metrics(fake, real)

In the validation_step method:
valid_metrics = self.valid_metrics(fake, real)
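
Since the snippets above are only fragments, here is a minimal self-contained sketch of how they fit together (the module name, placeholder generator, loss, and optimizer below are assumptions for illustration, not the original CycleGAN code):

import torch
from torch import nn
import pytorch_lightning as pl
from torchmetrics import MetricCollection, PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure

class LitModel(pl.LightningModule):  # hypothetical stand-in for the CycleGAN module
    def __init__(self):
        super().__init__()
        self.generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder generator
        self.train_metrics = MetricCollection({"PSNR": PeakSignalNoiseRatio(), "SSIM": StructuralSimilarityIndexMeasure()})
        self.valid_metrics = self.train_metrics.clone(prefix='val_')

    def training_step(self, batch, batch_idx):
        real = batch
        fake = self.generator(real)
        loss = nn.functional.l1_loss(fake, real)  # placeholder loss
        # fake still carries grad_fn here, so the metric values end up attached to the graph
        train_metrics = self.train_metrics(fake, real)
        self.log_dict(train_metrics)
        return loss

    def validation_step(self, batch, batch_idx):
        real = batch
        fake = self.generator(real)
        valid_metrics = self.valid_metrics(fake, real)
        self.log_dict(valid_metrics)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=2e-4)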

Environment

  • TorchMetrics version: 1.3.0 installed via pip
  • Python: 3.11.7
  • PyTorch: 2.1.2
  • Issue encountered when training on Windows 10

Easy fix proposition

I tried to debug the code.
When inspecting train_metrics, I get this:

"{'PSNR': tensor(10.5713, device='cuda:0', grad_fn=<SqueezeBackward0>), 'SSIM': tensor(0.0373, device='cuda:0', grad_fn=<SqueezeBackward0>)}"

which is weird because metric values aren't supposed to be attached to the computational graph.
When inspecting valid_metrics, I don't see any grad_fn.
Guessing that's the issue, I tried calling fake.detach() when computing train_metrics.
Now the training is stable and GPU memory no longer keeps growing.
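
In training_step, the workaround looks like the sketch below (same placeholder generator and loss as in the sketch above; only the detach() call is new):

    def training_step(self, batch, batch_idx):
        real = batch
        fake = self.generator(real)
        loss = nn.functional.l1_loss(fake, real)  # placeholder loss
        # detach the generator output before updating the metrics so the metric
        # update does not keep a reference to the computational graph
        train_metrics = self.train_metrics(fake.detach(), real)
        self.log_dict(train_metrics)
        return loss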

@ouioui199 ouioui199 added bug / fix Something isn't working help wanted Extra attention is needed labels Jun 14, 2024

Hi! Thanks for your contribution, great first issue!


Borda commented Aug 21, 2024

@ouioui199 Looking at your example (could you please share the full sample code?), I'm wondering whether you also call compute in the epoch-end hook?
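
For reference, a typical epoch-end pattern with torchmetrics in Lightning looks like the sketch below (a generic pattern, not taken from the reporter's code):

    def on_train_epoch_end(self):
        # aggregate the metric states accumulated during the epoch, then reset them
        epoch_metrics = self.train_metrics.compute()
        self.log_dict(epoch_metrics)
        self.train_metrics.reset()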

@Borda Borda added question Further information is requested and removed help wanted Extra attention is needed labels Aug 21, 2024