CUDA out of memory error (only after first epoch) #80

Closed · Samleo8 opened this issue Jun 2, 2020 · 1 comment

Samleo8 commented Jun 2, 2020

Following up from #79: instead of getting stuck on evaluation (yay), it now reports a CUDA out of memory error after running the first epoch:

RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 10.73 GiB total capacity; 6.34 GiB already allocated; 25.62 MiB free; 6.44 GiB reserved in total by PyTorch)

Interestingly, this only happens after the first epoch, and also when I resume from the epoch 0 checkpoint.
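For reference, PyTorch's own memory counters can be printed around the failing call to see how the "already allocated" and "reserved" figures from the error message evolve between epochs. This is a minimal sketch using standard torch.cuda calls (the device index and tag are just examples):

```python
import torch

def print_gpu_memory(tag, device=0):
    # Memory occupied by live tensors vs. memory held by PyTorch's caching
    # allocator (the "reserved in total by PyTorch" figure in the error).
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 3
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 3
    print(f"[{tag}] allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

# For example, call this at the start and end of each epoch and right after
# loading the checkpoint.
print_gpu_memory("after epoch 0")
```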

Things already done:

  • I freed up all GPU processes (checked with nvidia-smi), so that when the program starts, all 4 x 11 GB of memory are free.
  • Both train and val batch sizes are 1.
  • As per https://stackoverflow.com/questions/60276672/cuda-and-pytorch-memory-usage, I have already included a torch.cuda.empty_cache() before training, which doesn't seem to help (a sketch of what I tried is below, after the questions).
  • Anomaly detection is turned off (it didn't make a difference whether it was on or off).

Questions:

  1. It seems to me that PyTorch is allocating less memory than it actually needs (about 2-3 GB of memory is not being used); is there a way to overcome this?
  2. The StackOverflow post suggests moving some tensors to the CPU, but from the code it seems that you guys may already be doing that?
  3. I suspect the extra weights from training/the checkpoint are causing the problem, but I am not sure how to overcome this.
  4. Did you guys encounter a similar problem during training?
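For reference, here is a minimal sketch of what I mean in points 2 and 3: clearing the cache before training and loading the checkpoint onto the CPU first, so the saved weights don't occupy GPU memory on their own. The model, path, and device below are placeholders, not the actual names used in the repo:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 8)              # placeholder stand-in for the actual model
checkpoint_path = "checkpoint.pth"   # placeholder checkpoint path

# Release cached blocks back to the allocator before training starts
# (this is the torch.cuda.empty_cache() call mentioned above).
torch.cuda.empty_cache()

# Load the checkpoint onto the CPU so the saved weights never sit on the GPU
# by themselves; they are copied into the model's parameters below.
state_dict = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(state_dict)

# Move the model to the GPU only after the weights have been loaded.
model = model.to(device)
```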

Thank you!


If it helps, the full stack trace is below:

  File "train.py", line 767, in <module>
    main(args)
  File "train.py", line 724, in main
    n_iters_total_train = one_epoch(model, criterion, opt, config, train_dataloader, device, epoch, n_iters_total=n_iters_total_train, is_train=True, master=master, experiment_dir=experiment_dir, writer=writer)
  File "train.py", line 315, in one_epoch
    keypoints_3d_pred, heatmaps_pred, volumes_pred, confidences_pred, cuboids_pred, coord_volumes_pred, base_points_pred = model(images_batch, proj_matricies_batch, batch)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/learnable-triangulation-pytorch/mvn/models/triangulation.py", line 358, in forward
    volumes = self.volume_net(volumes)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/learnable-triangulation-pytorch/mvn/models/v2v.py", line 166, in forward
    x = self.encoder_decoder(x)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/learnable-triangulation-pytorch/mvn/models/v2v.py", line 135, in forward
    x = self.decoder_upsample1(x)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/learnable-triangulation-pytorch/mvn/models/v2v.py", line 66, in forward
    return self.block(x)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 933, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 10.73 GiB total capacity; 6.34 GiB already allocated; 25.62 MiB free; 6.44 GiB reserved in total by PyTorch)
Samleo8 changed the title from "CUDA out of memory error" to "CUDA out of memory error (only after first epoch)" on Jun 2, 2020

Samleo8 (Author) commented Jun 2, 2020

As a follow-up, here is the full nvidia-smi output before the CUDA out of memory error:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 29%   34C    P2    64W / 250W |   8559MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:4C:00.0 Off |                  N/A |
| 32%   49C    P2   111W / 250W |   7384MiB / 10989MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:4D:00.0 Off |                  N/A |
| 31%   42C    P2   114W / 250W |   7384MiB / 10989MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:4E:00.0  On |                  N/A |
| 29%   42C    P2   111W / 250W |   7397MiB / 10986MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4194      C   ...scleong/.pyenv/versions/vol/bin/python3  5047MiB |
|    0      4195      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    0      4196      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    0      4197      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    1      4195      C   ...scleong/.pyenv/versions/vol/bin/python3  7373MiB |
|    2      4196      C   ...scleong/.pyenv/versions/vol/bin/python3  7373MiB |
|    3      4197      C   ...scleong/.pyenv/versions/vol/bin/python3  7373MiB |
|    3      8683      G   /usr/lib/xorg/Xorg                            12MiB |
+-----------------------------------------------------------------------------+

and the output after:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 29%   34C    P5    29W / 250W |   3512MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:4C:00.0 Off |                  N/A |
| 32%   50C    P2   112W / 250W |   7696MiB / 10989MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:4D:00.0 Off |                  N/A |
| 31%   42C    P2   113W / 250W |   7696MiB / 10989MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:4E:00.0  On |                  N/A |
| 29%   43C    P2   111W / 250W |   7709MiB / 10986MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4195      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    0      4196      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    0      4197      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    1      4195      C   ...scleong/.pyenv/versions/vol/bin/python3  7685MiB |
|    2      4196      C   ...scleong/.pyenv/versions/vol/bin/python3  7685MiB |
|    3      4197      C   ...scleong/.pyenv/versions/vol/bin/python3  7685MiB |
|    3      8683      G   /usr/lib/xorg/Xorg                            12MiB |
+-----------------------------------------------------------------------------+

Incidentally, I also noticed that it throws the out of memory error on GPU 3, where there is also an Xorg process running. Could this be a problem?

UPDATE: It may actually be the problem! I re-ran the training algorithm with only 3 GPUs (ignoring the one that has Xorg running). Perhaps this is an issue with PyTorch only being able to use/partition fixed sizes of memory.
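In case it helps anyone else, this is a minimal sketch of one way to hide the Xorg GPU from PyTorch via CUDA_VISIBLE_DEVICES (it must be set before any CUDA context is created; the repo may also expose its own flag for GPU selection):

```python
import os

# Expose only GPUs 0-2 to this process, skipping GPU 3 (the one running Xorg).
# This must run before torch initializes CUDA.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2")

import torch

print(torch.cuda.device_count())  # should now report 3
```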

Samleo8 closed this as completed on Jun 2, 2020