CUDA out of memory error (only after first epoch) #80

Closed · Samleo8 opened this issue Jun 2, 2020 · 1 comment

Samleo8 commented Jun 2, 2020

Following up from #79: instead of getting stuck on evaluation (yay), it now reports a CUDA out of memory error after running the first epoch:

RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 10.73 GiB total capacity; 6.34 GiB already allocated; 25.62 MiB free; 6.44 GiB reserved in total by PyTorch)

Interestingly, this only happens after the first epoch, and also when I resume from the epoch 0 checkpoint.
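For reference, PyTorch's own memory counters can be printed around the failing call to see how the "already allocated" and "reserved" figures from the error message evolve between epochs. This is a minimal sketch using standard torch.cuda calls (the device index and tag are just examples):

```python
import torch

def print_gpu_memory(tag, device=0):
    # Memory occupied by live tensors vs. memory held by PyTorch's caching
    # allocator (the "reserved in total by PyTorch" figure in the error).
    allocated = torch.cuda.memory_allocated(device) / 1024 ** 3
    reserved = torch.cuda.memory_reserved(device) / 1024 ** 3
    print(f"[{tag}] allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")

# For example, call this at the start and end of each epoch and right after
# loading the checkpoint.
print_gpu_memory("after epoch 0")
```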

Things already done:

  • I freed up all GPU processes (checked with nvidia-smi), so that when the program starts, all 4 x 11 GB of memory are free.
  • Both train and val batch sizes are 1.
  • As per https://stackoverflow.com/questions/60276672/cuda-and-pytorch-memory-usage, I have already included a torch.cuda.empty_cache() before training, which doesn't seem to help (a sketch of what I tried is below, after the questions).
  • Anomaly detection is turned off (it didn't make a difference whether it was on or off).

Questions:

  1. It seems to me that PyTorch is allocating less memory than it actually needs (about 2-3 GB of memory is not being used); is there a way to overcome this?
  2. The StackOverflow post suggests moving some tensors to the CPU, but from the code it seems that you guys may already be doing that?
  3. I suspect the extra weights from training/the checkpoint are causing the problem, but I am not sure how to overcome this.
  4. Did you guys encounter a similar problem during training?
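For reference, here is a minimal sketch of what I mean in points 2 and 3: clearing the cache before training and loading the checkpoint onto the CPU first, so the saved weights don't occupy GPU memory on their own. The model, path, and device below are placeholders, not the actual names used in the repo:

```python
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(8, 8)              # placeholder stand-in for the actual model
checkpoint_path = "checkpoint.pth"   # placeholder checkpoint path

# Release cached blocks back to the allocator before training starts
# (this is the torch.cuda.empty_cache() call mentioned above).
torch.cuda.empty_cache()

# Load the checkpoint onto the CPU so the saved weights never sit on the GPU
# by themselves; they are copied into the model's parameters below.
state_dict = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(state_dict)

# Move the model to the GPU only after the weights have been loaded.
model = model.to(device)
```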

Thank you!


If it helps, the full stack trace is below:

  File "train.py", line 767, in <module>
    main(args)
  File "train.py", line 724, in main
    n_iters_total_train = one_epoch(model, criterion, opt, config, train_dataloader, device, epoch, n_iters_total=n_iters_total_train, is_train=True, master=master, experiment_dir=experiment_dir, writer=writer)
  File "train.py", line 315, in one_epoch
    keypoints_3d_pred, heatmaps_pred, volumes_pred, confidences_pred, cuboids_pred, coord_volumes_pred, base_points_pred = model(images_batch, proj_matricies_batch, batch)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 445, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/learnable-triangulation-pytorch/mvn/models/triangulation.py", line 358, in forward
    volumes = self.volume_net(volumes)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/learnable-triangulation-pytorch/mvn/models/v2v.py", line 166, in forward
    x = self.encoder_decoder(x)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/learnable-triangulation-pytorch/mvn/models/v2v.py", line 135, in forward
    x = self.decoder_upsample1(x)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/learnable-triangulation-pytorch/mvn/models/v2v.py", line 66, in forward
    return self.block(x)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/scleong/.pyenv/versions/vol/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 933, in forward
    output_padding, self.groups, self.dilation)
RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 10.73 GiB total capacity; 6.34 GiB already allocated; 25.62 MiB free; 6.44 GiB reserved in total by PyTorch)
Samleo8 changed the title from "CUDA out of memory error" to "CUDA out of memory error (only after first epoch)" on Jun 2, 2020

Samleo8 (Author) commented Jun 2, 2020

As a follow-up, here is the full nvidia-smi output before the CUDA out of memory error:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 29%   34C    P2    64W / 250W |   8559MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:4C:00.0 Off |                  N/A |
| 32%   49C    P2   111W / 250W |   7384MiB / 10989MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:4D:00.0 Off |                  N/A |
| 31%   42C    P2   114W / 250W |   7384MiB / 10989MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:4E:00.0  On |                  N/A |
| 29%   42C    P2   111W / 250W |   7397MiB / 10986MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4194      C   ...scleong/.pyenv/versions/vol/bin/python3  5047MiB |
|    0      4195      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    0      4196      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    0      4197      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    1      4195      C   ...scleong/.pyenv/versions/vol/bin/python3  7373MiB |
|    2      4196      C   ...scleong/.pyenv/versions/vol/bin/python3  7373MiB |
|    3      4197      C   ...scleong/.pyenv/versions/vol/bin/python3  7373MiB |
|    3      8683      G   /usr/lib/xorg/Xorg                            12MiB |
+-----------------------------------------------------------------------------+

and the output after:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:4B:00.0 Off |                  N/A |
| 29%   34C    P5    29W / 250W |   3512MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:4C:00.0 Off |                  N/A |
| 32%   50C    P2   112W / 250W |   7696MiB / 10989MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:4D:00.0 Off |                  N/A |
| 31%   42C    P2   113W / 250W |   7696MiB / 10989MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:4E:00.0  On |                  N/A |
| 29%   43C    P2   111W / 250W |   7709MiB / 10986MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4195      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    0      4196      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    0      4197      C   ...scleong/.pyenv/versions/vol/bin/python3  1167MiB |
|    1      4195      C   ...scleong/.pyenv/versions/vol/bin/python3  7685MiB |
|    2      4196      C   ...scleong/.pyenv/versions/vol/bin/python3  7685MiB |
|    3      4197      C   ...scleong/.pyenv/versions/vol/bin/python3  7685MiB |
|    3      8683      G   /usr/lib/xorg/Xorg                            12MiB |
+-----------------------------------------------------------------------------+

Incidentally, I also noticed that it throws the out of memory error on GPU 3, where there is also an Xorg process running. Could this be a problem?

UPDATE: It may actually be the problem! I re-ran the training algorithm with only 3 GPUs (ignoring the one that has Xorg running). Perhaps this is an issue with PyTorch only being able to use/partition fixed sizes of memory.
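In case it helps anyone else, this is a minimal sketch of one way to hide the Xorg GPU from PyTorch via CUDA_VISIBLE_DEVICES (it must be set before any CUDA context is created; the repo may also expose its own flag for GPU selection):

```python
import os

# Expose only GPUs 0-2 to this process, skipping GPU 3 (the one running Xorg).
# This must run before torch initializes CUDA.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2")

import torch

print(torch.cuda.device_count())  # should now report 3
```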

Samleo8 closed this as completed on Jun 2, 2020