CUDA out of memory #14
Comments
I got the same problem. How do you solve it?
If memory problems occur during training, try reducing the batch size, e.g. the one used in evaluate_cos_SOP(X, T, normalize=False) in utils.py.
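For concreteness, here is a rough sketch of what a lower-memory evaluation could look like. This is an illustration only, not the repository's actual utils.py: only the function name and the (X, T, normalize) signature are taken from the comment above, while chunk_size and the Ks tuple are assumed values.

# Illustrative sketch only -- not the repository's actual utils.py code.
# Computes Recall@K in chunks so the full N x N cosine-similarity matrix
# never has to fit in GPU memory at once.
# X: (N, D) embeddings, T: (N,) integer labels, both assumed on the same device.
import torch
import torch.nn.functional as F

def evaluate_cos_SOP(X, T, normalize=False, chunk_size=1000, Ks=(1, 10, 100, 1000)):
    if normalize:
        X = F.normalize(X, p=2, dim=1)
    max_k = max(Ks)
    correct = {k: 0 for k in Ks}
    for start in range(0, X.size(0), chunk_size):
        chunk = X[start:start + chunk_size]                 # (B, D) query block
        sim = chunk @ X.t()                                 # (B, N) cosine similarities
        idx0 = torch.arange(chunk.size(0), device=X.device)
        sim[idx0, idx0 + start] = float('-inf')             # ignore self-matches
        _, nn_idx = sim.topk(max_k, dim=1)                  # indices of nearest neighbours
        retrieved = T[nn_idx]                               # (B, max_k) neighbour labels
        query_lbl = T[start:start + chunk.size(0)].unsqueeze(1)
        for k in Ks:
            correct[k] += (retrieved[:, :k] == query_lbl).any(dim=1).sum().item()
    return [correct[k] / X.size(0) for k in Ks]             # Recall@K for each K in Ks

Lowering chunk_size trades some evaluation speed for a much smaller peak (B, N) similarity matrix, which can be where SOP-scale evaluation runs out of GPU memory.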
Thank you for your reply, the problem has been solved.
python train.py --gpu-id 0 --loss Proxy_Anchor --model resnet50 --embedding-size 512 --batch-size 180 --lr 6e-4 --dataset SOP --warm 1 --bn-freeze 0 --lr-decay-step 20 --lr-decay-gamma 0.25
wandb: Currently logged in as: shute (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.10.10 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.8
wandb: Syncing run eager-dream-16
wandb: ⭐️ View project at https://wandb.ai/shute/SOP_ProxyAnchor
wandb: 🚀 View run at https://wandb.ai/shute/SOP_ProxyAnchor/runs/2ca7m47a
wandb: Run data is saved locally in wandb/run-20201113_090754-2ca7m47a
wandb: Run `wandb off` to turn off syncing.
Random Sampling
Training parameters: {'LOG_DIR': '../logs', 'dataset': 'SOP', 'sz_embedding': 512, 'sz_batch': 180, 'nb_epochs': 60, 'gpu_id': 0, 'nb_workers': 4, 'model': 'resnet50', 'loss': 'Proxy_Anchor', 'optimizer': 'adamw', 'lr': 0.0006, 'weight_decay': 0.0001, 'lr_decay_step': 20, 'lr_decay_gamma': 0.25, 'alpha': 32, 'mrg': 0.1, 'IPC': None, 'warm': 1, 'bn_freeze': 0, 'l2_norm': 1, 'remark': ''}
Training for 60 epochs.
0it [00:00, ?it/s]/home/server8/lst/Proxy-Anchor-CVPR2020-master/code/losses.py:48: UserWarning: This overload of nonzero is deprecated:
nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
nonzero(Tensor input, *, bool as_tuple) (Triggered internally at /opt/conda/conda-bld/pytorch_1595629411241/work/torch/csrc/utils/python_arg_parser.cpp:766.)
with_pos_proxies = torch.nonzero(P_one_hot.sum(dim = 0) != 0).squeeze(dim = 1) # The set of positive proxies of data in the batch
Train Epoch: 0 [330/330 (100%)] Loss: 10.849229: : 330it [01:34, 3.49it/s]
Evaluating...
100%|██████████| 337/337 [01:25<00:00, 3.95it/s]
R@1 : 51.770
R@10 : 67.938
R@100 : 81.594
R@1000 : 92.909
0it [00:00, ?it/s]
Traceback (most recent call last):
File "train.py", line 290, in
m = model(x.squeeze().cuda())
File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/server8/lst/Proxy-Anchor-CVPR2020-master/code/net/resnet.py", line 175, in forward
x = self.model.layer1(x)
File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torchvision/models/resnet.py", line 112, in forward
out = self.conv3(out)
File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in forward
return self._conv_forward(input, self.weight)
File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 415, in _conv_forward
return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 552.00 MiB (GPU 0; 7.80 GiB total capacity; 6.09 GiB already allocated; 392.69 MiB free; 6.44 GiB reserved in total by PyTorch)
wandb: Waiting for W&B process to finish, PID 4172153
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb:
wandb: Find user logs for this run at: wandb/run-20201113_090754-2ca7m47a/logs/debug.log
wandb: Find internal logs for this run at: wandb/run-20201113_090754-2ca7m47a/logs/debug-internal.log
wandb: Run summary:
wandb: loss 12.3987
wandb: R@1 0.5177
wandb: R@10 0.67938
wandb: R@100 0.81594
wandb: R@1000 0.92909
wandb: _step 0
wandb: _runtime 193
wandb: _timestamp 1605276667
wandb: Run history:
wandb: loss ▁
wandb: R@1 ▁
wandb: R@10 ▁
wandb: R@100 ▁
wandb: R@1000 ▁
wandb: _step ▁
wandb: _runtime ▁
wandb: _timestamp ▁
wandb:
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced eager-dream-16: https://wandb.ai/shute/SOP_ProxyAnchor/runs/2ca7m47a
CUDA out of memory in the second epoch.
I have tried setting the batch size to 30, 100, 150, and 180. Nothing helps.
PyTorch 1.6
CUDA 10.1
GPU: RTX 2080 Super (8 GB)
I have spent many hours on this but still cannot solve it.
Many thanks for your help.
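Since the failure hits the training forward pass right after the first evaluation finishes, it is also worth checking that the evaluation pass runs under torch.no_grad() and does not keep its features on the GPU. Below is a minimal sketch of that pattern, where extract_features and dataloader are placeholder names rather than code from this repository.

import torch

@torch.no_grad()                       # no autograd graph is built for the eval forward passes
def extract_features(model, dataloader, device='cuda'):
    model.eval()
    feats = []
    for x, _ in dataloader:            # assumes the loader yields (image, label) pairs
        feats.append(model(x.to(device)).cpu())   # move embeddings off the GPU right away
    model.train()
    return torch.cat(feats)

# after evaluation, before the next training epoch starts:
torch.cuda.empty_cache()               # hand cached, unused blocks back to the CUDA driver

This does not lower what training itself needs, but it stops evaluation features and autograd state from carrying over into the next epoch, which matches an out-of-memory error that only shows up after the first evaluation.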