Memory usage is higher than other PyTorch implementations? #182
Hi, here are the memory requirements that we have for Faster R-CNN R-50 on COCO. One thing to keep in mind is the meaning of the `max mem` value in the logs: it tracks the memory occupied by PyTorch tensors, not everything nvidia-smi attributes to the process. Maybe that answers your question?
My environment:
Training parameters:
GPU memory usage: really high.
Oh, you should not be looking at the output from nvidia-smi, but at the `max mem` value reported in the training logs. The reason why we do it is because a large part of what nvidia-smi shows is memory held by PyTorch's caching allocator (plus the CUDA context), so it overestimates what the model itself needs. Let me know if you have more questions
Thanks for your explanation. The max mem in the log file is about 3.5 GB.
Cool, let me know if something else is not clear. @XudongWang12Sigma does that answer your question as well?
Hi, thank you for your reply. In fact, when I was training Faster R-CNN using this repo https://github.com/jwyang/faster-rcnn.pytorch, I also used nvidia-smi to track the memory usage, and the usage shown in the terminal was about 2.5 GB lower than with your repo. I double-checked my code: I used 8 GPUs, the default settings, and the COCO dataset. So I am not sure whether I can do something to lower the memory usage, or at least bring it down to the level of jwyang's repo in nvidia-smi? I need to train on some datasets with larger image sizes, and it sometimes runs out of memory. Thanks
Hi, one place where the memory usage can currently be made more efficient is the box IoU computation, see #18. A current workaround is to compute the IoU on the CPU. Let me know if that addresses the issue for you, but I'll look into improving the memory usage of that function.
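If it helps, here is a minimal sketch of that kind of workaround (a generic pairwise IoU helper written for illustration, not the repo's actual `boxlist_iou`): the boxes are moved to the CPU so the large broadcasted intermediates never live in GPU memory.

```python
import torch

def pairwise_iou_cpu(boxes1, boxes2):
    """IoU between two sets of boxes in (x1, y1, x2, y2) format.

    Computed on the CPU so the (N, M, 2) intermediates do not consume GPU memory.
    """
    device = boxes1.device
    boxes1, boxes2 = boxes1.cpu(), boxes2.cpu()

    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])  # (N,)
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])  # (M,)

    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])  # (N, M, 2) intersection top-left
    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])  # (N, M, 2) intersection bottom-right
    wh = (rb - lt).clamp(min=0)                         # (N, M, 2) width/height, 0 if no overlap
    inter = wh[:, :, 0] * wh[:, :, 1]                   # (N, M)

    iou = inter / (area1[:, None] + area2 - inter)
    return iou.to(device)  # move the (much smaller) result back to the original device
```

The transfers cost some time, but the GPU only ever holds the final N×M matrix instead of the broadcasted intermediates.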
Also, one thing to do is to print the memory used by the other repo by calling `torch.cuda.max_memory_allocated()` there as well.
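For example (assuming the other training loop is plain PyTorch), a single line like this at the end of an epoch gives a number comparable to the `max mem` value in this repo's logs:

```python
import torch

# peak memory ever occupied by tensors on the current GPU, in MB
print("max mem: {:.0f} MB".format(torch.cuda.max_memory_allocated() / (1024.0 * 1024.0)))
```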
@fmassa The actual memory required may be larger than what `torch.cuda.max_memory_allocated()` / `max_memory_cached()` report, since the memory taken by the CUDA context is not counted there. Here are some memory usage data captured from experiments.
Hi @hellock,

Indeed, the memory allocated by the CUDA driver (which can go up to 1GB or more, and happens when you initialize CUDA) is not counted in `torch.cuda.max_memory_allocated()` / `max_memory_cached()`. I still think though that those counters are a better basis for comparing implementations than nvidia-smi.

Simple example

As a basic example (run in a new interpreter):

```python
import torch

for i in range(10):
    a = torch.rand(1000, 1000, i + 1, device='cuda:0')
    print(torch.cuda.max_memory_allocated(), torch.cuda.max_memory_cached())
```

will print the peak allocated / cached values after each iteration,
while (again in a new interpreter)

```python
import torch

for i in range(10):
    a = torch.rand(1000, 1000, i + 1, device='cuda:0')
    print(torch.cuda.max_memory_allocated(), torch.cuda.max_memory_cached())
    del a  # now a is not in scope anymore
```

gives noticeably lower `max_memory_allocated` values,
which indicates that what we might indeed want to log is `max_memory_cached()` as well, since that is much closer to what the process actually reserves on the GPU. In the first example, because the previous `a` is still alive when the next tensor is allocated, the peak allocated memory is roughly the sum of the two largest tensors, while deleting `a` first brings it down to the largest single tensor. In both cases though, we see that `max_memory_cached()` is significantly larger than `max_memory_allocated()`, because the caching allocator keeps freed blocks around for reuse. Let me know what you think
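A small sketch of what such logging could look like (the helper is illustrative, not the repo's actual logger): report both peaks so the gap between tensor memory and allocator footprint stays visible.

```python
import torch

def log_gpu_memory(prefix=""):
    # max_memory_allocated: peak memory occupied by live tensors
    # max_memory_cached: peak memory the caching allocator has requested from the driver,
    # which is closer to what nvidia-smi attributes to the process
    mb = 1024.0 * 1024.0
    print("{} allocated: {:.0f} MB, cached: {:.0f} MB".format(
        prefix,
        torch.cuda.max_memory_allocated() / mb,
        torch.cuda.max_memory_cached() / mb,
    ))
```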
@fmassa Thanks for your reply. I agree that the cached memory better reflects the actual usage.
The memory usage is very high when I am training on the VisDrone dataset (http://www.aiskyeye.com/). Maybe there are many objects in this dataset.

My environment is as follows:
OS: Ubuntu 18.04.1 LTS
Python version: 2.7
Nvidia driver version: 410.48
Versions of relevant libraries:
@jario-jin There is probably an image in your dataset which contains a lot of objects.
Any progress on making a more accurate memory indicator?
@qianyizhang we have recently discovered that, for newer generations of GPUs, allocations via cudaMalloc seem to be rounded up to blocks of 2MB (which could potentially be split by CUDA afterwards). This was not being taken into account by PyTorch's caching allocator, so whenever we allocated a tensor whose size was not a multiple of 2MB, part of the block was wasted without being reported. Here is an example (which should allocate 10 GB):

```python
import torch

a = []  # keep references so the tensors are not freed
for _ in range(10000):
    a.append(torch.empty(1048576 + 1, dtype=torch.uint8, device='cuda'))
```

but instead we would get an out-of-memory error.
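A back-of-the-envelope check (my own numbers, not from the original report) of why that loop fails: every request is just over the 1 MiB mark, so each one is served from its own 2 MiB block and the real footprint is roughly double what was asked for.

```python
# each tensor asks for 1 MiB + 1 byte, but is backed by a full 2 MiB block
requested = 10000 * (1048576 + 1)        # ~10.5 GB actually requested
consumed = 10000 * 2 * 1024 * 1024       # ~21 GB taken from the GPU
print(requested / 1e9, consumed / 1e9)   # 10.48577 20.97152
```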
I believe there is a patch in the works on the PyTorch side which should reduce this gap, and it is currently being tested to check that there are no adverse effects.
Here is a PR that improves the memory reporting in PyTorch. It should allow fitting larger batch sizes with this repo.
Thanks! Will try.
Just FYI, the PR is now merged and is part of pytorch-nightly.
@fmassa this post seems to be talking about why the logged memory is more accurate, which seems pointless to me, because what nvidia-smi reflects is the actual memory needed to run the model (whether the code hits an out-of-memory error depends on it). So I am wondering: is there a way to make the model run with less memory as reported by nvidia-smi? Right now I am running out of memory with a batch size of 1 on a 2080 Ti, which seems ridiculous. Or is it normal that the model in this repo is so memory intensive?
@yxchng the model is generally not memory intensive, and the values reported by nvidia-smi are not representative of the total amount of memory required by the model. If you are running out of memory, maybe you have a lot (> 50) of ground-truth boxes in your image and the IoU computation is a bottleneck? Check #18
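To give a feel for the numbers (hypothetical sizes, for illustration only): the IoU computation broadcasts every anchor/proposal against every ground-truth box, so its intermediates grow with N × M.

```python
# hypothetical image with ~240k FPN anchors and 500 ground-truth boxes
n_anchors, n_gt = 240000, 500
# lt, rb and wh intermediates hold 2 floats per pair each, plus 1 for the IoU matrix
floats_per_pair = 2 + 2 + 2 + 1
bytes_needed = n_anchors * n_gt * floats_per_pair * 4  # float32
print(bytes_needed / 1024.0 ** 3, "GiB")  # ~3.1 GiB if all intermediates are alive at once
```

That is why a single image with hundreds of objects can push a GPU over the edge even at batch size 1.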
@fmassa I am using the COCO 2017 dataset with config
@yxchng is this during training or during testing? Also, the memory reported is per GPU, so if there is 1 image per GPU, then you should consider 7 GB per image.
@fmassa
❓ Questions and Help
I am trying to run Faster R-CNN with a ResNet-50 backbone on the COCO dataset, and it seems that the memory usage for your implementation is 9.4 GB for 2 images/GPU (scale = 800). For the implementation at https://github.com/jwyang/faster-rcnn.pytorch, Faster R-CNN with ResNet-50 only needs around 6 GB. Your memory usage is also larger than the official Detectron (7.2 GB with FPN). I am not sure whether I made a mistake, or whether your implementation really does occupy more memory. Thanks
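For what it's worth, one way to make the comparison apples-to-apples (a sketch that works in any plain PyTorch training script) is to print the peak tensor memory per GPU at the end of training in both codebases, instead of comparing nvidia-smi readings:

```python
import torch

# report the peak tensor memory on every GPU, so both implementations
# are compared on the same metric rather than on nvidia-smi output
for d in range(torch.cuda.device_count()):
    peak_gb = torch.cuda.max_memory_allocated(d) / 1024.0 ** 3
    print("GPU {}: {:.2f} GB peak allocated".format(d, peak_gb))
```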