GPU Memory Leak on Loading Pre-Trained Checkpoint #6515
Comments
@bilzard thanks for the PR! Would it make more sense (less code or easier to understand) to just load directly on CPU with one of these other options?
state_dict = torch.load(directory, map_location=lambda storage, loc: storage)  # option 1
state_dict = torch.load(directory)  # option 2
state_dict = torch.load(directory, map_location=torch.device('cpu'))  # option 3
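For reference, a quick way to see where each option actually puts the loaded tensors (a sketch; 'checkpoint.pt' is a placeholder and assumes the file holds a plain state_dict saved from a GPU run):

import torch

path = 'checkpoint.pt'  # placeholder for the actual weights file
for name, kwargs in [('option 1', dict(map_location=lambda storage, loc: storage)),
                     ('option 2', {}),
                     ('option 3', dict(map_location=torch.device('cpu')))]:
    state_dict = torch.load(path, **kwargs)
    devices = {t.device for t in state_dict.values() if torch.is_tensor(t)}
    print(name, devices)  # options 1 and 3 report cpu; option 2 reports the device the checkpoint was saved from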
@glenn-jocher Option 2 shouldn't work, because by default torch.load restores the tensors to the GPU they were saved from.
I checked that option 3 works on my server (GPU memory usage didn't increase).
$ watch nvidia-smi # map_location = cpu, --weights='yolov5l'
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01 Driver Version: 510.39.01 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 61C P0 41W / 70W | 14603MiB / 15360MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 4509 C python 14599MiB |
+-----------------------------------------------------------------------------+
$ watch nvidia-smi # map_location = cpu, --weights=path_to_pretrained.pt
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01 Driver Version: 510.39.01 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 35C P0 71W / 70W | 14541MiB / 15360MiB | 91% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2669 C python 14537MiB |
+-----------------------------------------------------------------------------+
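The same comparison can be made from inside Python with torch.cuda.memory_allocated(), which reports the CUDA memory PyTorch itself has allocated (a sketch; assumes a CUDA device and that 'path_to_pretrained.pt' is a checkpoint saved from a GPU run):

import torch

def cuda_mib():
    return torch.cuda.memory_allocated() / 2**20

print(f'baseline:           {cuda_mib():.0f} MiB')

ckpt = torch.load('path_to_pretrained.pt')                      # default: tensors are restored to cuda:0
print(f'default map:        {cuda_mib():.0f} MiB')              # grows by roughly the checkpoint size
del ckpt
torch.cuda.empty_cache()

ckpt = torch.load('path_to_pretrained.pt', map_location='cpu')  # option 3: tensors stay in CPU memory
print(f"map_location='cpu': {cuda_mib():.0f} MiB")              # unchanged from the baseline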
I think we should also fix the code that loads the model for Torch Hub, but I don't know how to test it. https://github.com/ultralytics/yolov5/blob/master/hubconf.py#L52
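One rough way to test the Hub path without touching hubconf.py: load the model through the documented torch.hub.load entrypoint, then check where the weights ended up and how much CUDA memory PyTorch has allocated (a sketch; 'yolov5s' is just an example model name):

import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')            # documented Hub entrypoint
print(next(model.parameters()).device)                             # device the weights ended up on
if torch.cuda.is_available():
    print(torch.cuda.memory_allocated() / 2**20, 'MiB allocated')  # CUDA memory held by PyTorch after loading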
FYI: I fixed the code as in option 3.
Good question. For PyTorch Hub we may want to leave it as is for startup speed. Since PyTorch Hub models may be used in APIs like https://ultralytics.com/yolov5 that are only called once, response time may matter more than slightly reducing CUDA usage. Another point is that simple inference uses much less CUDA memory than training, maybe only about 1/3 or 1/2 of training memory. But I'm also not sure; it would need some study.
O.K. We would need to study the response time before changing the loading method, so I will leave it as is for now.
@bilzard yes, that's correct. For training an extra second on initialization won't matter, nor in val.py, but for detect.py and PyTorch Hub we probably want to prioritize the fastest time to first results.
Search before asking
YOLOv5 Component
Training
Bug
Training YOLO from a checkpoint (*.pt) consumes more GPU memory than training from pre-trained weights (e.g. yolov5l).
Environment
Minimal Reproducible Example
In the training commands below, case 2 requires more GPU memory than case 1.
Additional
As reported on the PyTorch forum [1], loading a state dict on a CUDA device causes a memory leak. We should load it into CPU memory instead:
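A minimal, self-contained sketch of the fix (option 3), not the actual YOLOv5 code: deserialize the checkpoint into CPU memory, copy the weights into the freshly built model, then move only the live model to the GPU.

import torch
import torch.nn as nn

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 10)                                   # stand-in for the real model
torch.save({'state_dict': model.state_dict()}, 'ckpt.pt')   # stand-in checkpoint; the key name is illustrative

ckpt = torch.load('ckpt.pt', map_location='cpu')            # no CUDA memory is allocated during loading
model.load_state_dict(ckpt['state_dict'])
model.to(device)                                            # GPU memory is used only by the live model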
Are you willing to submit a PR?