
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR fix #1555

Closed
wants to merge 1 commit

Conversation

louis-she

@louis-she commented Nov 30, 2020

The issue is not general and can only be reproduced in certain environments.

A possible cause of the bug is the non-fixed size of the labels yielded from the dataloader: the collate_fn produces a label tensor whose size changes with every batch (see the sketch below the linked issues).

reference: Pinned Host Memory

#1546
#1547
#1573
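
To illustrate the suspected pattern, here is a minimal, hypothetical sketch (not YOLOv5's actual dataloader code): the collate function concatenates per-image labels, so every batch yields a label tensor of a different size, and with pin_memory=True each of those variable-sized batches is copied into page-locked host memory.

# Hypothetical sketch: variable-sized label batches combined with pin_memory=True.
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDetectionDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, i):
        img = torch.rand(3, 64, 64)            # fixed-size image
        n = int(torch.randint(1, 10, (1,)))    # variable number of objects
        labels = torch.rand(n, 5)              # one (class, x, y, w, h) row per object
        return img, labels

def collate_fn(batch):
    imgs, labels = zip(*batch)
    # Images stack to a fixed shape, but the concatenated label tensor
    # has a different number of rows for every batch.
    return torch.stack(imgs, 0), torch.cat(labels, 0)

loader = DataLoader(ToyDetectionDataset(), batch_size=8,
                    collate_fn=collate_fn, pin_memory=True)

for imgs, labels in loader:
    print(imgs.shape, labels.shape)  # labels.shape[0] varies per batch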

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved DataLoader flexibility in YOLOv5.

📊 Key Changes

  • Removed the pin_memory=True argument from DataLoader instantiation (see the sketch below).

🎯 Purpose & Impact

  • This change is aimed at making the data loading process more adaptable, as pin_memory may not always be necessary or beneficial depending on the user's hardware setup.
  • Users might experience different memory usage characteristics, potentially reducing memory overhead on systems where pinning memory doesn't provide performance benefits.
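
For reference, a minimal before/after sketch of the change described above (illustrative only; the dataset and parameter values are hypothetical, and this is not the literal YOLOv5 diff):

# Illustrative sketch, not the actual YOLOv5 code.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.rand(32, 3, 64, 64), torch.rand(32, 5))

# Before: pin_memory=True page-locks host memory to speed up host-to-GPU copies.
loader = DataLoader(dataset, batch_size=8, pin_memory=True)

# After (this PR): pin_memory is omitted, falling back to PyTorch's default of
# False, trading some copy speed for compatibility on setups that hit the
# cuDNN internal error.
loader = DataLoader(dataset, batch_size=8)

for imgs, labels in loader:
    print(imgs.shape, labels.shape)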

Contributor

@github-actions bot left a comment


Hello @louis-she, thank you for submitting a PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • Verify your PR is up-to-date with origin/master. If your PR is behind origin/master, update by running the following, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f
  • Verify all Continuous Integration (CI) checks are passing.
  • Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@NanoCode012
Contributor

May I ask about the performance impact? The referenced article shows pinned memory performing up to twice as fast. What would be the result of this change?

@glenn-jocher
Member

@NanoCode012 that's a good question. I was wondering this myself. I tested COCO128 on Colab Pro (with V100), and saw no difference, but this is a small model. I'll test 1 epoch of COCO on a GCP VM and see.

@glenn-jocher
Member

glenn-jocher commented Nov 30, 2020

Timing results for the following command on a GCP V100 instance. Master appears to be about 4% faster based on this single-epoch test.

python train.py --data coco.yaml --weights yolov5m.pt --batch 40 --epochs 1

Master:

1 epochs completed in 0.361 hours, mAP 0.387  # experiment 1
1 epochs completed in 0.365 hours, mAP 0.389  # experiment 2

This PR:

1 epochs completed in 0.375 hours, mAP 0.390  # experiment 1
1 epochs completed in 0.374 hours, mAP 0.389  # experiment 2

@NanoCode012
Contributor

NanoCode012 commented Nov 30, 2020

I did my own tests too, on the latest Docker image.
Command:

python train.py --data coco.yaml --weights yolov5m.pt --epochs 2
         Epoch 1   Epoch 2   Total (incl. test)   mAP (epoch 2)
Master   27:17     24:55     0.907 h              0.595
PR       29:19     26:41     0.973 h              0.596

Around 7% longer for me.

@NanoCode012
Contributor

Interestingly, in both your results and mine, this PR bumped the mAP up by around 0.002. Maybe there is an observation to be made here?

@glenn-jocher
Member

@NanoCode012 hmm, that's really strange. I'm running a second experiment now to verify the results of my first.

@glenn-jocher
Member

glenn-jocher commented Nov 30, 2020

Updated results from the second experiment. The mAP seems essentially identical. The speed difference is real, about 3-4% on the first epoch here.

Master:

1 epochs completed in 0.361 hours, mAPs 0.583 0.387  # experiment 1
1 epochs completed in 0.365 hours, mAPs 0.586 0.389  # experiment 2

This PR:

1 epochs completed in 0.375 hours, mAPs 0.588 0.390  # experiment 1
1 epochs completed in 0.374 hours, mAPs 0.589 0.389  # experiment 2

@louis-she
Author

louis-she commented Nov 30, 2020 via email

@NanoCode012
Contributor

NanoCode012 commented Nov 30, 2020

Env: Docker
Commands:

python train.py --data coco.yaml --weights yolov5m.pt --epochs 2 #exp1
python train.py --data coco.yaml --weights yolov5m.pt --epochs 1 #exp2/3
           Epoch 1   Epoch 2   Total (incl. test)   mAP (final epoch)
Master 1   27:17     24:55     0.907 h              0.595
PR 1       29:19     26:41     0.973 h              0.596
Master 2   27:50     -         -                    -
PR 2       28:10     -         0.495 h              0.604
Master 3   26:43     -         0.469 h              0.604
PR 3       27:12     -         0.477 h              0.604

For some reason, my second experiment shows a shorter gap between the two. (I accidentally cleared the terminal output, so I lost the other values.)

@louis-she, maybe an environment problem is causing this for you, and this change is just a patch over it. Could you list your environment?

@louis-she
Author

louis-she commented Nov 30, 2020

@NanoCode012
Here's my environment:

  • System: Ubuntu 20.04
  • Python: 3.8.5
  • NVIDIA driver: installed with the apt package nvidia-driver-450
  • PyTorch: 1.7 (installed via conda)
  • GPU: RTX 2080 Ti

Indeed, it's an environment-related issue, since only a few people are facing this problem. Turning the pin_memory option off is just a workaround for them.

So should we close this PR and the issues as well?

@glenn-jocher changed the title from "FIX issue 1546 1547" to "FIX issue 1546 1547 1573" on Dec 2, 2020
@glenn-jocher
Member

@NanoCode012 issue #1573 was raised with this same problem today. If 3 users have had this problem, it might make sense to implement this change. Everyone will have 5% slower training, but perhaps a few percent of the users will have their problem resolved.

@glenn-jocher changed the title from "FIX issue 1546 1547 1573" to "RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR fix" on Dec 2, 2020
@louis-she
Author

@glenn-jocher I'm not sure this solution will solve the other users' issues. We should have them try this PR to confirm before merging.

@glenn-jocher
Member

@louis-she yes that's a good idea.

@NanoCode012
Contributor

issue #1573 was raised with this same problem today. If 3 users have had this problem, it might make sense to implement this change. Everyone will have 5% slower training, but perhaps a few percent of the users will have their problem resolved.

I see. I'm actually curious which change caused this. As mentioned in the linked issue,

I just pulled the latest commits to the repo and run the same training command that I've been using until yesterday but it doesn't work anymore after pulling the latest commits.

It could be a good idea to try to trace down which change caused it, although it would be time-consuming and we can't replicate the environment. A brute-force approach would be to check out various versions of the repo and narrow down when the error started to occur.

@louis-she, has the repo ever worked for you, or have you always had to apply this pin_memory fix for it to work?

@idenc
Contributor

idenc commented Dec 29, 2020

Changing pin_memory to False does not stop my CUDNN_STATUS_INTERNAL_ERROR from occurring.

@glenn-jocher
Member

glenn-jocher commented Dec 29, 2020

@idenc interesting, thanks. I think this error may appear when training close to the CUDA memory threshold. If you use a smaller --img or smaller --batch do you still see the error?

@idenc
Contributor

idenc commented Dec 29, 2020

Yes, the only fix I could find was reverting to an earlier revision.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot added the Stale label (Stale and scheduled for closing soon) on Jan 29, 2021
@github-actions bot closed this on Feb 4, 2021
Labels
Stale (Stale and scheduled for closing soon)
Projects
None yet
4 participants