RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR fix #1555
Conversation
Hello @louis-she, thank you for submitting a PR! To allow your work to be integrated as seamlessly as possible, we advise you to:
- Verify your PR is up-to-date with origin/master. If your PR is behind origin/master, update by running the following, replacing 'feature' with the name of your local branch:
```bash
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f
```
- Verify all Continuous Integration (CI) checks are passing.
- Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee
May I ask about the performance change? From that article, it shows that pinned memory speeds up host-to-device transfers, so removing it could cost some speed. |
@NanoCode012 that's a good question. I was wondering this myself. I tested COCO128 on Colab Pro (with V100), and saw no difference, but this is a small model. I'll test 1 epoch of COCO on a GCP VM and see. |
Timing results here for the following command on a GCP V100 instance. Master appears to be about 4% faster based on this single-epoch test.
```bash
python train.py --data coco.yaml --weights yolov5m.pt --batch 40 --epochs 1
```
Master:
```
1 epochs completed in 0.361 hours, mAP 0.387  # experiment 1
1 epochs completed in 0.365 hours, mAP 0.389  # experiment 2
```
This PR:
```
1 epochs completed in 0.375 hours, mAP 0.390  # experiment 1
1 epochs completed in 0.374 hours, mAP 0.389  # experiment 2
```
|
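For anyone who wants to sanity-check the pin_memory effect outside of YOLOv5, here is a minimal sketch, assuming PyTorch and a CUDA device (the dataset, sizes, and `time_loader` helper are illustrative, not part of this repo), that times host-to-device copies with the flag on and off:
```python
# Minimal timing sketch: compares DataLoader host-to-device copy time with
# pin_memory=True vs. False on a synthetic dataset. Illustrative only.
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_loader(pin_memory: bool) -> float:
    data = TensorDataset(torch.randn(2048, 3, 64, 64), torch.zeros(2048))
    loader = DataLoader(data, batch_size=64, num_workers=2, pin_memory=pin_memory)
    device = torch.device('cuda')
    torch.cuda.synchronize()
    t0 = time.time()
    for imgs, _ in loader:
        # non_blocking copies can only overlap with compute when the source is pinned
        imgs = imgs.to(device, non_blocking=True)
    torch.cuda.synchronize()
    return time.time() - t0


if __name__ == '__main__':  # guard required because num_workers > 0 spawns workers
    for flag in (True, False):
        print(f'pin_memory={flag}: {time_loader(flag):.3f} s')
```
Pinned memory mainly benefits asynchronous (non_blocking) transfers, which may be why the end-to-end training difference reported above is only a few percent.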
I did my own tests too, on the latest Docker image.
Around 7% longer for me. |
What's interesting is that in both your results and mine, this PR bumped up the mAP by around 0.002. Maybe there is an observation to be made here? |
@NanoCode012 hmm that's really strange. I'm running a second experiment to verify results of my first now. |
Updated results from the second experiment. The mAP seems to be essentially identical. The speed difference is real, about 3-4% on the first epoch here.
Master:
```
1 epochs completed in 0.361 hours, mAPs 0.583 0.387  # experiment 1
1 epochs completed in 0.365 hours, mAPs 0.586 0.389  # experiment 2
```
This PR:
```
1 epochs completed in 0.375 hours, mAPs 0.588 0.390  # experiment 1
1 epochs completed in 0.374 hours, mAPs 0.589 0.389  # experiment 2
```
|
Sorry, I can't test the performance, because with that option enabled the program crashes after a few iterations on my machine.
|
Env: Docker
For some reason, my second experiment shows a shorter gap between the two. (I accidentally erased the terminal output, so I lost the other values.) @louis-she, maybe there is an environment problem that is causing you to have this issue, and this change is just a patch over it. Maybe you can list your environment. |
@NanoCode012
Indeed, it's an environment-related issue, since only a few people are facing this problem. Turning pin_memory off seems to be just a patch for my environment. So should we close this PR and the issues as well? |
@NanoCode012 issue #1573 was raised with this same problem today. If 3 users have had this problem, it might make sense to implement this change. Everyone will have 5% slower training, but perhaps a few percent of users will have their problem resolved. |
@glenn-jocher I'm not sure whether this solution will solve the issue for the others. It's necessary to let them try this PR to confirm before merging. |
@louis-she yes that's a good idea. |
I see. I'm actually curious which change caused this. As mentioned in the linked issue, it could be a good idea to try to trace down which change caused it, although it would be time-consuming and we can't replicate the environment. A brute-force attempt would be to clone various versions of the repo and find out roughly when the error started to occur. @louis-she, has the repo ever worked for you, or have you always applied this change? |
Changing |
@idenc interesting, thanks. I think this error may appear when training close to the CUDA memory threshold. If you use a smaller batch size, the error may not appear. |
Yes, the only fix I could find is reverting to an earlier revision |
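As a side note on the memory-threshold theory above, here is a minimal sketch, assuming PyTorch with CUDA (the `report_cuda_memory` helper is hypothetical, not part of YOLOv5), for checking how much GPU memory headroom a run has before reaching for a smaller batch size:
```python
# Hypothetical helper: prints allocated/reserved/total GPU memory so you can see
# how close a training run is to the device's memory limit.
import torch


def report_cuda_memory(device: int = 0) -> None:
    total = torch.cuda.get_device_properties(device).total_memory
    reserved = torch.cuda.memory_reserved(device)    # memory held by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated(device)  # memory occupied by live tensors
    print(f'allocated: {allocated / 1e9:.2f} GB')
    print(f'reserved:  {reserved / 1e9:.2f} GB')
    print(f'total:     {total / 1e9:.2f} GB')
    print(f'headroom:  {(total - reserved) / 1e9:.2f} GB')


if __name__ == '__main__':
    if torch.cuda.is_available():
        report_cuda_memory()
```
Calling this once per epoch from the training loop would show whether a run is operating near the memory threshold mentioned above.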
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
The issue is not general and can only be reproduced in certain environments.
A possible reason for the bug could be the non-fixed size of the label tensors yielded from the dataloader, since the collate_fn concatenates a varying number of labels per batch (a sketch of this pattern follows the linked issues below). Reference: Pinned Host Memory
#1546
#1547
#1573
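To illustrate the collate pattern described above, here is a hedged sketch: a toy dataset and collate function standing in for YOLOv5's (names, shapes, and sizes are illustrative), showing how the label tensor's size changes from batch to batch while pin_memory=True copies every batch into page-locked host memory:
```python
# Toy detection-style dataset: each image has a different number of labels, so the
# collated label tensor has a different first dimension every batch. Illustrative only.
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDetectionDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, i):
        img = torch.randn(3, 64, 64)
        n = (i % 5) + 1              # number of objects differs per image
        labels = torch.zeros(n, 6)   # [image_index_in_batch, class, x, y, w, h]
        return img, labels


def collate_fn(batch):
    imgs, labels = zip(*batch)
    for k, lb in enumerate(labels):
        lb[:, 0] = k                 # record which image each label belongs to
    # Concatenating gives a label tensor whose size is not fixed across batches.
    return torch.stack(imgs, 0), torch.cat(labels, 0)


loader = DataLoader(ToyDetectionDataset(), batch_size=8,
                    collate_fn=collate_fn, pin_memory=True)
for imgs, labels in loader:
    print(imgs.shape, labels.shape)  # labels.shape[0] varies from batch to batch
```
This only illustrates the speculation in the comment above; it is not a confirmed root cause of the cuDNN error.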
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Improved DataLoader flexibility in YOLOv5.
📊 Key Changes
- Removed the pin_memory=True argument from DataLoader instantiation.
🎯 Purpose & Impact
- pin_memory may not always be necessary or beneficial depending on the user's hardware setup.
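For illustration, a self-contained sketch of the kind of change the summary describes. This is not the verbatim YOLOv5 diff; the real DataLoader call passes additional arguments (sampler, collate_fn, and so on):
```python
# Illustrative before/after: dropping pin_memory=True leaves the DataLoader at
# PyTorch's default (pin_memory=False).
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(32, 3, 64, 64), torch.zeros(32))

# Before this PR: batches are copied into page-locked (pinned) host memory.
loader_before = DataLoader(dataset, batch_size=8, pin_memory=True)

# After this PR: the argument is omitted, so pin_memory falls back to False.
loader_after = DataLoader(dataset, batch_size=8)

print(loader_before.pin_memory, loader_after.pin_memory)  # True False
```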