
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR fix #1555

Closed
wants to merge 1 commit

Conversation

louis-she

@louis-she commented Nov 30, 2020

The issue is not general and can only be reproduced in certain environments.

A possible cause of the bug is the non-fixed size of the labels yielded from the dataloader: the collate_fn produces a label tensor whose size changes with every batch (see the sketch below the linked issues).

reference: Pinned Host Memory

#1546
#1547
#1573
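
To illustrate the suspected pattern, here is a minimal, hypothetical sketch (not YOLOv5's actual dataloader code): the collate function concatenates per-image labels, so every batch yields a label tensor of a different size, and with pin_memory=True each of those variable-sized batches is copied into page-locked host memory.

# Hypothetical sketch: variable-sized label batches combined with pin_memory=True.
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDetectionDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, i):
        img = torch.rand(3, 64, 64)            # fixed-size image
        n = int(torch.randint(1, 10, (1,)))    # variable number of objects
        labels = torch.rand(n, 5)              # one (class, x, y, w, h) row per object
        return img, labels

def collate_fn(batch):
    imgs, labels = zip(*batch)
    # Images stack to a fixed shape, but the concatenated label tensor
    # has a different number of rows for every batch.
    return torch.stack(imgs, 0), torch.cat(labels, 0)

loader = DataLoader(ToyDetectionDataset(), batch_size=8,
                    collate_fn=collate_fn, pin_memory=True)

for imgs, labels in loader:
    print(imgs.shape, labels.shape)  # labels.shape[0] varies per batch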

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Improved DataLoader flexibility in YOLOv5.

📊 Key Changes

  • Removed the pin_memory=True argument from DataLoader instantiation (see the sketch below).

🎯 Purpose & Impact

  • This change is aimed at making the data loading process more adaptable, as pin_memory may not always be necessary or beneficial depending on the user's hardware setup.
  • Users might experience different memory usage characteristics, potentially reducing memory overhead on systems where pinning memory doesn't provide performance benefits.
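
For reference, a minimal before/after sketch of the change described above (illustrative only; the dataset and parameter values are hypothetical, and this is not the literal YOLOv5 diff):

# Illustrative sketch, not the actual YOLOv5 code.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.rand(32, 3, 64, 64), torch.rand(32, 5))

# Before: pin_memory=True page-locks host memory to speed up host-to-GPU copies.
loader = DataLoader(dataset, batch_size=8, pin_memory=True)

# After (this PR): pin_memory is omitted, falling back to PyTorch's default of
# False, trading some copy speed for compatibility on setups that hit the
# cuDNN internal error.
loader = DataLoader(dataset, batch_size=8)

for imgs, labels in loader:
    print(imgs.shape, labels.shape)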

Contributor

@github-actions bot left a comment


Hello @louis-she, thank you for submitting a PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • Verify your PR is up-to-date with origin/master. If your PR is behind origin/master, update by running the following, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git rebase upstream/master
git push -u origin -f
  • Verify all Continuous Integration (CI) checks are passing.
  • Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@NanoCode012
Contributor

May I ask about the performance impact? The referenced article shows pinned memory performing up to twice as fast. What would be the result of this change?

@glenn-jocher
Member

@NanoCode012 that's a good question. I was wondering this myself. I tested COCO128 on Colab Pro (with V100), and saw no difference, but this is a small model. I'll test 1 epoch of COCO on a GCP VM and see.

@glenn-jocher
Member

glenn-jocher commented Nov 30, 2020

Timing results for the following command on a GCP V100 instance. Master appears to be about 4% faster based on this single-epoch test.

python train.py --data coco.yaml --weights yolov5m.pt --batch 40 --epochs 1

Master:

1 epochs completed in 0.361 hours, mAP 0.387  # experiment 1
1 epochs completed in 0.365 hours, mAP 0.389  # experiment 2

This PR:

1 epochs completed in 0.375 hours, mAP 0.390  # experiment 1
1 epochs completed in 0.374 hours, mAP 0.389  # experiment 2

@NanoCode012
Contributor

NanoCode012 commented Nov 30, 2020

I did my own tests too, on the latest Docker image.
Command:

python train.py --data coco.yaml --weights yolov5m.pt --epochs 2
         Epoch 1   Epoch 2   Total (incl. test)   mAP (epoch 2)
Master   27:17     24:55     0.907 h              0.595
PR       29:19     26:41     0.973 h              0.596

Around 7% longer for me.

@NanoCode012
Contributor

Interestingly, in both your results and mine, this PR bumped the mAP up by around 0.002. Maybe there is an observation to be made here?

@glenn-jocher
Member

@NanoCode012 hmm, that's really strange. I'm running a second experiment now to verify the results of my first.

@glenn-jocher
Member

glenn-jocher commented Nov 30, 2020

Updated results from the second experiment. The mAP seems essentially identical. The speed difference is real, about 3-4% on the first epoch here.

Master:

1 epochs completed in 0.361 hours, mAPs 0.583 0.387  # experiment 1
1 epochs completed in 0.365 hours, mAPs 0.586 0.389  # experiment 2

This PR:

1 epochs completed in 0.375 hours, mAPs 0.588 0.390  # experiment 1
1 epochs completed in 0.374 hours, mAPs 0.589 0.389  # experiment 2

@louis-she
Author

louis-she commented Nov 30, 2020 via email

@NanoCode012
Contributor

NanoCode012 commented Nov 30, 2020

Env: Docker
Commands:

python train.py --data coco.yaml --weights yolov5m.pt --epochs 2 #exp1
python train.py --data coco.yaml --weights yolov5m.pt --epochs 1 #exp2/3
           Epoch 1   Epoch 2   Total (incl. test)   mAP (final epoch)
Master 1   27:17     24:55     0.907 h              0.595
PR 1       29:19     26:41     0.973 h              0.596
Master 2   27:50     -         -                    -
PR 2       28:10     -         0.495 h              0.604
Master 3   26:43     -         0.469 h              0.604
PR 3       27:12     -         0.477 h              0.604

For some reason, my second experiment shows a shorter gap between the two. (I accidentally cleared the terminal output, so I lost the other values.)

@louis-she, maybe an environment problem is causing this for you, and this change is just a patch over it. Could you list your environment?

@louis-she
Author

louis-she commented Nov 30, 2020

@NanoCode012
Here's my environment:

  • System: Ubuntu 20.04
  • Python: 3.8.5
  • NVIDIA driver: installed with the apt package nvidia-driver-450
  • PyTorch: 1.7 (installed via conda)
  • GPU: RTX 2080 Ti

Indeed, it's an environment-related issue, since only a few people are facing this problem. Turning the pin_memory option off is just a workaround for them.

So should we close this PR and the issues as well?

@glenn-jocher changed the title from "FIX issue 1546 1547" to "FIX issue 1546 1547 1573" on Dec 2, 2020
@glenn-jocher
Member

@NanoCode012 issue #1573 was raised with this same problem today. If 3 users have had this problem, it might make sense to implement this change. Everyone will have 5% slower training, but perhaps a few percent of the users will have their problem resolved.

@glenn-jocher changed the title from "FIX issue 1546 1547 1573" to "RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR fix" on Dec 2, 2020
@louis-she
Author

@glenn-jocher I'm not sure this solution will solve the other users' issues. We should have them try this PR to confirm before merging.

@glenn-jocher
Member

@louis-she yes that's a good idea.

@NanoCode012
Contributor

issue #1573 was raised with this same problem today. If 3 users have had this problem, it might make sense to implement this change. Everyone will have 5% slower training, but perhaps a few percent of the users will have their problem resolved.

I see. I'm actually curious which change caused this. As mentioned in the linked issue,

I just pulled the latest commits to the repo and run the same training command that I've been using until yesterday but it doesn't work anymore after pulling the latest commits.

It could be a good idea to try to trace down which change caused it, although it would be time-consuming and we can't replicate the environment. A brute-force approach would be to check out various versions of the repo and narrow down when the error started to occur.

@louis-she, has the repo ever worked for you, or have you always had to apply this pin_memory fix for it to work?

@idenc
Contributor

idenc commented Dec 29, 2020

Changing pin_memory to False does not stop my CUDNN_STATUS_INTERNAL_ERROR from occurring.

@glenn-jocher
Member

glenn-jocher commented Dec 29, 2020

@idenc interesting, thanks. I think this error may appear when training close to the CUDA memory threshold. If you use a smaller --img or smaller --batch do you still see the error?

@idenc
Contributor

idenc commented Dec 29, 2020

Yes, the only fix I could find was reverting to an earlier revision.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot added the Stale label (Stale and scheduled for closing soon) on Jan 29, 2021
@github-actions bot closed this on Feb 4, 2021
Labels
Stale (Stale and scheduled for closing soon)
Projects
None yet
4 participants