
The reason for NaN #12591

Closed
KwangryeolPark opened this issue Jan 7, 2024 · 9 comments
Labels
bug Something isn't working

Comments


KwangryeolPark commented Jan 7, 2024

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Like other issues report, I also see NaN values while training yolov5m on the COCO dataset, following the script in coco.yaml and the README.md.

I tried to figure out the reason for the NaNs and found a hint in an issue that is indirectly about AMP (Automatic Mixed Precision).

It makes sense that lower precision has a higher chance of producing NaN during casting because of underflow.

Therefore, I think many of the NaN problems come from AMP, so it may be better to use NVIDIA Apex, which shifts the value distribution (dynamic loss scaling) to prevent a mismatch with the reduced fp16 range.
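
To illustrate the underflow/overflow point, here is a standalone sketch (plain PyTorch, not YOLOv5 code) showing how casting float32 values outside float16's range produces zeros, infs and eventually NaNs:

import torch

# float16 cannot represent values below ~6e-8 (underflow to 0) or above 65504 (overflow to inf)
small = torch.tensor(1e-8)
print(small.half())                 # tensor(0., dtype=torch.float16)  -> underflow

big = torch.tensor(1e6)
print(big.half())                   # tensor(inf, dtype=torch.float16) -> overflow

# once an inf appears, ordinary arithmetic turns it into NaN, which then propagates
print(big.half() - big.half())      # tensor(nan, dtype=torch.float16)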

Environment

YOLOv5m
torch:1.12.1+cu116
python: 3.8.12
dataset: coco
optimizer: CAME
epochs: 300
batch size: 40

Minimal Reproducible Example

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0

I use the CAME optimizer with betas=(momentum, 0.999, 0.999).
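
For reference, this is roughly how I instantiate it; the came-pytorch package name and constructor signature below are assumptions and may differ in your environment:

# hypothetical sketch, not the actual train.py integration
import torch
from came_pytorch import CAME       # assumption: installed via pip install came-pytorch

model = torch.nn.Linear(10, 1)      # toy stand-in for the YOLOv5 model
momentum = 0.937                    # YOLOv5's default momentum hyperparameter, assumed as the first beta
optimizer = CAME(model.parameters(), lr=1e-3, betas=(momentum, 0.999, 0.999), weight_decay=5e-4)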

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
KwangryeolPark added the bug label on Jan 7, 2024

github-actions bot commented Jan 7, 2024

👋 Hello @KwangryeolPark, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

  • Notebooks with free GPU (Gradient, Colab, Kaggle)
  • Google Cloud Deep Learning VM (see the GCP Quickstart Guide)
  • Amazon Deep Learning AMI (see the AWS Quickstart Guide)
  • Docker Image (see the Docker Quickstart Guide)

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics
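
and then run a minimal inference example (the model name and image path below are placeholders):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # load a pretrained YOLOv8 nano checkpoint
results = model("path/to/image.jpg")  # run inference on a single image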

KwangryeolPark (Author) commented:

I hope you fix the mixed precision problem.

glenn-jocher (Member) commented:

@KwangryeolPark hello! Thanks for bringing this to our attention. NaNs during training can indeed sometimes be related to precision issues when using mixed precision training (AMP). However, there could be other factors at play, such as learning rate, weight initialization, or data preprocessing.

Regarding the use of NVIDIA apex, YOLOv5 uses PyTorch's native AMP implementation, which is generally recommended for its ease of use and integration. If you're experiencing NaNs with AMP, you might want to try the following:

  1. Reduce the learning rate.
  2. Increase the batch size if possible, as smaller batches can sometimes lead to instability with AMP.
  3. Ensure that your data preprocessing is correct and that there are no anomalies in the dataset.
  4. Experiment with different optimizers if the issue persists.
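
For context, here is a minimal sketch of how PyTorch's native AMP works (toy model and data, not the actual YOLOv5 training loop): the GradScaler scales the loss so fp16 gradients don't underflow, and it skips the optimizer step whenever inf/NaN gradients are detected.

import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(10, 1).to(device)                      # toy stand-in for the detection model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler(enabled=device == "cuda")            # disabled (no-op) on CPU

for _ in range(3):
    x = torch.randn(8, 10, device=device)
    y = torch.randn(8, 1, device=device)
    optimizer.zero_grad()
    with autocast(enabled=device == "cuda"):             # forward pass in mixed precision
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                        # scaled loss keeps gradients in fp16 range
    scaler.step(optimizer)                               # step is skipped if inf/NaN gradients are found
    scaler.update()                                      # loss-scale factor adapts over time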

If you're willing to submit a PR, we'd be happy to review any improvements or fixes you propose. Just make sure to thoroughly test your changes to ensure they're beneficial across various scenarios.

Remember to check out our documentation for more details on troubleshooting and best practices: https://docs.ultralytics.com/yolov5/

Thanks for your contribution to the YOLOv5 community! 🚀

KwangryeolPark (Author) commented:

@glenn-jocher Thank you for the answer.

To set the learning rate, I checked the Training Arguments documentation and found the lr0 argument. However, when I add --lr0 0.001, the script reports: train.py: error: unrecognized arguments: --lr0 1e-3.

glenn-jocher (Member) commented:

Apologies for the confusion, @KwangryeolPark. The correct argument for setting the initial learning rate in the YOLOv5 training script is --lr. So, if you want to set the initial learning rate to 0.001, you should use the following command:

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0 --lr 0.001

Make sure to adjust the learning rate according to your specific needs and keep an eye on the training process to ensure stability. If you have any further questions or issues, don't hesitate to reach out. Happy training! 🚀

KwangryeolPark (Author) commented:

@glenn-jocher Thank you for the guidance. However, the --lr 0.001 argument also fails: train.py: error: unrecognized arguments: --lr 0.001

glenn-jocher (Member) commented:

I apologize for the oversight, @KwangryeolPark. In YOLOv5, the learning rate is set in the hyperparameter configuration file rather than as a command-line argument. You can adjust the learning rate by editing the hyperparameter file you are using (for example, data/hyps/hyp.scratch-low.yaml in current versions of the repository).

For example, to set the initial learning rate to 0.001, you would modify the lr0 value in your hyperparameter file like so:

lr0: 0.001  # initial learning rate

Then, you can reference this hyperparameter file during training using the --hyp argument:

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0 --hyp your_hyperparameter_file.yaml
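
As an optional sanity check (plain PyYAML, not part of YOLOv5 itself), you can confirm that the edited file parses and carries the new value before launching training:

import yaml

with open("your_hyperparameter_file.yaml") as f:   # same placeholder path as in the command above
    hyp = yaml.safe_load(f)
print(hyp["lr0"])                                   # expect 0.001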

Replace your_hyperparameter_file.yaml with the path to your edited hyperparameter file. This should correctly set the initial learning rate for your training session. If you encounter any further issues, please let us know. Good luck with your training! 🌟

KwangryeolPark (Author) commented:

Thank you

glenn-jocher (Member) commented:

You're welcome, @KwangryeolPark! If you have any more questions or need further assistance in the future, feel free to reach out. Best of luck with your YOLOv5 training! Happy detecting! 🚀👀
