The reason for NaN #12591

KwangryeolPark · 2024-01-07T12:02:18Z

Search before asking

I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

As in other issues, I also see NaN values when training YOLOv5m on the COCO dataset, following the script in coco.yaml and README.md.

I tried to figure out the reason for the NaN and found a hint in an issue that is indirectly about AMP (Automatic Mixed Precision).

It makes sense that low precision has a higher chance of producing NaN during casting because of underflow.

Therefore, I think many of the NaN problems come from AMP, so it looks better to use NVIDIA apex, which uses distribution shift to prevent distribution mismatch.
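To make the underflow argument concrete, here is a minimal standard-library sketch of what a cast to half precision does to very small and very large values. The helper `to_fp16` and the sample numbers are illustrative assumptions, not code from YOLOv5 or apex. The last lines show the general idea of loss scaling, which PyTorch's native AMP also applies through `torch.cuda.amp.GradScaler`; the scale factor used here is only an example.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range;
        # a hardware fp16 cast would produce +/-inf instead.
        return math.copysign(math.inf, x)


tiny_grad = 1e-8               # hypothetical small gradient value
print(to_fp16(tiny_grad))      # 0.0: the value underflows to zero

big = to_fp16(70000.0)         # above the largest finite fp16 value (65504)
print(big, big - big)          # inf, then NaN from inf - inf

# Loss scaling (illustrative factor): scale up before the cast,
# then divide by the same factor in full precision afterwards.
scale = 2.0 ** 16
recovered = to_fp16(tiny_grad * scale) / scale
print(recovered)               # close to 1e-8 instead of 0.0
```

Note that underflow by itself yields zeros rather than NaN; NaN appears once an operation such as inf - inf or 0 * inf is applied to the damaged values.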

Environment

YOLOv5m
torch:1.12.1+cu116
python: 3.8.12
dataset: coco
optimizer: CAME
epochs: 300
batch size: 40

Minimal Reproducible Example

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0

I use the CAME optimizer with betas=(momentum, 0.999, 0.999).

Additional

No response

Are you willing to submit a PR?

Yes I'd like to help by submitting a PR!


github-actions · 2024-01-07T12:02:41Z

👋 Hello @KwangryeolPark, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Introducing YOLOv8 🚀

We're excited to announce the launch of our latest state-of-the-art (SOTA) object detection model for 2023 - YOLOv8 🚀!

Designed to be fast, accurate, and easy to use, YOLOv8 is an ideal choice for a wide range of object detection, image segmentation and image classification tasks. With YOLOv8, you'll be able to quickly and accurately detect objects in real-time, streamline your workflows, and achieve new levels of accuracy in your projects.

Check out our YOLOv8 Docs for details and get started with:

pip install ultralytics

KwangryeolPark · 2024-01-07T12:02:53Z

I hope you fix the mixed precision problem.

glenn-jocher · 2024-01-07T18:52:31Z

@KwangryeolPark hello! Thanks for bringing this to our attention. NaNs during training can indeed sometimes be related to precision issues when using mixed precision training (AMP). However, there could be other factors at play, such as learning rate, weight initialization, or data preprocessing.

Regarding the use of NVIDIA apex, YOLOv5 uses PyTorch's native AMP implementation, which is generally recommended for its ease of use and integration. If you're experiencing NaNs with AMP, you might want to try the following:

Reduce the learning rate.
Increase the batch size if possible, as smaller batches can sometimes lead to instability with AMP.
Ensure that your data preprocessing is correct and that there are no anomalies in the dataset.
Experiment with different optimizers if the issue persists.
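For the data check in the list above, a rough sketch of a label sanity pass might look like the following. It assumes detection labels in the usual YOLO text layout (one `class x_center y_center width height` row per object, coordinates normalized to [0, 1]). The function names `check_label_line` and `scan_labels` are hypothetical and not part of YOLOv5.

```python
import math
from pathlib import Path
from typing import Dict, List


def check_label_line(line: str) -> List[str]:
    """Return the problems found in one YOLO-format label line (empty if none)."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        cls, *box = (float(p) for p in parts)
    except ValueError:
        return ["non-numeric field"]
    problems = []
    if not cls.is_integer() or cls < 0:
        problems.append("class id is not a non-negative integer")
    if not all(math.isfinite(v) for v in box):
        problems.append("non-finite coordinate")
    elif not all(0.0 <= v <= 1.0 for v in box):
        problems.append("coordinate outside [0, 1]")
    elif box[2] <= 0.0 or box[3] <= 0.0:
        problems.append("zero-area box")
    return problems


def scan_labels(label_dir: str) -> Dict[str, List[str]]:
    """Map each label file that has problems to a list of 'line N: problem' strings."""
    report: Dict[str, List[str]] = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if line.strip():
                for problem in check_label_line(line):
                    report.setdefault(str(path), []).append(f"line {lineno}: {problem}")
    return report
```

Running `scan_labels` on a labels directory before training gives a quick list of files worth inspecting; an empty result does not rule out other data issues such as corrupt images.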

If you're willing to submit a PR, we'd be happy to review any improvements or fixes you propose. Just make sure to thoroughly test your changes to ensure they're beneficial across various scenarios.

Remember to check out our documentation for more details on troubleshooting and best practices: https://docs.ultralytics.com/yolov5/

Thanks for your contribution to the YOLOv5 community! 🚀

KwangryeolPark · 2024-01-08T04:38:33Z

@glenn-jocher Thank you for the answer.

To set the learning rate, I looked at the Training Arguments and found the lr0 argument. However, when I add --lr0 1e-3 (that is, 0.001), the script shows train.py: error: unrecognized arguments: --lr0 1e-3.

glenn-jocher · 2024-01-08T07:07:04Z

Apologies for the confusion, @KwangryeolPark. The correct argument for setting the initial learning rate in the YOLOv5 training script is --lr. So, if you want to set the initial learning rate to 0.001, you should use the following command:

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0 --lr 0.001

Make sure to adjust the learning rate according to your specific needs and keep an eye on the training process to ensure stability. If you have any further questions or issues, don't hesitate to reach out. Happy training! 🚀

KwangryeolPark · 2024-01-08T07:45:08Z

@glenn-jocher Thank you for the guidance. However, the --lr 0.001 argument also produces an error: train.py: error: unrecognized arguments: --lr 0.001

glenn-jocher · 2024-01-08T15:47:06Z

I apologize for the oversight, @KwangryeolPark. In YOLOv5, the learning rate is set in the hyperparameter configuration file rather than as a command-line argument. You can adjust the learning rate by editing the hyperparameter file you are using, for example data/hyps/hyp.scratch-low.yaml in recent YOLOv5 versions (older releases shipped a single hyp.scratch.yaml).

For example, to set the initial learning rate to 0.001, you would modify the lr0 value in your hyperparameter file like so:

lr0: 0.001  # initial learning rate

Then, you can reference this hyperparameter file during training using the --hyp argument:

python train.py --data coco.yaml --epochs 300 --weights '' --cfg yolov5m.yaml --batch-size 40 --optimizer CAME --device 0 --hyp your_hyperparameter_file.yaml

Replace your_hyperparameter_file.yaml with the path to your edited hyperparameter file. This should correctly set the initial learning rate for your training session. If you encounter any further issues, please let us know. Good luck with your training! 🌟
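If you would rather not edit the file by hand, the steps above can be sketched as a small helper that copies a hyperparameter file and changes one value. This is only a sketch under the assumption that the file is a flat list of `key: value  # comment` lines; `override_hyp` and the file names in the usage comment are hypothetical, and the code uses plain text replacement rather than a YAML parser so that comments are preserved.

```python
import re
from pathlib import Path


def override_hyp(src: str, dst: str, key: str, value: float) -> None:
    """Copy a flat `key: value  # comment` file, replacing the value of one key."""
    pattern = re.compile(rf"^({re.escape(key)}:\s*)(\S+)(.*)$")
    found = False
    out_lines = []
    for line in Path(src).read_text().splitlines():
        match = pattern.match(line)
        if match:
            # Keep the key and any trailing comment, swap only the value.
            line = f"{match.group(1)}{value}{match.group(3)}"
            found = True
        out_lines.append(line)
    if not found:
        raise KeyError(f"{key!r} not found in {src}")
    Path(dst).write_text("\n".join(out_lines) + "\n")


# Example usage (paths are assumptions, adjust to your checkout):
# override_hyp("data/hyps/hyp.scratch-low.yaml", "hyp.lr0-0.001.yaml", "lr0", 0.001)
# python train.py ... --hyp hyp.lr0-0.001.yaml
```

Writing to a new file leaves the original hyperparameter file untouched, which makes it easier to compare runs with different lr0 values.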

KwangryeolPark · 2024-01-08T15:48:17Z

Thank you

glenn-jocher · 2024-01-08T17:14:15Z

You're welcome, @KwangryeolPark! If you have any more questions or need further assistance in the future, feel free to reach out. Best of luck with your YOLOv5 training! Happy detecting! 🚀👀

KwangryeolPark added the "bug" (Something isn't working) label on Jan 7, 2024

KwangryeolPark closed this as completed Jan 8, 2024
