Replies: 3 comments 5 replies
-
👋 Hello @MarcoMagl, thank you for reaching out to the Ultralytics community 🚀! We understand that the pre-processing phase can be time-consuming and appreciate your detailed description. For performance issues, especially those related to data transfer and caching, please provide a minimum reproducible example if possible; this will greatly assist our team in diagnosing the issue. In the meantime, check out our Docs for guidance on optimizing the data pipeline, and review our Tips for Best Training Results for applicable strategies. To stay connected and exchange ideas with other users, join our Discord community 🎧. For more detailed discussions, you can also explore our Discourse forum or our Subreddit.

Upgrade: Ensure you are using the latest release with `pip install -U ultralytics`, ideally in a Python>=3.8 environment with PyTorch>=1.8.

Environments: Consider running YOLOv8 in one of our verified environments to potentially enhance pre-processing speeds:
Status: This is an automated message; an Ultralytics engineer will assist you soon 😊. Thank you for your patience and cooperation!
-
@MarcoMagl to improve data transfer speed, consider using faster storage solutions or optimizing your data pipeline. You might also try reducing image size or format for quicker loading. For more detailed profiling, tools like PyTorch's built-in profiler could be helpful. If you need further assistance, our documentation at https://docs.ultralytics.com/guides/yolo-common-issues/ may offer additional insights.
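Before reaching for a full profiler, a quick stdlib-only harness can show which stage dominates the startup time. This is just a sketch, not an Ultralytics API: `read_fn`, `decode_fn`, and `transfer_fn` are hypothetical stand-ins for reading a file from storage, decoding it, and copying it to the GPU.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall time of the enclosed block under `label`."""
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start

def profile_pipeline(read_fn, decode_fn, transfer_fn, items):
    """Run each stage over all `items` and return per-stage seconds.

    Separating the stages makes it obvious whether the bottleneck is
    disk I/O (read), CPU work (decode), or the host-to-device copy.
    """
    results = {}
    with timed("read", results):
        raw = [read_fn(i) for i in items]
    with timed("decode", results):
        imgs = [decode_fn(r) for r in raw]
    with timed("transfer", results):
        for img in imgs:
            transfer_fn(img)
    return results
```

Plugging in your actual image paths and loader lets you compare, say, reads from the shared filesystem against reads from `/dev/shm` stage by stage.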
-
Hi @glenn-jocher, thanks for answering.
-
Hello to everybody,
I am a bit stuck on an investigation I am doing.
When launching YOLO (whether via the software stack on our HPC without a venv, via a conda environment, or via the ultralytics container), the steps prior to training startup are quite long.
In particular, I am worried about the image caching step, which takes a long time.
I have around 1700 images, and it takes more than one minute to move the data from storage to the RAM of the GPU I am using. I also tried first moving the data from storage to the node's local memory (/dev/shm), but it did not help.
I get around 25 to 30 it/s when moving data. Checking the I/O with htop also shows that the writing processes are quite slow.
Does anyone have experience profiling this startup phase? I tried the Scalene profiler, but it did not help.
Also, has anyone found tricks in the hyperparameter settings to speed up this data transfer without compromising the subsequent training? I tried different numbers of workers, but it did not help.
Links to related issues would also be greatly appreciated.
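In case it is useful to anyone hitting the same wall: a small stand-alone sweep over worker counts can tell you whether the loading side scales at all, independent of the training framework. This is a hypothetical sketch using only the stdlib; `load_fn` stands in for whatever reads and decodes one image, and the worker counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sweep_workers(load_fn, items, worker_counts=(1, 2, 4, 8)):
    """Measure throughput (items/s) of `load_fn` over `items` for each
    worker count. If throughput plateaus early, the bottleneck is likely
    storage bandwidth or the GIL-bound decode step, not the worker count."""
    results = {}
    for n in worker_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            # Drain the iterator so all work actually completes.
            list(pool.map(load_fn, items))
        elapsed = time.perf_counter() - start
        results[n] = len(items) / elapsed if elapsed > 0 else float("inf")
    return results
```

If the numbers barely move between 1 and 8 workers, that matches my observation that changing the workers hyperparameter does not help, and points at I/O rather than parallelism.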