
Update Objects365.yaml to include the official validation set #5194

Merged: 3 commits into ultralytics:master on Oct 15, 2021

Conversation

@farleylai (Contributor) commented Oct 15, 2021

Include the official Objects365 validation set in the download and convert its labels.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced Objects365.yaml to support both training and validation data splits.

📊 Key Changes

  • 🔄 Implemented separate code paths for handling 'train' and 'val' data splits.
  • ➕ Added download links and processes for the validation dataset.
  • ⬇️ Automated downloading of annotation files and images.
  • 🚚 Streamlined moving of images to their corresponding directory after download.
  • 📝 Adapted label generation code to work with both training and validation images.

🎯 Purpose & Impact

  • 🔍 The changes allow easier access to both the training and validation splits of the Objects365 dataset, facilitating proper model evaluation (a minimal sketch of the new flow follows this summary).
  • 🤖 Users of the yolov5 repository can now expect more streamlined download and setup processes for using the Objects365 dataset, potentially leading to more robust model training and validation.
  • 👌 This update simplifies the usability of the dataset setup scripts, potentially increasing adoption and improving user experience.
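
To make the key changes above concrete, here is a minimal sketch of the split-aware download flow this PR describes (the base URL, file names, and helper are illustrative placeholders, not the exact contents of Objects365.yaml; the patch counts match the listings later in this thread):

from pathlib import Path

BASE_URL = 'https://example.com/objects365/'   # placeholder, not the official mirror
PATCHES = {'train': 51, 'val': 44}             # patch tarballs per split (patch0..patch50 and patch0..patch43)

def download_objects365(root='../datasets/Objects365'):
    root = Path(root)
    for split, n in PATCHES.items():
        images, labels = root / 'images' / split, root / 'labels' / split
        images.mkdir(parents=True, exist_ok=True)
        labels.mkdir(parents=True, exist_ok=True)
        urls = [f'{BASE_URL}{split}/annotations.json']                    # split annotations (name assumed)
        urls += [f'{BASE_URL}{split}/patch{i}.tar.gz' for i in range(n)]  # image patch tarballs
        # 1) download each URL (curl with retries, resumable via -C -) and extract the tarballs
        # 2) move every extracted .jpg into images/<split>
        # 3) convert the split's COCO-style annotations into YOLO txt files under labels/<split>
        ...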

Download the official Objects365 validation set and convert the labels
@github-actions bot left a comment


👋 Hello @farleylai, thank you for submitting a 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with origin/master. If your PR is behind origin/master an automatic GitHub actions rebase may be attempted by including the /rebase command in a comment body, or by running the following code, replacing 'feature' with the name of your local branch:
git remote add upstream https://github.com/ultralytics/yolov5.git
git fetch upstream
git checkout feature  # <----- replace 'feature' with local branch name
git merge upstream/master
git push -u origin -f
  • ✅ Verify all Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." -Bruce Lee

@farleylai (Contributor, Author) commented Oct 15, 2021

dataset_stats shows the scanning stats below.
Are those corrupted instances expected, and do the dataset/labels remain useful?

Scanning '../datasets/Objects365/labels/train' images and labels...1615704 found, 0 missing, 0 empty, 102543 corrupted: 100%|█| 1615712/1615712 [47:59<00:00, 56
Scanning '../datasets/Objects365/labels/val' images and labels...80000 found, 0 missing, 0 empty, 7312 corrupted: 100%|██| 80000/80000 [02:56<00:00, 452.57it/s]
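
For context, dataset_stats is the yolov5 helper that produced the scan above; a minimal invocation sketch (the module path and keyword reflect the yolov5 code of that era and are an assumption here):

from utils.datasets import dataset_stats  # run from the yolov5 repo root; module name circa v6.0 (assumption)

dataset_stats('data/Objects365.yaml', verbose=True)  # scans each split and reports found/missing/empty/corrupt counts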

@glenn-jocher (Member)
@farleylai thanks for the PR! I've cleaned it up a bit without changing the functionality.

I don't have the dataset downloaded currently, so I'm not sure what's normal, but 0.1M/1.6M seems like an excessive fraction of corrupted images. Typically datasets might have a few problem images that fall into this category, but usually these are <<1% of the total.

I should mention that the one time we downloaded Objects365 before, we had to restart the download script multiple times due to incomplete downloads. The curl commands should be retry-friendly, so they should recognize partially downloaded files and resume downloading where they left off.

What do the error messages say on most of your corrupted images specifically?

@glenn-jocher merged commit fc36064 into ultralytics:master on Oct 15, 2021
@glenn-jocher (Member)
@farleylai PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐

@farleylai (Contributor, Author) commented Oct 15, 2021

I checked the warning messages in the cache and they all say something like "../datasets/Objects365/images/val/objects365_v2_01926665.jpg: non-normalized or out of bounds coordinate labels".

Take objects365_v2_01926665.jpg for example.
The image info and labels are as follows:

$> file ../datasets/Objects365/images/val/objects365_v2_01926665.jpg
../datasets/Objects365/images/val/objects365_v2_01926665.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 1024x748, frames 3
$> cat ../datasets/Objects365/labels/val/objects365_v2_01926665.txt
9 0.50000 0.80787 1.00000 0.38427
95 0.68296 0.13299 0.11006 0.13893
117 0.49981 0.58072 1.00039 0.54321
209 0.54939 0.10846 0.13147 0.17548

Tracing back to the original COCO-format annotation shows the out-of-bounds coordinates:

[{'id': 24272310, 'iscrowd': 1, 'isfake': 0, 'area': 416239.1306653306, 'isreflected': 0, 'bbox': [-0.3929443328, 231.2183837524, 1024.4027099136, 406.32373053800006], 'image_id': 1926665, 'category_id': 118}]

It is now clear those coordinates must be clamped before normalization.
I will try to submit the fix in a follow-up PR.

@glenn-jocher (Member) commented Oct 15, 2021

@farleylai ok got it. We want to clip these as xyxy labels then, and afterward convert to xywh. We do this for xView dataset as well using xyxy2xywhn(..., clip=True) (n stands for normalized), otherwise the same issue happens there with slight out of bounds coordinates causing errors:

box = xyxy2xywhn(box[None].astype(np.float), w=shapes[id][0], h=shapes[id][1], clip=True)

@glenn-jocher (Member) commented Oct 15, 2021

The function only accepts nx4 NumPy arrays or torch tensors, so we probably want to do something like this (a fuller standalone sketch follows the function source below):

xyxy = np.array([x, y, x+w, y+h]).reshape(1,4)  # pixels
xywhn_clipped = xyxy2xywhn(xyxy, w=image_width, h=image_height, clip=True)  # normalized and clipped

yolov5/utils/general.py

Lines 535 to 545 in fc36064

def xyxy2xywhn(x, w=640, h=640, clip=False, eps=0.0):
    # Convert nx4 boxes from [x1, y1, x2, y2] to [x, y, w, h] normalized where xy1=top-left, xy2=bottom-right
    if clip:
        clip_coords(x, (h - eps, w - eps))  # warning: inplace clip
    y = x.clone() if isinstance(x, torch.Tensor) else np.copy(x)
    y[:, 0] = ((x[:, 0] + x[:, 2]) / 2) / w  # x center
    y[:, 1] = ((x[:, 1] + x[:, 3]) / 2) / h  # y center
    y[:, 2] = (x[:, 2] - x[:, 0]) / w  # width
    y[:, 3] = (x[:, 3] - x[:, 1]) / h  # height
    return y
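
For illustration, here is a standalone NumPy sketch of the same clip-then-normalize step applied to the out-of-bounds annotation quoted earlier (objects365_v2_01926665.jpg is 1024x748 pixels; the helper name below is illustrative, not part of yolov5):

import numpy as np

def xyxy2xywhn_clipped(x, w, h):
    # clip pixel xyxy boxes to the image, then convert to normalized xywh
    x = np.array(x, dtype=float)
    x[:, [0, 2]] = x[:, [0, 2]].clip(0, w)  # clip x1, x2 to [0, w]
    x[:, [1, 3]] = x[:, [1, 3]].clip(0, h)  # clip y1, y2 to [0, h]
    y = np.empty_like(x)
    y[:, 0] = (x[:, 0] + x[:, 2]) / 2 / w   # x center
    y[:, 1] = (x[:, 1] + x[:, 3]) / 2 / h   # y center
    y[:, 2] = (x[:, 2] - x[:, 0]) / w       # width
    y[:, 3] = (x[:, 3] - x[:, 1]) / h       # height
    return y

# COCO-style [x, y, w, h] bbox from the annotation above; image size is 1024x748
bx, by, bw, bh = -0.3929443328, 231.2183837524, 1024.4027099136, 406.32373053800006
xyxy = [[bx, by, bx + bw, by + bh]]
print(xyxy2xywhn_clipped(xyxy, w=1024, h=748))  # ~[[0.5, 0.5807, 1.0, 0.5432]], all within [0, 1]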

@glenn-jocher (Member)
@farleylai I just downloaded the dataset over the last few hours on an EC2 instance. I seem to have downloaded everything successfully, though I can't tell whether all of the patches succeeded in downloading and unzipping; unfortunately we haven't really built any overall reporting into this. My stats look a little larger than yours though: I have 1742289 training images, so it's likely you may be missing a few training patches. The download occupies about 700 GB on my hard drive (including undeleted zips).

train: Scanning '../datasets/Objects365/labels/train' images and labels...1742289 found, 0 missing, 0 empty, 111771 corrupted: 100%|███████| 1742289/1742289 [09:21<00:00, 3101.89it/s]
val: Scanning '../datasets/Objects365/labels/val' images and labels...80000 found, 0 missing, 0 empty, 7312 corrupted: 100%|███████| 80000/80000 [00:06<00:00, 12564.88it/s]

I also see a significant number of 'corrupted' messages due to duplicate labels. These are just two identical rows (same class, same exact coordinates). We handle these as errors rather than warnings, so the image will be rejected if any of these occur:

train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051762.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051841.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051849.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051921.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051929.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052000.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052013.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052022.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052046.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052076.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052150.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052158.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052161.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052200.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052205.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052225.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052235.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052255.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052318.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052417.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052420.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052444.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052472.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052486.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052490.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052528.jpg: duplicate labels

@glenn-jocher (Member)
Anecdotally everything seems fine; boxes appear correct. It's very interesting that the class indices seem to be ordered by frequency; I've never seen that before.

[Attached images: train_batch1, labels, labels_correlogram.]

@farleylai (Contributor, Author)
I do not have enough bandwidth to complete the full training-set download within a few hours, and some patches may have failed earlier on. I will see if I can reproduce the same numbers as yours soon.

As for those rejected labels, it is indeed an issue that must be fixed before this dataset is useful. Hopefully the PR can be submitted within a day.

@glenn-jocher (Member)
@farleylai ok! Don't worry about the duplicate labels; I'll push a separate PR to handle those better inside the dataset checks. I think I'll convert them from errors to warnings and fix them automatically.

@glenn-jocher (Member)
@farleylai duplicate labels are now handled automatically (warning + fixed automatically) and all other label errors feature improved introspection to point out to users the exact values causing the problem. See PR #5210, merged into master now.

@farleylai (Contributor, Author)
Glad those duplicate labels can be safely handled.
I also confirmed that several incompletely downloaded patch tarballs were identified with tar -tzf, and those can be resumed with curl -L -O -C -. Nonetheless, given that download(...) already includes --retry 9 -C -, the failures may be attributed to temporary network instability. Perhaps incomplete download events should be recorded and reported for later retry/recovery.
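
A minimal sketch of such recording and recovery, using Python's tarfile to detect truncated patch archives (the tar -tzf check above) and curl -L -O -C - to resume them; the directory and base URL are placeholders, and this is not part of yolov5's download() helper:

import subprocess
import tarfile
from pathlib import Path

def verify_and_resume(tar_dir, base_url):
    failed = []
    for tgz in sorted(Path(tar_dir).glob('patch*.tar.gz')):
        try:
            with tarfile.open(tgz) as t:  # equivalent to `tar -tzf`; truncated archives raise here
                t.getmembers()
        except (tarfile.TarError, EOFError):
            failed.append(tgz.name)
            # resume the partial download where it left off
            subprocess.run(['curl', '-L', '-O', '-C', '-', f'{base_url}/{tgz.name}'], cwd=tar_dir, check=False)
    return failed  # record/report the patches that needed recovery

# Example (base_url is a placeholder, not the official mirror):
# bad = verify_and_resume('../datasets/Objects365/images/train', 'https://example.com/objects365/train')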

@glenn-jocher (Member)
@farleylai yeah, it would be useful to capture and report download failures as part of download.py. It would also be much better if there were a way to print to screen more clearly when doing multithreaded downloads; that by itself would help spot failed downloads. Right now the 8 threads print over each other and the resulting terminal window becomes unreadable. I don't know an easy fix, but I know it's possible, e.g. Docker does this when downloading from multiple sources.

@farleylai (Contributor, Author) commented Oct 15, 2021

After those patches were finally downloaded in full, there seems to be only a one-image difference from your download in the training set. However, after moving the jpg images into the same directory, the total training-set image count becomes 1742289, which could imply a duplicate image stored in one of the archives. This can be a bit nontrivial to identify (see the sketch after the patch listings below).

Per-patch tarball image counts:

Training set with 1742290 (1742289 distinct) images in total:

$> find train -mindepth 1 -type f -name "patch*.tar.gz" -print0 | xargs -0 -I {} sh -c "echo {}'\t'\`tar -tzf {} | grep jpg | wc -l\`"
train/patch46.tar.gz	36000
train/patch40.tar.gz	34366
train/patch19.tar.gz	34369
train/patch2.tar.gz	34915
train/patch20.tar.gz	34337
train/patch10.tar.gz	34529
train/patch31.tar.gz	34314
train/patch18.tar.gz	34374
train/patch11.tar.gz	34053
train/patch36.tar.gz	34364
train/patch7.tar.gz	34565
train/patch50.tar.gz	34357
train/patch22.tar.gz	34354
train/patch35.tar.gz	34288
train/patch45.tar.gz	19437
train/patch0.tar.gz	34797
train/patch4.tar.gz	34520
train/patch32.tar.gz	34424
train/patch9.tar.gz	34758
train/patch17.tar.gz	34422
train/patch14.tar.gz	33156
train/patch38.tar.gz	34468
train/patch15.tar.gz	31477
train/patch21.tar.gz	34391
train/patch5.tar.gz	34468
train/patch28.tar.gz	34404
train/patch49.tar.gz	36000
train/patch39.tar.gz	34366
train/patch30.tar.gz	34450
train/patch23.tar.gz	34346
train/patch47.tar.gz	36000
train/patch37.tar.gz	34377
train/patch8.tar.gz	34555
train/patch48.tar.gz	36000
train/patch41.tar.gz	34341
train/patch43.tar.gz	34829
train/patch13.tar.gz	32938
train/patch27.tar.gz	34435
train/patch16.tar.gz	34341
train/patch42.tar.gz	34349
train/patch44.tar.gz	36040
train/patch25.tar.gz	34454
train/patch34.tar.gz	34310
train/patch12.tar.gz	32891
train/patch24.tar.gz	34410
train/patch26.tar.gz	34357
train/patch6.tar.gz	34497
train/patch1.tar.gz	34722
train/patch33.tar.gz	34444
train/patch29.tar.gz	34494
train/patch3.tar.gz	34437

Validation set with 80000 images in total:

$> find val -mindepth 1 -type f -name "patch*.tar.gz" -print0 | xargs -0 -I {} sh -c "echo {}'\t'\`tar -tzf {} | grep jpg | wc -l\`"
val/patch40.tar.gz	1807
val/patch19.tar.gz	1851
val/patch2.tar.gz	1246
val/patch20.tar.gz	1778
val/patch10.tar.gz	1570
val/patch31.tar.gz	1887
val/patch18.tar.gz	1828
val/patch11.tar.gz	2186
val/patch36.tar.gz	1766
val/patch7.tar.gz	1641
val/patch22.tar.gz	1805
val/patch35.tar.gz	1847
val/patch0.tar.gz	1311
val/patch4.tar.gz	1595
val/patch32.tar.gz	1789
val/patch9.tar.gz	1501
val/patch17.tar.gz	1769
val/patch14.tar.gz	3115
val/patch38.tar.gz	1762
val/patch15.tar.gz	1038
val/patch21.tar.gz	1798
val/patch5.tar.gz	1679
val/patch28.tar.gz	1723
val/patch39.tar.gz	1761
val/patch30.tar.gz	1779
val/patch23.tar.gz	1826
val/patch37.tar.gz	1780
val/patch8.tar.gz	1705
val/patch41.tar.gz	1828
val/patch43.tar.gz	1293
val/patch13.tar.gz	3392
val/patch27.tar.gz	1740
val/patch16.tar.gz	1789
val/patch42.tar.gz	1840
val/patch25.tar.gz	1755
val/patch34.tar.gz	1889
val/patch12.tar.gz	3564
val/patch24.tar.gz	1748
val/patch26.tar.gz	1758
val/patch6.tar.gz	1747
val/patch1.tar.gz	1249
val/patch33.tar.gz	1785
val/patch29.tar.gz	1709
val/patch3.tar.gz	1771
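
To hunt for the suspected duplicate without extracting everything, here is a sketch that counts jpg member names across the patch tarballs (it assumes the duplicate keeps the same file name in two archives and that the tarballs are still on disk at the assumed path; a duplicate stored under a different name would require hashing the image contents instead):

import tarfile
from collections import Counter
from pathlib import Path

names = Counter()
for tgz in Path('../datasets/Objects365/images/train').glob('patch*.tar.gz'):  # tarball location assumed
    with tarfile.open(tgz) as t:
        names.update(Path(m.name).name for m in t.getmembers() if m.name.endswith('.jpg'))

dupes = {n: c for n, c in names.items() if c > 1}
print(f'{sum(names.values())} total, {len(names)} distinct, duplicates: {dupes}')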

@glenn-jocher (Member)
@farleylai great! #5214 implements the clipping we discussed. The dataset now caches with zero corrupt images.

@farleylai (Contributor, Author) commented Oct 16, 2021

You're quick! I wish I had computing resources that fast for validation.
Glad Objects365 is finally usable and worth training on.

BTW, I can confirm those duplicate labels are indeed in the original annotations. Perhaps they were introduced when the creators merged annotations from multiple workers. Take WARNING: ../datasets/Objects365/images/val/objects365_v1_00669134.jpg: 27 duplicate labels removed for example. The original annotations, sorted by category id and bbox, show almost every entry duplicated (a de-duplication sketch follows the listing below):

(9, ['0.56549', '0.54617', '101.01782', '106.62991'])
(63, ['488.57837', '101.07703', '144.38501', '126.85852'])
(63, ['488.57837', '101.07703', '144.38501', '126.85852'])
(63, ['495.71875', '237.90430', '144.43140', '171.04572'])
(63, ['495.71875', '237.90430', '144.43140', '171.04572'])
(83, ['156.13245', '238.91760', '53.47083', '50.74274'])
(83, ['156.13245', '238.91760', '53.47083', '50.74274'])
(83, ['194.32593', '206.18036', '45.28650', '43.10406'])
(83, ['194.32593', '206.18036', '45.28650', '43.10406'])
(83, ['215.60510', '145.78528', '41.15546', '28.99585'])
(83, ['215.60510', '145.78528', '41.15546', '28.99585'])
(105, ['183.95911', '291.84283', '48.56024', '50.74280'])
(105, ['209.05768', '251.46692', '51.83398', '51.28833'])
(105, ['435.12170', '84.78058', '28.64856', '20.17505'])
(105, ['435.12170', '84.78058', '28.64856', '20.17505'])
(105, ['460.94580', '65.00900', '26.63110', '26.63107'])
(105, ['460.94580', '65.00900', '26.63110', '26.63107'])
(108, ['243.19794', '368.39868', '53.78265', '55.65332'])
(108, ['243.19794', '368.39868', '53.78265', '55.65332'])
(108, ['275.93524', '355.30377', '22.91608', '24.78680'])
(108, ['275.93524', '355.30377', '22.91608', '24.78680'])
(108, ['285.28870', '323.50183', '57.05640', '52.84735'])
(108, ['285.28870', '323.50183', '57.05640', '52.84735'])
(108, ['294.17456', '374.47845', '55.18567', '52.37964'])
(108, ['294.17456', '374.47845', '55.18567', '52.37964'])
(108, ['317.09064', '291.23230', '42.55841', '39.28467'])
(108, ['317.09064', '291.23230', '42.55841', '39.28467'])
(108, ['334.86230', '357.17450', '40.68774', '32.26953'])
(108, ['334.86230', '357.17450', '40.68774', '32.26953'])
(108, ['341.87744', '318.35742', '43.49377', '44.89685'])
(108, ['341.87744', '318.35742', '43.49377', '44.89685'])
(142, ['243.39313', '18.17514', '87.40271', '81.65799'])
(142, ['243.39313', '18.17514', '87.40271', '81.65799'])
(142, ['294.40613', '-1.30167', '40.78308', '28.20508'])
(142, ['294.40613', '-1.30167', '40.78308', '28.20508'])
(142, ['322.17871', '0.12009', '66.47534', '67.70639'])
(142, ['322.17871', '0.12009', '66.47534', '67.70639'])
(153, ['27.74731', '265.87366', '158.26715', '155.37903'])
(153, ['27.74731', '265.87366', '158.26715', '155.37903'])
(159, ['324.27277', '171.03973', '62.93573', '103.82391'])
(159, ['324.27277', '171.03973', '62.93573', '103.82391'])
(196, ['107.10449', '155.60645', '77.16638', '107.09766'])
(196, ['107.10449', '155.60645', '77.16638', '107.09766'])
(217, ['254.92322', '275.26453', '44.09509', '44.49591'])
(217, ['254.92322', '275.26453', '44.09509', '44.49591'])
(217, ['295.96729', '265.96484', '43.49384', '41.38922'])
(217, ['295.96729', '265.96484', '43.49384', '41.38922'])
(251, ['372.37653', '171.03973', '68.54779', '77.36688'])
(251, ['372.37653', '171.03973', '68.54779', '77.36688'])
(251, ['451.74768', '193.08728', '96.60828', '67.74609'])
(251, ['451.74768', '193.08728', '96.60828', '67.74609'])
(262, ['127.67511', '96.63177', '102.23822', '75.09027'])
(262, ['127.67511', '96.63177', '102.23822', '75.09027'])
(265, ['0.67590', '91.43323', '138.55151', '333.86279'])
(289, ['158.08105', '168.23370', '75.76337', '75.76337'])
(289, ['158.08105', '168.23370', '75.76337', '75.76337'])
(289, ['230.10303', '156.07416', '50.50891', '71.55432'])
(289, ['230.10303', '156.07416', '50.50891', '71.55432'])
(297, ['45.11407', '289.19269', '89.52289', '112.98071'])
(297, ['64.12659', '290.98798', '123.71802', '120.64038'])
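
For reference, dropping such exact-duplicate rows from a parsed label array is a one-liner with np.unique; this is a generic sketch using values from the listing above, not necessarily PR #5210's exact implementation:

import numpy as np

labels = np.array([[63, 488.57837, 101.07703, 144.38501, 126.85852],
                   [63, 488.57837, 101.07703, 144.38501, 126.85852],  # exact duplicate
                   [83, 156.13245, 238.91760, 53.47083, 50.74274]])

unique, idx = np.unique(labels, axis=0, return_index=True)
if len(unique) < len(labels):
    print(f'WARNING: {len(labels) - len(unique)} duplicate labels removed')
labels = labels[np.sort(idx)]  # keep the first occurrence of each row, in original order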

@farleylai (Contributor, Author)
A recent result combining Copy-Paste augmentation with self-training on Objects365, which seemingly boosts COCO performance by 1.5% without TTA, may deserve a look: https://arxiv.org/abs/2012.07177

@glenn-jocher (Member) commented Oct 17, 2021

@farleylai yes, the data will always have issues, so the best thing to do is fix what's fixable and ignore (but notify the user about) problem images/labels.

Though I'm also surprised that an organization would expend the resources to label almost 2 million images and not do basic cleaning and checking of their data.

@glenn-jocher (Member) commented Oct 20, 2021

@farleylai I trained a YOLOv5m model on Objects365 following this PR and the other related fixes. Everything works well. mAP@0.5:0.95 was only 18.5 after 30 epochs, but person mAP was similar to COCO, about 55 mAP@0.5:0.95. I'm sure this could also be improved with more epochs and additional tweaks, but at first glance all is good here.

DDP train command:

python train.py --data Objects365.yaml --batch 224 --weights --cfg yolov5m.yaml --epochs 30 --img 640 --hyp hyp.scratch-low.yaml --device 0,1,2,3,4,5,6

Results

# YOLOv5m v6.0 COCO 300 epochs
                 all       5000      36335      0.726      0.569      0.633      0.439
              person       5000      10777      0.792      0.735      0.807      0.554

# YOLOv5m v6.0 Objects365 30 epochs
                 all      80000    1239576      0.626      0.265      0.273      0.185
              Person      80000      80332      0.599      0.765      0.759       0.57

@farleylai (Contributor, Author) commented Oct 20, 2021

Looks very promising and somewhat manageable compared with OpenImages.
The 2019 paper was based on v1, where the selling point was much better transfer to COCO.
v2 is nearly three times larger (600K+ vs 1742K+ images), but there are not many results on v2 yet.
There is one on one-shot detection, though it was rejected from ICLR 2021.

@glenn-jocher self-assigned this on Oct 23, 2021
@glenn-jocher (Member) commented Oct 23, 2021

@farleylai trained model uploaded to https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5m_Objects365.pt (a quick loading sketch follows below). This is just a first-time training, so I'm sure there's room for improvement in the results.

Yes I agree, I like this dataset. It's got more classes and more images than COCO. Here are YOLOv5m detections for both. With Objects365 you get additional useful categories, e.g. shoes and sunglasses.

[Side-by-side detection examples: COCO (left) vs. Objects365 (right).]
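
For anyone wanting to try the uploaded checkpoint, a minimal loading sketch using yolov5's standard torch.hub entry point (assuming the .pt from the release link above has been downloaded locally; this is generic yolov5 usage, not specific to this PR):

import torch

model = torch.hub.load('ultralytics/yolov5', 'custom', path='yolov5m_Objects365.pt')  # load the released checkpoint
results = model('https://ultralytics.com/images/zidane.jpg')                          # run inference on a sample image
results.print()                                                                       # print a detection summary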

@farleylai (Contributor, Author)
While COCO is widely used for benchmarking, its limited number of classes does not help much in detecting rich contextual objects beyond persons in diverse real applications. Though transfer from Objects365 to COCO is likely to improve benchmark results, the other direction, from COCO to Objects365, could be more useful in practice. Before that, I think a well-tuned baseline is necessary, and the results should be at least as good as, or even much better than, v1.

@glenn-jocher (Member)
@farleylai yes good points!

@ahong007007
[Quotes @glenn-jocher's Objects365 training results comment above.]

Thank you for your wonderful work.
Could you tell me: did you use 8 GPUs? Were they V100s with 32 GB memory? How long did the training take?

@glenn-jocher (Member)
@ahong007007 yes, we used an AWS P4d instance with 8x A100 and DDP for Objects365 training. For 30 epochs of YOLOv5m it was pretty fast, about 1.5 days. The training command is in #5194 (comment).

@glenn-jocher mentioned this pull request on Apr 7, 2022
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request on Aug 26, 2022: Update Objects365.yaml to include the official validation set (ultralytics#5194)

* Update Objects365.yaml

Download the official Objects365 validation set and convert the labels

* Enforce 4-space indent, reformat and cleanup

* shorten list comprehension

Co-authored-by: Glenn Jocher <[email protected]>
@sibozhang commented Aug 11, 2023

[Quotes @glenn-jocher's Objects365 training results comment above.]

According to https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training/#faq, should we train with DDP using the following command?

python -m torch.distributed.run --nproc_per_node 8 train.py --data Objects365.yaml --weights yolov5m.pt --batch 128 --freeze 10 --device 0,1,2,3,4,5,6,7 --epochs 200 --hyp hyp.scratch-low.yaml

Thanks! @glenn-jocher

@glenn-jocher (Member)
@sibozhang training using DDP with multiple GPUs can be done using the torch.distributed.run module. You can use the following command as a template for training with 8 GPUs:

python -m torch.distributed.run --nproc_per_node 8 train.py --data Objects365.yaml --weights yolov5m.pt --batch 128 --freeze 10 --device 0,1,2,3,4,5,6,7 --epochs 200 --hyp hyp.scratch-low.yaml

This command will distribute the training across the specified GPUs. Adjust the batch size, number of epochs, and other parameters as desired. Good luck with your training!

@sibozhang commented Aug 12, 2023

[Quotes @glenn-jocher's DDP training command reply above.]

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808113 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806776 milliseconds before timing out.

I cannot start training Objects365 on 8x A100 40GB because of an NCCL timeout. I also tried adding --cache:

python -m torch.distributed.run --nproc_per_node 8 train.py --data Objects365.yaml --weights runs/train/exp90/weights/best.pt --batch 128 --workers 8 --freeze 10 --img 640 --device 0,1,2,3,4,5,6,7 --epochs 100 --cache

How can I change the NCCL timeout settings? @glenn-jocher

@glenn-jocher (Member)
@sibozhang it looks like you're encountering NCCL timeout issues during training on A100x8 40GB. You can set the NCCL_SOCKET_IFNAME environment variable to select specific network interfaces for interprocess communication, and NCCL_DEBUG=INFO to print more detailed information about NCCL's operation. The collective timeout itself is configured when the process group is initialized (see the sketch below).

If the issue persists, please refer to the official documentation for NVIDIA NCCL or reach out to the NVIDIA support channels for further assistance. Good luck with your training!
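
As a general PyTorch note (not a yolov5 command-line option), the 30-minute default seen in the log above is the NCCL process-group timeout, and it can be raised wherever the process group is initialized; a minimal sketch:

from datetime import timedelta
import torch.distributed as dist

# Default collective timeout is 30 minutes (the 1800000 ms in the log above).
# Raising it helps if dataset scanning/caching keeps some ranks busy past that limit.
dist.init_process_group(backend='nccl', timeout=timedelta(hours=2))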

@p2p-sys commented Jun 6, 2024

@glenn-jocher Are there plans to release new versions of Objects365 models for newer versions of YOLO? I have made a room classifier (https://github.com/p2p-sys/yolo5-classificator) using the COCO and Objects365 models, and I would like to use the newer YOLO versions. Unfortunately I could not train Objects365 on my own; during training the program is killed by the system.

@glenn-jocher (Member)
Hello @p2p-sys! We're always working on improving and updating our models, including those trained on different datasets like Objects365. Keep an eye on our GitHub releases for any updates on new versions of YOLO trained with Objects365.

Regarding the issue with training being killed, it might be related to system resource limitations. Ensure you have sufficient memory and processing power, or consider reducing the batch size or using a simpler model. If the problem persists, please open an issue with detailed logs and system specs for further assistance.

Thank you for using YOLOv5 for your room classifier project! 🚀
