
Autofix duplicate label handling #5210

Merged 5 commits on Oct 15, 2021

Conversation

glenn-jocher (Member) commented Oct 15, 2021

Changes duplicate label handling from "report an error and ignore the image-label pair" to "report a warning and autofix the image-label pair". Also improves introspection of problematic labels with more detailed error reporting.

This should resolve a common issue and let users get a model trained faster and more easily than before. An example is the Objects365 autodownload in #5194, which reports numerous duplicate labels in the dataset:

train: Scanning '../datasets/Objects365/labels/train' images and labels...1742289 found, 0 missing, 0 empty, 111771 corrupted: 100%|███████| 1742289/1742289 [09:21<00:00, 3101.89it/s]
val: Scanning '../datasets/Objects365/labels/val' images and labels...80000 found, 0 missing, 0 empty, 7312 corrupted: 100%|███████| 80000/80000 [00:06<00:00, 12564.88it/s]

I also see a significant number of 'corrupted' messages due to duplicate labels. A duplicate label is simply two identical rows (same class, same exact coordinates). We handle these as errors rather than warnings, so an image is rejected if it contains any:

train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051762.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051841.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051849.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051921.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00051929.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052000.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052013.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052022.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052046.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052076.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052150.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052158.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052161.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052200.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052205.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052225.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052235.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052255.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052318.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052417.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052420.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052444.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052472.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052486.jpg: non-normalized or out of bounds coordinate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052490.jpg: duplicate labels
train: WARNING: Ignoring corrupted image and/or label ../datasets/Objects365/images/train/objects365_v1_00052528.jpg: duplicate labels

Originally posted by @glenn-jocher in #5194 (comment)

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Updates to image and label verification plus cache version increment in YOLOv5 dataset handling.

📊 Key Changes

  • 🔄 Incremented the .cache version from 0.5 to 0.6 for dataset labels.
  • 🛠 Improved warning messages for corrupt JPEG images to include the image file path.
  • 🔍 Added more detailed assertions and error messages during label verification to help identify the specific issues.
  • 🧹 Implemented automatic removal of duplicate labels and a corresponding warning message if duplicates are found and removed.
  • 📝 Updated warning messages to include the image file path when identifying corrupt images or labels during exception handling.
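
The duplicate removal described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `remove_duplicate_labels` is hypothetical, and it assumes labels are stored as an `(n, 5)` NumPy array of `[class, x_center, y_center, width, height]` rows.

```python
import numpy as np


def remove_duplicate_labels(labels: np.ndarray):
    """Return (deduplicated labels, number of rows removed).

    Hypothetical helper: `labels` is assumed to be an (n, 5) array of
    [class, x_center, y_center, width, height] rows.
    """
    # With axis=0, np.unique treats each row as one item; return_index gives
    # the position of the first occurrence of every distinct row.
    _, first_idx = np.unique(labels, axis=0, return_index=True)
    n_removed = len(labels) - len(first_idx)
    if n_removed:
        # Sort the indices so the surviving rows keep their original order.
        labels = labels[np.sort(first_idx)]
    return labels, n_removed


labels = np.array([
    [0, 0.5, 0.5, 0.2, 0.2],
    [1, 0.3, 0.3, 0.1, 0.1],
    [0, 0.5, 0.5, 0.2, 0.2],  # exact copy of the first row
])
labels, n_removed = remove_duplicate_labels(labels)
if n_removed:
    print(f"WARNING: {n_removed} duplicate labels removed")
```

Only exact row matches count as duplicates in this sketch; near-identical boxes with slightly different coordinates are left untouched.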

🎯 Purpose & Impact

  • 📈 The .cache version increment invalidates label caches written by earlier versions, so datasets are rescanned with the updated verification and duplicate handling.
  • 📊 More informative error messages aid in quicker debugging and fixing of dataset issues by clearly pointing out the specific problems.
  • ♻️ Automatic duplicate label removal ensures dataset integrity with less manual intervention, leading to cleaner training data.
  • 🚀 A more robust data loading pipeline reduces errors and simplifies dataset maintenance.
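
One way a cache version bump can force a rescan is sketched below. Only the version value 0.6 comes from this PR; the function `load_or_rebuild_cache`, the `build_fn` callback, and the pickle-based file layout are assumptions made for illustration and may differ from what the repository does.

```python
import pickle
from pathlib import Path

CACHE_VERSION = 0.6  # value from this PR; the rest of this sketch is illustrative


def load_or_rebuild_cache(cache_path, build_fn):
    """Load a label cache, rebuilding it when missing, unreadable, or stale.

    Hypothetical helper: `build_fn` is assumed to rescan the dataset and
    return a dict of cache contents.
    """
    cache_path = Path(cache_path)
    try:
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
        if cache.get("version") != CACHE_VERSION:
            raise ValueError("stale cache version")
    except (OSError, ValueError, EOFError, AttributeError, pickle.UnpicklingError):
        # Missing file, corrupt file, or a cache written by an older version:
        # rescan and store the result under the current version.
        cache = build_fn()
        cache["version"] = CACHE_VERSION
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)
    return cache
```

With this pattern, a cache saved under version 0.5 fails the version comparison and is rebuilt on the next run, so labels are verified again under the new rules.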

glenn-jocher changed the title from "Autofix duplicate labels" to "Autofix duplicate label handling" on Oct 15, 2021

glenn-jocher (Member, Author) commented Oct 15, 2021

Example error/warning handling with improved introspection for problematic labels:

train: Scanning '../datasets/coco128/labels/train2017' images and labels...128 found, 0 missing, 2 empty, 4 corrupted: 100%|██████████| 128/128 [00:01<00:00, 68.05it/s]
train: WARNING: ../datasets/coco128/images/train2017/000000000064.jpg: 3 duplicate labels removed
train: WARNING: ../datasets/coco128/images/train2017/000000000089.jpg: ignoring corrupt image/label: negative label values [  -0.042141]
train: WARNING: ../datasets/coco128/images/train2017/000000000144.jpg: ignoring corrupt image/label: non-normalized or out of bounds coordinates [      1.537]
train: WARNING: ../datasets/coco128/images/train2017/000000000328.jpg: ignoring corrupt image/label: negative label values [   -0.91823]
train: WARNING: ../datasets/coco128/images/train2017/000000000419.jpg: ignoring corrupt image/label: non-normalized or out of bounds coordinates [     1.8188]
train: New cache created: ../datasets/coco128/labels/train2017.cache

Most importantly, images with duplicate labels are no longer ignored :)
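
The more detailed messages in the log above, which print the offending values, could be produced by checks along these lines. This is a sketch under assumptions: `check_labels` is a hypothetical name, labels are assumed to be an `(n, 5)` array of `[class, x_center, y_center, width, height]` rows, and the message wording is copied from the log rather than from the source code.

```python
import numpy as np


def check_labels(labels: np.ndarray) -> None:
    """Raise AssertionError with a message that includes the offending values.

    Hypothetical helper: `labels` is assumed to be an (n, 5) array of
    [class, x_center, y_center, width, height] rows with normalized coordinates.
    """
    assert labels.ndim == 2 and labels.shape[1] == 5, \
        f"labels require 5 columns, shape {labels.shape} detected"
    negative = labels[labels < 0]
    assert negative.size == 0, f"negative label values {negative}"
    coords = labels[:, 1:]  # skip the class column, which may exceed 1
    too_large = coords[coords > 1]
    assert too_large.size == 0, \
        f"non-normalized or out of bounds coordinates {too_large}"


try:
    check_labels(np.array([[0, 0.5, 1.537, 0.2, 0.2]]))
except AssertionError as e:
    print(f"WARNING: ignoring corrupt image/label: {e}")
```

Including the bad values in the message lets a user locate the problem row in the label file without re-running the scan.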

glenn-jocher self-assigned this Oct 15, 2021
glenn-jocher added the enhancement label Oct 15, 2021
glenn-jocher merged commit 991c654 into master Oct 15, 2021
glenn-jocher deleted the update/duplicate_labels branch October 15, 2021 19:32
BjarneKuehl pushed a commit to fhkiel-mlaip/yolov5 that referenced this pull request Aug 26, 2022
* Autofix duplicate labels

PR changes duplicate label handling from report error and ignore image-label pair to report warning and autofix image-label pair. 

This should fix this common issue for users and allow everyone to get started and get a model trained faster and easier than before.

* sign fix

* Cleanup

* Increment cache version

* all to any fix