Training crashing for instance segmentation at U7 #795

Open

alrightkami opened this issue Sep 15, 2022 · 9 comments

@alrightkami

I'm trying to start training for instance segmentation on the u7 branch, but I'm getting an error and can't figure out what it refers to.


Starting training for 300 epochs...
      Epoch    GPU_mem   box_loss   seg_loss   obj_loss   cls_loss  Instances       Size
  0%|          | 0/126 [00:00<?, ?it/s]                                         
Traceback (most recent call last):
  File "/home/data/yolov7/seg/segment/train.py", line 681, in <module>
    main(opt)
  File "/home/data/yolov7/seg/segment/train.py", line 577, in main
    train(opt.hyp, opt, device, callbacks)
  File "/home/data/yolov7/seg/segment/train.py", line 295, in train
    for i, (imgs, targets, paths, _, masks) in pbar:  # batch ------------------------------------------------------
  File "/opt/conda/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/home/data/yolov7/seg/utils/dataloaders.py", line 171, in __iter__
    yield next(self.iterator)
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/data/yolov7/seg/utils/segment/dataloaders.py", line 116, in __getitem__
    img, labels, segments = mixup(img, labels, segments, *self.load_mosaic(random.randint(0, self.n - 1)))
  File "/home/data/yolov7/seg/utils/segment/augmentations.py", line 21, in mixup
    segments = np.concatenate((segments, segments2), 0)
  File "<__array_function__ internals>", line 180, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 3 dimension(s) and the array at index 1 has 1 dimension(s)


The script:
!cd data/yolov7/seg && python segment/train.py --data data/graffiti.yaml --batch 16 --weights yolov7-seg.pt --cfg yolov7-seg.yaml --epochs 300 --name yolov7-seg --img 640 --hyp hyp.scratch-high.yaml

To generate data I used this script.

@aqsc

aqsc commented Sep 16, 2022

You should use segmentation labels instead of detection labels.
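
For illustration (my own made-up values, not from the repo docs), a detection label line versus a segmentation label line for the same class; detection uses a normalized box, segmentation uses normalized polygon vertices:

# detection label: class x_center y_center width height
0 0.50 0.50 0.20 0.10
# segmentation label: class x1 y1 x2 y2 ... xn yn
0 0.40 0.45 0.60 0.45 0.60 0.55 0.40 0.55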

@alrightkami

@aqsc I already do; I converted them from COCO format with the tool mentioned above, which was suggested by WongKinYiu. This is an example of my label file:

0 0.00394477 0.00758663 0.00760355 0.00758663 0.0515335 0.00605611 0.10645 0.00605611 0.168683 0.00605611 0.205291 0.0106518 0.245562 0.0152434 0.273018 0.0167781 0.315118 0.0121823 0.371859 0.00299505 0.41396 0.0167781 0.434093 0.0351526 0.439586 0.0719059 0.439586 0.108659 0.42677 0.120912 0.377352 0.120912 0.3261 0.108659 0.30858 0.0872112 0.282002 0.0993399 0.227253 0.105598 0.190646 0.0902847 0.174172 0.0765017 0.154038 0.0795627 0.110108 0.0918152 0.0606854 0.08875 0.0368935 0.0872195 0.0222485 0.0826279 0.000281065 0.0657797
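
For reference, here is a minimal parsing sketch (my own helper, not part of the repo) that reads one line in this format into a class id and an (n, 2) polygon array:

import numpy as np

def parse_seg_label_line(line: str):
    # format: class_id x1 y1 x2 y2 ... xn yn, all coordinates normalized to 0..1
    parts = line.split()
    class_id = int(parts[0])
    coords = np.array(parts[1:], dtype=np.float32)
    assert coords.size % 2 == 0, "polygon needs an even number of coordinates"
    polygon = coords.reshape(-1, 2)  # (num_points, 2) as (x, y)
    assert ((polygon >= 0.0) & (polygon <= 1.0)).all(), "coordinates must be normalized"
    return class_id, polygon

cid, poly = parse_seg_label_line("0 0.00394477 0.00758663 0.00760355 0.00758663 0.0515335 0.00605611")
print(cid, poly.shape)  # -> 0 (3, 2)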

@prateekgml

@alrightkami I was able to train a custom instance segmentation model. Maybe this post can help:
https://dsbyprateekg.blogspot.com/2022/09/how-to-train-custom-dataset-with-yolov7.html

@dilpreetsingh

I'm getting the same crash, but mine occurs somewhat randomly. Here the crash occurred on the 8th epoch:

Epoch    GPU_mem   box_loss   seg_loss   obj_loss   cls_loss  Instances       Size
      8/299      7.44G    0.04525    0.02051    0.02537     0.0155         33        640:  14%|█▍
Traceback (most recent call last):
  File "segment/train.py", line 681, in <module>
    main(opt)
  File "segment/train.py", line 577, in main
    train(opt.hyp, opt, device, callbacks)
  File "segment/train.py", line 295, in train
    for i, (imgs, targets, paths, _, masks) in pbar:  # batch ------------------------------------------------------
  File "/home/dilpreet/anaconda3/envs/yolov7/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/ssd/home/dilpreet/Documents/YOLOv7Seg/yolov7/seg/utils/dataloaders.py", line 171, in __iter__
    yield next(self.iterator)
  File "/home/dilpreet/anaconda3/envs/yolov7/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/dilpreet/anaconda3/envs/yolov7/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1204, in _next_data
    return self._process_data(data)
  File "/home/dilpreet/anaconda3/envs/yolov7/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
    data.reraise()
  File "/home/dilpreet/anaconda3/envs/yolov7/lib/python3.8/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 4.
Original Traceback (most recent call last):
  File "/home/dilpreet/anaconda3/envs/yolov7/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/dilpreet/anaconda3/envs/yolov7/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/dilpreet/anaconda3/envs/yolov7/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/ssd/home/dilpreet/Documents/YOLOv7Seg/yolov7/seg/utils/segment/dataloaders.py", line 116, in __getitem__
    img, labels, segments = mixup(img, labels, segments, *self.load_mosaic(random.randint(0, self.n - 1)))
  File "/ssd/home/dilpreet/Documents/YOLOv7Seg/yolov7/seg/utils/segment/augmentations.py", line 21, in mixup
    segments = np.concatenate((segments, segments2), 0)
  File "<__array_function__ internals>", line 180, in concatenate
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 3 dimension(s) and the array at index 1 has 1 dimension(s)

@Nikunj2696

@dilpreetsingh Can you share which tool you used for annotation? I am fairly sure the issue is with the annotations.

@dilpreetsingh

@Nikunj2696 I wrote my own polygon conversion script because my data is in PascalVOC and I couldn't find anything that would go directly from PascalVOC to Darknet. As I understand it, the format is simply:

label_id, x1, y1, x2, y2, ..., xn, yn

It seems to work fine for a certain number of epochs but randomly runs into trouble with the mixup augmentation. I ran an experiment where I set the mixup probability to 0.0, and that worked perfectly. My current theory is that the mixup augmentation doesn't handle images with 0 annotations/segments (I have some no-annotation images in the dataset).
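
To illustrate the theory, here is a standalone reproduction of the same ValueError (not the repo's mixup code; the shapes are only illustrative of a labelled image versus an unlabelled one):

import numpy as np

# segments for an image with two instances, each polygon resampled to a
# fixed number of points: a 3-dimensional array (num_instances, num_points, 2)
segments = np.zeros((2, 500, 2), dtype=np.float32)

# segments for an image with no annotations: an empty, 1-dimensional array
segments2 = np.array([], dtype=np.float32)

try:
    # mirrors the failing line in utils/segment/augmentations.py
    np.concatenate((segments, segments2), 0)
except ValueError as e:
    print(e)
# all the input arrays must have same number of dimensions, but the array
# at index 0 has 3 dimension(s) and the array at index 1 has 1 dimension(s)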

@Nikunj2696

@dilpreetsingh Please use https://roboflow.com/ for annotation. I faced the same issue and solved it; this annotation tool will help you.

@rrichards7

rrichards7 commented Sep 27, 2022

I ran into this problem when using the "hyp.scratch-high.yaml" hyperparameter config file, which has mixup enabled.
I think the crashing happens "randomly" because mixup is applied with a random probability at each step, so it only occasionally hits a problematic image.

I would suggest either using hyp.scratch-low.yaml to work around this problem temporarily if mixup is not needed, or keeping hyp.scratch-high.yaml but setting mixup to 0.0.
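
If you want to keep the rest of hyp.scratch-high.yaml, one quick way is to write a copy with mixup zeroed out and point --hyp at it (a sketch assuming PyYAML is installed; the file paths are placeholders, adjust them to your checkout):

import yaml

with open("hyp.scratch-high.yaml") as f:
    hyp = yaml.safe_load(f)

hyp["mixup"] = 0.0  # disable the mixup augmentation entirely

with open("hyp.scratch-high-nomixup.yaml", "w") as f:
    yaml.safe_dump(hyp, f, sort_keys=False)

# then: python segment/train.py ... --hyp hyp.scratch-high-nomixup.yaml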

@Joe-KI333

Joe-KI333 commented Sep 29, 2022

To learn how to train on a custom dataset for instance segmentation, read this blog post:

https://medium.com/augmented-startups/yolov7-segmentation-on-crack-using-roboflow-dataset-f13ae81b9958
