RuntimeError: CUDA error: device-side assert triggered #2124

hdnh2006 · 2021-02-03T14:26:51Z

🐛 Bug

Hi! I am trying to train YOLO on my own dataset.

The first epoch apparently runs correctly, but training fails when it starts evaluating the validation set. The error appears to be related to CUDA, but the logs suggest the problem is with the boxes in the general.py code.

At first I thought the problem was that I had not cloned the latest commit, so I created a new virtualenv and cloned the latest version of the repo, but the error was still there.

Then I changed the batch size to 2 and the error was the same.

Could you help me fix this issue?

To Reproduce (REQUIRED)

Input:

python train.py --weights yolov5s.pt --cfg models/yolov5s.yaml --data my_dataset/data.yaml --epochs 300 --batch-size 16 --cache-images --workers 12 --project my_project/train/

Output:

Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
     0/299     4.75G   0.07158   0.05279   0.03193    0.1563        69       640: 100%|█| 867/867 [02:49<00:00,  5.11i
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   3%| | 3/111 [00:00<00:
Traceback (most recent call last):
  File "train.py", line 522, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 340, in train
    results, maps, times = test.test(opt.data,
  File "/home/henry/Projects/yolo/yolov5torch1.7/test.py", line 114, in test
    loss += compute_loss([x.float() for x in train_out], targets)[1][:3]  # box, obj, cls
  File "/home/henry/Projects/yolo/yolov5torch1.7/utils/loss.py", line 133, in __call__
    iou = bbox_iou(pbox.T, tbox[i], x1y1x2y2=False, CIoU=True)  # iou(prediction, target)
  File "/home/henry/Projects/yolo/yolov5torch1.7/utils/general.py", line 272, in bbox_iou
    b1_y1, b1_y2 = box1[1] - box1[3] / 2, box1[1] + box1[3] / 2
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [101,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [13,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [50,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [88,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [89,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Environment

OS: Ubuntu 20.04
GPU: RTX 2070 Super
CUDA: 11.2

1st Update

I read that setting CUDA_LAUNCH_BLOCKING="1" before python train.py gives more precise CUDA logs, and these are the logs I get with it:

Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
     0/299     4.75G   0.07159   0.05279   0.03194    0.1563        69       640: 100%|█| 867/867 [03:52<00:00,  3.74i
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   3%| | 3/111 [00:00<00:/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [100,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [101,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [13,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [50,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [88,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [89,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   3%| | 3/111 [00:00<00:
Traceback (most recent call last):
  File "train.py", line 522, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 340, in train
    results, maps, times = test.test(opt.data,
  File "/home/henry/Projects/yolo/yolov5torch1.7/test.py", line 114, in test
    loss += compute_loss([x.float() for x in train_out], targets)[1][:3]  # box, obj, cls
  File "/home/henry/Projects/yolo/yolov5torch1.7/utils/loss.py", line 142, in __call__
    t[range(n), tcls[i]] = self.cp
RuntimeError: CUDA error: device-side assert triggered
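
With synchronous launches the traceback moves to `t[range(n), tcls[i]] = self.cp`, an indexed assignment whose column index is the target class. The sketch below is a plain-Python CPU analogue (not PyTorch) of how such an assignment goes out of bounds when a class id is not below the number of classes. The helper `one_hot_targets` and the values of `nc` and `tcls` are hypothetical and chosen only for illustration; nothing in the logs confirms that this was the cause in this particular run.

```python
def one_hot_targets(tcls, nc, cn=0.0, cp=1.0):
    """Build an n x nc target matrix, mimicking t[range(n), tcls] = cp on the CPU."""
    t = [[cn] * nc for _ in tcls]
    for row, cls in enumerate(tcls):
        # Reject negative ids too, which is stricter than the device-side assertion.
        if not 0 <= cls < nc:
            raise IndexError(f"class id {cls} is out of bounds for nc={nc}")
        t[row][cls] = cp
    return t


# All class ids are below nc, so every row gets exactly one positive entry.
ok = one_hot_targets([0, 2, 1], nc=3)

# A class id equal to nc has no column to write to.
try:
    one_hot_targets([0, 3], nc=3)
    error = None
except IndexError as e:
    error = str(e)
```

On the CPU the failure is a readable IndexError at the offending line, which is why re-running a failing step on the CPU, or with CUDA_LAUNCH_BLOCKING=1 as above, tends to be easier to debug than the asynchronous CUDA report.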

2nd Update

Because the problem occurred while evaluating the validation dataset, I added '--notest' to the command. Now I don't receive any further output. It seems to be still working: the memory of my GPU increased from 4GB to 6GB at this step, but the GPU utilization dropped to almost 0:

Starting training for 300 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
     0/299     4.75G   0.07158    0.0528   0.03193    0.1563        69       640: 100%|█| 867/867 [03:41<00:00,  3.91i


hdnh2006 · 2021-02-03T17:16:44Z

False Alarm!

It seems one or more of my test images had out-of-range labels. That is, the labels were wrong and may have been located outside the image.

It would be good to add something like check_labels() to the code, as you have done with check_github and check_requirements in the newest versions.

Anyway, sorry for opening this issue, completely my bad.
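
A minimal sketch of what such a `check_labels()` helper could look like, assuming YOLO-format label files (one `class x y w h` row per line, coordinates normalized to [0, 1]). The function name, its signature, and the reported messages are hypothetical; this is not part of the YOLOv5 code base.

```python
import tempfile
from pathlib import Path


def check_labels(label_dir, nc):
    """Return (file name, line number, problem) tuples for suspicious label rows."""
    problems = []
    for lb_file in sorted(Path(label_dir).glob("*.txt")):
        for ln, line in enumerate(lb_file.read_text().splitlines(), start=1):
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            if len(parts) != 5:
                problems.append((lb_file.name, ln, "expected 5 columns"))
                continue
            try:
                cls, x, y, w, h = (float(p) for p in parts)
            except ValueError:
                problems.append((lb_file.name, ln, "non-numeric value"))
                continue
            # The class id must be an integer in [0, nc).
            if not cls.is_integer() or not 0 <= cls < nc:
                problems.append((lb_file.name, ln, f"class id {parts[0]} not in [0, {nc})"))
            # Centre and size must be normalized.
            if not all(0.0 <= v <= 1.0 for v in (x, y, w, h)):
                problems.append((lb_file.name, ln, "coordinate outside [0, 1]"))
    return problems


# Usage on a throwaway directory with one clean and one broken label file.
with tempfile.TemporaryDirectory() as d:
    Path(d, "good.txt").write_text("0 0.5 0.5 0.2 0.2\n")
    Path(d, "bad.txt").write_text("7 0.5 0.5 0.2 0.2\n1 0.5 1.2 0.2 0.2\n")
    report = check_labels(d, nc=3)
```

Returning a list of problems instead of raising on the first one makes it possible to see every bad row in a downloaded dataset in a single pass.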

glenn-jocher · 2021-02-03T19:47:25Z

@hdnh2006 that's interesting. There's a comprehensive set of tests the labels and images are required to pass before they are included in the train or val sets. You can find these here. If you can determine how your incorrect labels passed these checks we can update them or add an additional check:

yolov5/utils/datasets.py

Lines 441 to 486 in 73a0669

    
    def cache_labels(self, path=Path('./labels.cache'), prefix=''):
        # Cache dataset labels, check images and read shapes
        x = {}  # dict
        nm, nf, ne, nc = 0, 0, 0, 0  # number missing, found, empty, duplicate
        pbar = tqdm(zip(self.img_files, self.label_files), desc='Scanning images', total=len(self.img_files))
        for i, (im_file, lb_file) in enumerate(pbar):
            try:
                # verify images
                im = Image.open(im_file)
                im.verify()  # PIL verify
                shape = exif_size(im)  # image size
                assert (shape[0] > 9) & (shape[1] > 9), f'image size {shape} <10 pixels'
                assert im.format.lower() in img_formats, f'invalid image format {im.format}'

                # verify labels
                if os.path.isfile(lb_file):
                    nf += 1  # label found
                    with open(lb_file, 'r') as f:
                        l = np.array([x.split() for x in f.read().strip().splitlines()], dtype=np.float32)  # labels
                    if len(l):
                        assert l.shape[1] == 5, 'labels require 5 columns each'
                        assert (l >= 0).all(), 'negative labels'
                        assert (l[:, 1:] <= 1).all(), 'non-normalized or out of bounds coordinate labels'
                        assert np.unique(l, axis=0).shape[0] == l.shape[0], 'duplicate labels'
                    else:
                        ne += 1  # label empty
                        l = np.zeros((0, 5), dtype=np.float32)
                else:
                    nm += 1  # label missing
                    l = np.zeros((0, 5), dtype=np.float32)
                x[im_file] = [l, shape]
            except Exception as e:
                nc += 1
                print(f'{prefix}WARNING: Ignoring corrupted image and/or label {im_file}: {e}')

            pbar.desc = f"{prefix}Scanning '{path.parent / path.stem}' for images and labels... " \
                        f"{nf} found, {nm} missing, {ne} empty, {nc} corrupted"

        if nf == 0:
            print(f'{prefix}WARNING: No labels found in {path}. See {help_url}')

        x['hash'] = get_hash(self.label_files + self.img_files)
        x['results'] = [nf, nm, ne, nc, i + 1]
        torch.save(x, path)  # save for next time
        logging.info(f'{prefix}New cache created: {path}')
        return x

glenn-jocher · 2021-02-03T19:49:38Z

@hdnh2006 the label checks are here (L461-L464), they should prevent any negative labels or labels with box values > 1:

assert l.shape[1] == 5, 'labels require 5 columns each'
assert (l >= 0).all(), 'negative labels'
assert (l[:, 1:] <= 1).all(), 'non-normalized or out of bounds coordinate labels'
assert np.unique(l, axis=0).shape[0] == l.shape[0], 'duplicate labels'

hdnh2006 · 2021-02-04T11:36:16Z

Thanks @glenn-jocher, I saw it. Your code is fantastic, elegant and easy to understand.

I did not explain myself well. My problem was not exactly with the label, I mean the class (0, 1, 2, 3...), but with the box itself.

I downloaded a dataset from the internet, and it seems that in some images the box coordinates corresponded to pixels slightly outside the image (by only a few decimals). All the checks in your code passed, but at some later step the code looked up a pixel that did not belong to the image and raised this error, which had nothing to do with the message being returned (CUDA or something like that).

It would be awesome to have this kind of check incorporated into your code, but I think it is not a very common error.

I don't know if I explained it well. Let me know if you have any questions, and thanks again for this fantastic tool you have created.

Best,

H.

glenn-jocher · 2021-02-04T19:54:44Z

@hdnh2006 non-normalized or out of bounds coordinate labels will cause the entire image and label to fail its check, and this image will not be included in training. These checks are run in xywh image space. Are you suggesting that we should also run the checks in a separate image space such as xyxy?

assert (l >= 0).all(), 'negative labels'
assert (l[:, 1:] <= 1).all(), 'non-normalized or out of bounds coordinate labels'
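
To make the xywh versus xyxy distinction concrete, here is a small sketch of a label row whose centre and size each lie in [0, 1], so it passes checks like the two above, while its right edge still falls outside the image once converted to corners. The helper names, the sample row, and the `eps` tolerance are assumptions made for illustration only.

```python
def xywh_to_xyxy(x, y, w, h):
    """Convert a normalized centre/size box to normalized corner coordinates."""
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def passes_xywh_check(row):
    """Mirror the two asserts: no negative values, box values not above 1."""
    return all(v >= 0 for v in row) and all(v <= 1 for v in row[1:])


def passes_xyxy_check(row, eps=1e-6):
    """Additionally require the box corners to stay inside the unit square."""
    x1, y1, x2, y2 = xywh_to_xyxy(*row[1:])
    return min(x1, y1) >= -eps and max(x2, y2) <= 1 + eps


# class, x_center, y_center, width, height
row = [0, 0.98, 0.5, 0.10, 0.2]

xywh_ok = passes_xywh_check(row)  # every value is within [0, 1]
xyxy_ok = passes_xyxy_check(row)  # right edge is 0.98 + 0.05, beyond the image
```

The small tolerance keeps boxes that touch the image border from being rejected because of floating-point rounding.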

hdnh2006 added the bug label Feb 3, 2021

hdnh2006 closed this as completed Feb 3, 2021

This was referenced Apr 11, 2021

YOLOv5 v5.0 Release #2762

Merged

YOLOv5 v5.0 release compatibility update for YOLOv3 ultralytics/yolov3#1737

Merged
