Training Loss : Nan #79

Open
xiaoxingzeng opened this issue Feb 23, 2018 · 21 comments

Comments
@xiaoxingzeng

My training loss always becomes NaN after several hundred iterations.
All parameters are default. My training dataset works with py-faster-rcnn, and I copied it into the faster-rcnn.pytorch directory.
My training command is: python trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --lr 0.001 --cuda

Is there any advice for this? Thanks.

@xiaoxingzeng
Author

xiaoxingzeng commented Feb 23, 2018

This is my training print-out:
[session 1][epoch 1][iter 0] loss: 6.3588, lr: 1.00e-03
fg/bg=(20/236), time cost: 1.659386
rpn_cls: 0.8163, rpn_box: 4.6926, rcnn_cls: 0.8488, rcnn_box 0.0010
[session 1][epoch 1][iter 100] loss: 1.0697, lr: 1.00e-03
fg/bg=(24/232), time cost: 33.015444
rpn_cls: 0.1404, rpn_box: 0.6121, rcnn_cls: 0.2425, rcnn_box 0.1688
[session 1][epoch 1][iter 200] loss: 0.7961, lr: 1.00e-03
fg/bg=(43/213), time cost: 33.076333
rpn_cls: 0.1488, rpn_box: 1.1833, rcnn_cls: 0.3630, rcnn_box 0.2185
[session 1][epoch 1][iter 300] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 33.628527
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 400] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 32.910808
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 500] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 32.843017
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 32.721040
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 700] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 33.876777
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 800] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 33.819963
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan

@JavHaro

JavHaro commented Mar 1, 2018

Hi @xiaoxingzeng ,
I have the same problem but sadly I still have no answer. Are you using a custom dataset? Maybe some of the answers in this issue can help you.
Good luck!

@zeehasham

@JavHaro @xiaoxingzeng I am having the same issue of NaN values with a custom dataset. Did you find a solution? Thanks

@yuanyao366

I have the same problem with my own dataset. Did you find a solution? Thanks a lot!

[session 1][epoch 1][iter 5400/33021] loss: 0.3792, lr: 1.00e-03
fg/bg=(113/399), time cost: 129.585304
rpn_cls: 0.0753, rpn_box: 0.2207, rcnn_cls: 0.1785, rcnn_box 0.2806
[session 1][epoch 1][iter 5500/33021] loss: 0.3538, lr: 1.00e-03
fg/bg=(42/470), time cost: 129.476077
rpn_cls: 0.0916, rpn_box: 0.1274, rcnn_cls: 0.0800, rcnn_box 0.1175
[session 1][epoch 1][iter 5600/33021] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 122.264345
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 5700/33021] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 119.232911
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan

@jwyang
Owner

jwyang commented Mar 14, 2018

Hi all, if possible it would be good to share your data loaders with us, so that we can check why this happened.

@yuanyao366

@jwyang I use the Caltech dataset for pedestrian detection. When training, I print num_boxes and gt_boxes right after the "print loss" block, like this:

    print("[session %d][epoch %2d][iter %4d] loss: %.4f, lr: %.2e"
          % (args.session, epoch, step, loss_temp, lr))
    print("\t\t\tfg/bg=(%d/%d), time cost: %f" % (fg_cnt, bg_cnt, end-start))
    print("\t\t\trpn_cls: %.4f, rpn_box: %.4f, rcnn_cls: %.4f, rcnn_box %.4f"
          % (loss_rpn_cls, loss_rpn_box, loss_rcnn_cls, loss_rcnn_box))

    # dump the ground-truth boxes that were fed into the network
    gt_boxes_cpu = gt_boxes.cpu().data.numpy()
    im_data_cpu = im_data.cpu().data
    for i in range(args.batch_size):
        num_gt = num_boxes.data[i]
        print("the %dth image have num_boxes: %d and the gt_boxes are:" % (i, num_gt))
        print(gt_boxes_cpu[i][:num_gt])
        img = im_data_cpu[i].permute(1, 2, 0).numpy()
        #_vis_minibatch(img, gt_boxes_cpu[i][:num_gt])

I even visualize the image and gt_boxes like this:

    def _vis_minibatch(im_blob, rois_blob):
        """Visualize a mini-batch for debugging."""
        import matplotlib.pyplot as plt
        for i in xrange(rois_blob.shape[0]):
            rois = rois_blob[i, :]
            #im_ind = rois[0]
            roi = rois[:4]
            im = im_blob[:, :, :].copy()
            im += cfg.PIXEL_MEANS  # undo the mean subtraction before display
            im = im[:, :, (2, 1, 0)]  # BGR -> RGB
            im = im.astype(np.uint8)
            plt.imshow(im)
            plt.gca().add_patch(
                plt.Rectangle((roi[0], roi[1]), roi[2] - roi[0],
                              roi[3] - roi[1], fill=False,
                              edgecolor='r', linewidth=3)
            )
            plt.show()

@yuanyao366

My training command is: python trainval_net.py --dataset caltech --net vgg16 --bs 2 --gpu 0 --cuda
This is my training print-out:

[session 1][epoch 1][iter 100] loss: 0.6892, lr: 1.00e-03
fg/bg=(19/493), time cost: 125.878273
rpn_cls: 0.2557, rpn_box: 0.4797, rcnn_cls: 0.1704, rcnn_box 0.0634
the 0th image have num_boxes: 3 and the gt_boxes are:
[[ 465. 215. 496.25 271.25 1. ]
[ 536.25 206.25 561.25 255. 1. ]
[ 168.75 216.25 218.75 303.75 1. ]]
the 1th image have num_boxes: 3 and the gt_boxes are:
[[ 548.75 210. 550. 238.75 1. ]
[ 175. 216.25 227.5 303.75 1. ]
[ 465. 215. 496.25 271.25 1. ]]
[session 1][epoch 1][iter 200] loss: 0.5355, lr: 1.00e-03
fg/bg=(20/492), time cost: 124.361140
rpn_cls: 0.2848, rpn_box: 0.4464, rcnn_cls: 0.1204, rcnn_box 0.0524
the 0th image have num_boxes: 6 and the gt_boxes are:
[[ 388.75 207.5 402.5 253.75 1. ]
[ 511.25 193.75 523.75 232.5 1. ]
[ 327.5 195. 345. 242.5 1. ]
[ 493.75 193.75 512.5 238.75 1. ]
[ 407.5 195. 425. 252.5 1. ]
[ 210. 190. 236.25 253.75 1. ]]
the 1th image have num_boxes: 5 and the gt_boxes are:
[[ 450. 200. 451.25 251.25 1. ]
[ 506.25 206.25 523.75 250. 1. ]
[ 258.75 202.5 273.75 250. 1. ]
[ 497.5 201.25 512.5 246.25 1. ]
[ 422.5 211.25 435. 252.5 1. ]]
[session 1][epoch 1][iter 300] loss: 0.4500, lr: 1.00e-03
fg/bg=(4/508), time cost: 124.083962
rpn_cls: 0.1161, rpn_box: 0.0771, rcnn_cls: 0.0267, rcnn_box 0.0117
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 720. 243.75 743.75 303.75 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 720. 243.75 743.75 303.75 1. ]]
[session 1][epoch 1][iter 400] loss: 0.4334, lr: 1.00e-03
fg/bg=(3/509), time cost: 123.896316
rpn_cls: 0.2213, rpn_box: 0.0603, rcnn_cls: 0.0627, rcnn_box 0.0038
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 22.5 225. 45. 261.25 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 28.75 230. 37.5 257.5 1. ]]
[session 1][epoch 1][iter 500] loss: 0.4303, lr: 1.00e-03
fg/bg=(19/493), time cost: 124.256176
rpn_cls: 0.1117, rpn_box: 0.2408, rcnn_cls: 0.0595, rcnn_box 0.0675
the 0th image have num_boxes: 3 and the gt_boxes are:
[[ 178.75 205. 208.75 277.5 1. ]
[ 541.25 217.5 558.75 257.5 1. ]
[ 595. 210. 618.75 268.75 1. ]]
the 1th image have num_boxes: 3 and the gt_boxes are:
[[ 183.75 206.25 213.75 277.5 1. ]
[ 592.5 208.75 616.25 267.5 1. ]
[ 541.25 217.5 558.75 257.5 1. ]]
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 124.420586
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 408.75 225. 427.5 247.5 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 380. 233.75 398.75 256.25 1. ]]
[session 1][epoch 1][iter 700] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 124.281587
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 610. 176.25 638.75 248.75 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 606.25 176.25 635. 248.75 1. ]]

I think im_data and gt_boxes are correct before being fed into the net, but during training the "fg/bg" counts are abnormal: the number of fg samples stays very small until "fg/bg" becomes "512/0". I have been stuck on this problem for a week and hope to get some useful advice soon.
Thanks a lot!
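
One way to narrow this down is to stop training at the exact iteration where a loss first becomes NaN and dump that batch. A minimal debugging sketch (variable names follow the trainval_net.py training loop; treat the exact calls as assumptions, e.g. loss.item() needs a recent PyTorch, older versions use loss.data[0]):

    import math

    # Sketch: abort at the first NaN loss so the offending batch can be inspected.
    if math.isnan(loss.item()):
        print("NaN loss at step %d" % step)
        print("gt_boxes:", gt_boxes.cpu().data.numpy())
        print("num_boxes:", num_boxes.cpu().data.numpy())
        raise RuntimeError("training diverged: loss is NaN")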

@jwyang
Owner

jwyang commented Mar 15, 2018

Hi @sdlyaoyuan, I will take a look at this problem and try to solve it as soon as possible.

@xuelin-chen

Hi @jwyang, I had the same issue these days too when I tried to train on my customized dataset. The same thing (NaN) happened when 'fg/bg' is 1024/0. It seems that NaN occurs when there are no bg samples during ROI sampling. Maybe the code never stepped into this case when using a dataset like COCO?

Thanks a lot! :)
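
If that hypothesis is right, a quick way to confirm it is to log the fg/bg counts produced by the ROI sampling step. A minimal check, as a sketch (`labels` is assumed to be the per-ROI class labels after sampling, with 0 meaning background):

    # Sketch: flag batches where no background ROIs were sampled.
    fg_cnt = int((labels > 0).sum())
    bg_cnt = int((labels == 0).sum())
    if bg_cnt == 0:
        print("warning: no background ROIs in this batch (fg/bg = %d/%d)" % (fg_cnt, bg_cnt))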

@xuelin-chen

[screenshot: training log]
This is abnormal, I guess... once bg=0 occurs, it keeps happening for all the following batches...

@JavHaro

JavHaro commented Mar 19, 2018

Hi @jwyang, @ChenXuelinCXL, and @sdlyaoyuan,
I just posted in another thread that I have located the problem. In my case, the problem is in the annotation loading. For some reason, when the minimum values of the annotation (xmin and ymin) are close to 0, they get loaded as 65534 (near the uint16 maximum), so when you compute the areas and calculate xmax - xmin the value comes out negative. I solved it by checking the minimum values after loading.
I hope this helps.
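
For reference, a sketch of that kind of fix at annotation-loading time (names follow the pascal_voc.py parsing code quoted later in this thread; adapt them to your own dataset class):

    # Clamp the raw coordinates before they are stored, so a value near 0 can
    # never become negative and wrap around once it lands in an unsigned array.
    x1 = max(0.0, float(bbox.find('xmin').text) - 1)
    y1 = max(0.0, float(bbox.find('ymin').text) - 1)
    x2 = float(bbox.find('xmax').text) - 1
    y2 = float(bbox.find('ymax').text) - 1
    assert x2 > x1 and y2 > y1, "degenerate box: %s" % [x1, y1, x2, y2]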

@xuelin-chen

Hi @JavHaro, thanks for the help, I will see if that is my case. :)

BTW, I am wondering if anyone's code entered these 'elif' branches during training?
[screenshot: code snippet]

@yuanyao366

@ChenXuelinCXL If you format your customized dataset like Pascal VOC, you can try making the following change in pascal_voc.py:

    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        # Make pixel indexes 0-based
        x1 = float(bbox.find('xmin').text) #- 1
        y1 = float(bbox.find('ymin').text) #- 1
        x2 = float(bbox.find('xmax').text) #- 1
        y2 = float(bbox.find('ymax').text) #- 1

@xuelin-chen

OK... after trying several times, I double-checked my code to make sure the bounding box coordinates are all correct. It is still producing NaN, and bg=0 still happens.

@JavHaro

JavHaro commented Mar 20, 2018

Hi @ChenXuelinCXL,
Where did you check the bounding box coordinates? I ask because, depending on which part of the code you check them in, the values to check will be different. In any case, you can verify that xmin < xmax and ymin < ymax at any point, just to locate the problem.
KR
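
One such check, as a sketch over the loaded roidb (the `roidb` list and its 'boxes' key follow the py-faster-rcnn convention; adjust for your own loader):

    import numpy as np

    # Sketch: print every image whose ground-truth boxes violate x1 < x2, y1 < y2.
    for idx, entry in enumerate(roidb):
        boxes = entry['boxes'].astype(np.float64)
        bad = (boxes[:, 2] <= boxes[:, 0]) | (boxes[:, 3] <= boxes[:, 1])
        if bad.any():
            print("image %d has invalid boxes:" % idx)
            print(boxes[bad])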

@xuelin-chen

@JavHaro I am still locating the bug. I am quite sure that the bounding boxes from my data are correct. Now I got this:
[screenshot: RPN proposal output]
I printed the rois output from the RPN; this is what causes the number of bg to be 0: all proposals from the RPN are almost zero boxes, except for the first few weird boxes.

I'm trying to figure out what causes this. Note that I added a filter in the RPN to assign zero scores to very small proposal boxes, but the RPN still outputs them, which means all proposals from the RPN are almost zero boxes!?

@lolongcovas

I am getting similar issues regarding the foreground bboxes. It is related to the RPN layer.

In my case, I got this error when a ground-truth bbox is wrong, e.g. [-1, -1, 100, 200] where x and y are -1. The bbox representation in this code is uint16, so the -1 values wrap around (overflow) to huge positive numbers.
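
A tiny numpy illustration of that wraparound (not code from this repo, just the casting behavior):

    import numpy as np

    gt = np.array([[-1, -1, 100, 200]], dtype=np.int64)
    boxes_u16 = gt.astype(np.uint16)  # -1 wraps around to 65535
    print(boxes_u16)                  # [[65535 65535   100   200]]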

@Tristacheng

@ChenXuelinCXL Hi, I ran into the same problem as you; have you solved it?

@underfitting

@lolongcovas In my case you are right!

@ahmed-shariff
Contributor

Have you tried reducing the learning rate? I sometimes have to set it as low as 0.00001.
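For example, the command from earlier in this thread with a much smaller learning rate would be: python trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --lr 0.00001 --cuda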

@AkshayLaddha943

Hi, I encountered the NaN issue with my faster_rcnn_resnet50 model. It turned out I was using the Adam optimizer, which led to the values going to NaN. I switched back to SGD with momentum and weight decay (which the original architecture was trained with) and changed my learning rate from 0.01 to 0.0001, and the results were better. Hope this helps.
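
A minimal sketch of that optimizer setup in PyTorch (the momentum and weight-decay values below are the common Faster R-CNN defaults, not taken from this thread, and `model` stands for the detection network being trained):

    import torch.optim as optim

    # SGD with momentum and weight decay, and a reduced learning rate.
    optimizer = optim.SGD(model.parameters(),
                          lr=1e-4,
                          momentum=0.9,
                          weight_decay=5e-4)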
