Training Loss : Nan #79

Open
xiaoxingzeng opened this issue Feb 23, 2018 · 21 comments

Comments
@xiaoxingzeng

My training loss always becomes NaN after several hundred iterations.
All parameters are default. My training dataset works with py-faster-rcnn, and I copied it into the faster-rcnn.pytorch directory.
My training command is: python trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --lr 0.001 --cuda

Is there any advice for this? Thanks.

@xiaoxingzeng
Author

xiaoxingzeng commented Feb 23, 2018

This is my training print-out:
[session 1][epoch 1][iter 0] loss: 6.3588, lr: 1.00e-03
fg/bg=(20/236), time cost: 1.659386
rpn_cls: 0.8163, rpn_box: 4.6926, rcnn_cls: 0.8488, rcnn_box 0.0010
[session 1][epoch 1][iter 100] loss: 1.0697, lr: 1.00e-03
fg/bg=(24/232), time cost: 33.015444
rpn_cls: 0.1404, rpn_box: 0.6121, rcnn_cls: 0.2425, rcnn_box 0.1688
[session 1][epoch 1][iter 200] loss: 0.7961, lr: 1.00e-03
fg/bg=(43/213), time cost: 33.076333
rpn_cls: 0.1488, rpn_box: 1.1833, rcnn_cls: 0.3630, rcnn_box 0.2185
[session 1][epoch 1][iter 300] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 33.628527
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 400] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 32.910808
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 500] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 32.843017
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 32.721040
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 700] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 33.876777
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 800] loss: nan, lr: 1.00e-03
fg/bg=(256/0), time cost: 33.819963
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan

@JavHaro

JavHaro commented Mar 1, 2018

Hi @xiaoxingzeng ,
I have the same problem but sadly I still have no answer. Are you using a custom dataset? Maybe some of the answers in this issue can help you.
Good luck!

@zeehasham

@JavHaro @xiaoxingzeng I am having the same issue of NaN values with a custom dataset. Did you find a solution? Thanks

@yuanyao366

I have the same problem with my own dataset. Did you find a solution? Thanks a lot!

[session 1][epoch 1][iter 5400/33021] loss: 0.3792, lr: 1.00e-03
fg/bg=(113/399), time cost: 129.585304
rpn_cls: 0.0753, rpn_box: 0.2207, rcnn_cls: 0.1785, rcnn_box 0.2806
[session 1][epoch 1][iter 5500/33021] loss: 0.3538, lr: 1.00e-03
fg/bg=(42/470), time cost: 129.476077
rpn_cls: 0.0916, rpn_box: 0.1274, rcnn_cls: 0.0800, rcnn_box 0.1175
[session 1][epoch 1][iter 5600/33021] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 122.264345
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 5700/33021] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 119.232911
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan

@jwyang
Owner

jwyang commented Mar 14, 2018

Hi all, if possible it would be good to share your data loaders with us, so that we can check why this happened.

@yuanyao366

@jwyang I use the Caltech dataset for pedestrian detection. When training, I print num_boxes and gt_boxes right after the "print loss" block, like this:

    print("[session %d][epoch %2d][iter %4d] loss: %.4f, lr: %.2e"
          % (args.session, epoch, step, loss_temp, lr))
    print("\t\t\tfg/bg=(%d/%d), time cost: %f" % (fg_cnt, bg_cnt, end-start))
    print("\t\t\trpn_cls: %.4f, rpn_box: %.4f, rcnn_cls: %.4f, rcnn_box %.4f"
          % (loss_rpn_cls, loss_rpn_box, loss_rcnn_cls, loss_rcnn_box))

    # dump the ground-truth boxes that were fed into the network
    gt_boxes_cpu = gt_boxes.cpu().data.numpy()
    im_data_cpu = im_data.cpu().data
    for i in range(args.batch_size):
        num_gt = num_boxes.data[i]
        print("the %dth image have num_boxes: %d and the gt_boxes are:" % (i, num_gt))
        print(gt_boxes_cpu[i][:num_gt])
        img = im_data_cpu[i].permute(1, 2, 0).numpy()
        #_vis_minibatch(img, gt_boxes_cpu[i][:num_gt])

I even visualize the image and gt_boxes like this:

    def _vis_minibatch(im_blob, rois_blob):
        """Visualize a mini-batch for debugging."""
        import matplotlib.pyplot as plt
        for i in xrange(rois_blob.shape[0]):
            rois = rois_blob[i, :]
            #im_ind = rois[0]
            roi = rois[:4]
            im = im_blob[:, :, :].copy()
            im += cfg.PIXEL_MEANS  # undo the mean subtraction before display
            im = im[:, :, (2, 1, 0)]  # BGR -> RGB
            im = im.astype(np.uint8)
            plt.imshow(im)
            plt.gca().add_patch(
                plt.Rectangle((roi[0], roi[1]), roi[2] - roi[0],
                              roi[3] - roi[1], fill=False,
                              edgecolor='r', linewidth=3)
            )
            plt.show()

@yuanyao366

My training command is: python trainval_net.py --dataset caltech --net vgg16 --bs 2 --gpu 0 --cuda
This is my training print-out:

[session 1][epoch 1][iter 100] loss: 0.6892, lr: 1.00e-03
fg/bg=(19/493), time cost: 125.878273
rpn_cls: 0.2557, rpn_box: 0.4797, rcnn_cls: 0.1704, rcnn_box 0.0634
the 0th image have num_boxes: 3 and the gt_boxes are:
[[ 465. 215. 496.25 271.25 1. ]
[ 536.25 206.25 561.25 255. 1. ]
[ 168.75 216.25 218.75 303.75 1. ]]
the 1th image have num_boxes: 3 and the gt_boxes are:
[[ 548.75 210. 550. 238.75 1. ]
[ 175. 216.25 227.5 303.75 1. ]
[ 465. 215. 496.25 271.25 1. ]]
[session 1][epoch 1][iter 200] loss: 0.5355, lr: 1.00e-03
fg/bg=(20/492), time cost: 124.361140
rpn_cls: 0.2848, rpn_box: 0.4464, rcnn_cls: 0.1204, rcnn_box 0.0524
the 0th image have num_boxes: 6 and the gt_boxes are:
[[ 388.75 207.5 402.5 253.75 1. ]
[ 511.25 193.75 523.75 232.5 1. ]
[ 327.5 195. 345. 242.5 1. ]
[ 493.75 193.75 512.5 238.75 1. ]
[ 407.5 195. 425. 252.5 1. ]
[ 210. 190. 236.25 253.75 1. ]]
the 1th image have num_boxes: 5 and the gt_boxes are:
[[ 450. 200. 451.25 251.25 1. ]
[ 506.25 206.25 523.75 250. 1. ]
[ 258.75 202.5 273.75 250. 1. ]
[ 497.5 201.25 512.5 246.25 1. ]
[ 422.5 211.25 435. 252.5 1. ]]
[session 1][epoch 1][iter 300] loss: 0.4500, lr: 1.00e-03
fg/bg=(4/508), time cost: 124.083962
rpn_cls: 0.1161, rpn_box: 0.0771, rcnn_cls: 0.0267, rcnn_box 0.0117
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 720. 243.75 743.75 303.75 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 720. 243.75 743.75 303.75 1. ]]
[session 1][epoch 1][iter 400] loss: 0.4334, lr: 1.00e-03
fg/bg=(3/509), time cost: 123.896316
rpn_cls: 0.2213, rpn_box: 0.0603, rcnn_cls: 0.0627, rcnn_box 0.0038
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 22.5 225. 45. 261.25 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 28.75 230. 37.5 257.5 1. ]]
[session 1][epoch 1][iter 500] loss: 0.4303, lr: 1.00e-03
fg/bg=(19/493), time cost: 124.256176
rpn_cls: 0.1117, rpn_box: 0.2408, rcnn_cls: 0.0595, rcnn_box 0.0675
the 0th image have num_boxes: 3 and the gt_boxes are:
[[ 178.75 205. 208.75 277.5 1. ]
[ 541.25 217.5 558.75 257.5 1. ]
[ 595. 210. 618.75 268.75 1. ]]
the 1th image have num_boxes: 3 and the gt_boxes are:
[[ 183.75 206.25 213.75 277.5 1. ]
[ 592.5 208.75 616.25 267.5 1. ]
[ 541.25 217.5 558.75 257.5 1. ]]
[session 1][epoch 1][iter 600] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 124.420586
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 408.75 225. 427.5 247.5 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 380. 233.75 398.75 256.25 1. ]]
[session 1][epoch 1][iter 700] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 124.281587
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
the 0th image have num_boxes: 1 and the gt_boxes are:
[[ 610. 176.25 638.75 248.75 1. ]]
the 1th image have num_boxes: 1 and the gt_boxes are:
[[ 606.25 176.25 635. 248.75 1. ]]

I think im_data and gt_boxes are correct before being fed into the net, but during training the "fg/bg" counts are abnormal: the number of fg samples stays very small until "fg/bg" becomes "512/0". I have been stuck on this problem for a week and hope to get some useful advice soon.
Thanks a lot!
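
One way to narrow this down is to stop training at the exact iteration where a loss first becomes NaN and dump that batch. A minimal debugging sketch (variable names follow the trainval_net.py training loop; treat the exact calls as assumptions, e.g. loss.item() needs a recent PyTorch, older versions use loss.data[0]):

    import math

    # Sketch: abort at the first NaN loss so the offending batch can be inspected.
    if math.isnan(loss.item()):
        print("NaN loss at step %d" % step)
        print("gt_boxes:", gt_boxes.cpu().data.numpy())
        print("num_boxes:", num_boxes.cpu().data.numpy())
        raise RuntimeError("training diverged: loss is NaN")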

@jwyang
Owner

jwyang commented Mar 15, 2018

Hi @sdlyaoyuan, I will take a look at this problem and try to solve it as soon as possible.

@xuelin-chen

Hi @jwyang, I had the same issue these days too when I tried to train on my customized dataset. The same thing (NaN) happened when 'fg/bg' is 1024/0. It seems that NaN occurs when there are no bg samples during ROI sampling. Maybe the code never stepped into this case when using a dataset like COCO?

Thanks a lot! :)
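
If that hypothesis is right, a quick way to confirm it is to log the fg/bg counts produced by the ROI sampling step. A minimal check, as a sketch (`labels` is assumed to be the per-ROI class labels after sampling, with 0 meaning background):

    # Sketch: flag batches where no background ROIs were sampled.
    fg_cnt = int((labels > 0).sum())
    bg_cnt = int((labels == 0).sum())
    if bg_cnt == 0:
        print("warning: no background ROIs in this batch (fg/bg = %d/%d)" % (fg_cnt, bg_cnt))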

@xuelin-chen

[screenshot: training log]
This is abnormal, I guess... once bg=0 occurs, it keeps happening for all the following batches...

@JavHaro

JavHaro commented Mar 19, 2018

Hi @jwyang, @ChenXuelinCXL, and @sdlyaoyuan,
I just posted in another thread that I have located the problem. In my case, the problem is in the annotation loading. For some reason, when the minimum values of the annotation (xmin and ymin) are close to 0, they get loaded as 65534 (near the uint16 maximum), so when you compute the areas and calculate xmax - xmin the value comes out negative. I solved it by checking the minimum values after loading.
I hope this helps.
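
For reference, a sketch of that kind of fix at annotation-loading time (names follow the pascal_voc.py parsing code quoted later in this thread; adapt them to your own dataset class):

    # Clamp the raw coordinates before they are stored, so a value near 0 can
    # never become negative and wrap around once it lands in an unsigned array.
    x1 = max(0.0, float(bbox.find('xmin').text) - 1)
    y1 = max(0.0, float(bbox.find('ymin').text) - 1)
    x2 = float(bbox.find('xmax').text) - 1
    y2 = float(bbox.find('ymax').text) - 1
    assert x2 > x1 and y2 > y1, "degenerate box: %s" % [x1, y1, x2, y2]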

@xuelin-chen

Hi @JavHaro, thanks for the help, I will see if that is my case. :)

BTW, I am wondering if anyone's code entered these 'elif' branches during training?
[screenshot: code snippet]

@yuanyao366

@ChenXuelinCXL If you format your customized dataset like Pascal VOC, you can try making the following change in pascal_voc.py:

    for ix, obj in enumerate(objs):
        bbox = obj.find('bndbox')
        # Make pixel indexes 0-based
        x1 = float(bbox.find('xmin').text) #- 1
        y1 = float(bbox.find('ymin').text) #- 1
        x2 = float(bbox.find('xmax').text) #- 1
        y2 = float(bbox.find('ymax').text) #- 1

@xuelin-chen

OK... after trying several times, I double-checked my code to make sure the bounding box coordinates are all correct. It is still producing NaN, and bg=0 still happens.

@JavHaro

JavHaro commented Mar 20, 2018

Hi @ChenXuelinCXL,
Where did you check the bounding box coordinates? I ask because, depending on which part of the code you check them in, the values to check will be different. In any case, you can verify that xmin < xmax and ymin < ymax at any point, just to locate the problem.
KR
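
One such check, as a sketch over the loaded roidb (the `roidb` list and its 'boxes' key follow the py-faster-rcnn convention; adjust for your own loader):

    import numpy as np

    # Sketch: print every image whose ground-truth boxes violate x1 < x2, y1 < y2.
    for idx, entry in enumerate(roidb):
        boxes = entry['boxes'].astype(np.float64)
        bad = (boxes[:, 2] <= boxes[:, 0]) | (boxes[:, 3] <= boxes[:, 1])
        if bad.any():
            print("image %d has invalid boxes:" % idx)
            print(boxes[bad])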

@xuelin-chen

@JavHaro I am still locating the bug. I am quite sure that the bounding boxes from my data are correct. Now I got this:
[screenshot: RPN proposal output]
I printed the rois output from the RPN; this is what causes the number of bg to be 0: all proposals from the RPN are almost zero boxes, except for the first few weird boxes.

I'm trying to figure out what causes this. Note that I added a filter in the RPN to assign zero scores to very small proposal boxes, but the RPN still outputs them, which means all proposals from the RPN are almost zero boxes!?

@lolongcovas

I am getting similar issues regarding the foreground bboxes. It is related to the RPN layer.

In my case, I got this error when a ground-truth bbox is wrong, e.g. [-1, -1, 100, 200] where x and y are -1. The bbox representation in this code is uint16, so the -1 values wrap around (overflow) to huge positive numbers.
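
A tiny numpy illustration of that wraparound (not code from this repo, just the casting behavior):

    import numpy as np

    gt = np.array([[-1, -1, 100, 200]], dtype=np.int64)
    boxes_u16 = gt.astype(np.uint16)  # -1 wraps around to 65535
    print(boxes_u16)                  # [[65535 65535   100   200]]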

@Tristacheng

@ChenXuelinCXL Hi, I ran into the same problem as you; have you solved it?

@underfitting

@lolongcovas In my case you are right!

@ahmed-shariff
Contributor

Have you tried reducing the learning rate? I sometimes have to set it as low as 0.00001.
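For example, the command from earlier in this thread with a much smaller learning rate would be: python trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --lr 0.00001 --cuda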

@AkshayLaddha943

Hi, I encountered the NaN issue with my faster_rcnn_resnet50 model. It turned out I was using the Adam optimizer, which led to the values going to NaN. I switched back to SGD with momentum and weight decay (which the original architecture was trained with) and changed my learning rate from 0.01 to 0.0001, and the results were better. Hope this helps.
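
A minimal sketch of that optimizer setup in PyTorch (the momentum and weight-decay values below are the common Faster R-CNN defaults, not taken from this thread, and `model` stands for the detection network being trained):

    import torch.optim as optim

    # SGD with momentum and weight decay, and a reduced learning rate.
    optimizer = optim.SGD(model.parameters(),
                          lr=1e-4,
                          momentum=0.9,
                          weight_decay=5e-4)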
