Training Loss: NaN #79
This is my training print-out:
Hi @xiaoxingzeng,
@JavHaro @xiaoxingzeng I am having the same issue of NaN values with a custom dataset. Did you find the solution? Thanks
I have the same problem with my own dataset. Did you find the solution? Thanks a lot!
[session 1][epoch 1][iter 5400/33021] loss: 0.3792, lr: 1.00e-03
Hi all, if possible, it would be good to share your data loaders with us so that we can check why this happened.
@jwyang I use the Caltech dataset for pedestrian detection. When training, I print num_boxes and gt_boxes under "print loss" like this:
I even visualize the image and gt_boxes like this:
My training command is: python trainval_net.py --dataset caltech --net vgg16 --bs 2 --gpu 0 --cuda
[session 1][epoch 1][iter 100] loss: 0.6892, lr: 1.00e-03
I think the im_data and gt_boxes are correct before being sent into the net, but during training the "fg/bg" count is abnormal: the number of "fg" is too small, and eventually "fg/bg" becomes "512/0". I have been confused by this problem for a week and I hope to get some useful advice as soon as possible.
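A minimal sanity check along these lines, assuming a py-faster-rcnn style roidb (a list of dicts with a 'boxes' array in (x1, y1, x2, y2) order and an 'image' path — these field names are assumptions, not necessarily this loader's exact structure):

```python
# Hypothetical sanity check for ground-truth boxes before training.
# Assumes a py-faster-rcnn style roidb: a list of dicts with a 'boxes'
# array of shape (N, 4) in (x1, y1, x2, y2) order and an 'image' path.
import numpy as np

def check_roidb(roidb):
    for i, entry in enumerate(roidb):
        boxes = np.asarray(entry['boxes'], dtype=np.float64)
        if boxes.size == 0:
            print('entry %d (%s): no boxes' % (i, entry.get('image', '?')))
            continue
        bad_coord = (boxes < 0).any(axis=1)  # negative coordinates
        bad_order = (boxes[:, 2] <= boxes[:, 0]) | (boxes[:, 3] <= boxes[:, 1])  # x2<=x1 or y2<=y1
        if bad_coord.any() or bad_order.any():
            print('entry %d (%s): invalid boxes\n%s'
                  % (i, entry.get('image', '?'), boxes[bad_coord | bad_order]))
```

Running such a check once over the roidb before starting trainval_net.py can surface degenerate boxes that tend to drive the RPN losses to NaN.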
Hi @sdlyaoyuan, I will take a look at this problem and try to solve it as soon as possible.
Hi @jwyang, I had the same issue these days when I tried to train on my customized dataset. The same thing (NaN) happened when 'fg/bg' was 1024/0. It seems like NaN occurs when there are no bg samples during ROI sampling. Maybe the code never stepped into this case with datasets like COCO? Thanks a lot! :)
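A hedged sketch of a check on the sampled ROI labels; the name `rois_label` and the label convention (label > 0 for foreground, 0 for background) are assumptions about the sampler's output, not the repo's exact API:

```python
# Hypothetical check on the labels produced by ROI sampling.
# Assumes 'rois_label' is the label tensor for the sampled ROIs,
# where label > 0 means foreground and label == 0 means background.
import torch

def check_fg_bg(rois_label):
    fg = int((rois_label > 0).sum().item())
    bg = int((rois_label == 0).sum().item())
    if fg == 0 or bg == 0:
        print('WARNING: degenerate sampling, fg=%d bg=%d' % (fg, bg))
    return fg, bg
```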
Hi @jwyang, @ChenXuelinCXL, and @sdlyaoyuan
Hi @JavHaro, thanks for the help, I will see if that is my case. :) BTW, I am wondering if any of you had your code enter these 'elif' branches when training?
@ChenXuelinCXL If you put your customized dataset in Pascal VOC format, you can try making a change like this in pascal_voc.py:
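The exact snippet is not preserved above. A commonly suggested edit for this file, assuming 0-based custom annotations, is to skip the -1 shift that the stock Pascal VOC parser applies to 1-based coordinates (that shift produces -1 values for boxes starting at 0). The sketch below illustrates that idea; it is an assumption, not necessarily the change @JavHaro posted:

```python
# Illustrative version of the box parsing in lib/datasets/pascal_voc.py.
# Pascal VOC annotations are 1-based, so the stock loader subtracts 1 from
# the coordinates; with 0-based custom annotations this yields -1 values.
import xml.etree.ElementTree as ET
import numpy as np

def load_boxes(xml_path, one_based=True):
    """Parse (x1, y1, x2, y2) boxes; skip the -1 shift for 0-based data."""
    offset = 1.0 if one_based else 0.0
    tree = ET.parse(xml_path)
    boxes = []
    for obj in tree.findall('object'):
        bbox = obj.find('bndbox')
        x1 = float(bbox.find('xmin').text) - offset
        y1 = float(bbox.find('ymin').text) - offset
        x2 = float(bbox.find('xmax').text) - offset
        y2 = float(bbox.find('ymax').text) - offset
        boxes.append([x1, y1, x2, y2])
    return np.array(boxes, dtype=np.float32).reshape(-1, 4)
```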
OK... after trying several times and double-checking my code to make sure the bounding box coordinates are all correct, it is still producing NaN. The bg=0 case still happens.
Hi @ChenXuelinCXL
@JavHaro I am still locating the bug. I am very sure that the bounding boxes from my data are correct. Now I got this: trying to figure out what causes it. Note that I added a filter to assign zero scores to very small proposal boxes in the RPN, but the RPN still outputs them, which means all proposals from the RPN are almost zero-sized boxes!?
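One alternative to zeroing the scores (which still lets the boxes survive ranking and NMS) is to drop tiny proposals outright. A sketch, assuming proposals are an (N, 4) tensor of (x1, y1, x2, y2) boxes with matching scores; the names and `min_size` value are illustrative, not the repo's API:

```python
# Drop proposals whose width or height falls below a minimum size,
# instead of only zeroing their scores. Names are illustrative.
import torch

def filter_small_proposals(proposals, scores, min_size=16.0):
    widths = proposals[:, 2] - proposals[:, 0] + 1.0
    heights = proposals[:, 3] - proposals[:, 1] + 1.0
    keep = (widths >= min_size) & (heights >= min_size)
    return proposals[keep], scores[keep]
```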
I am getting similar issues with the foreground bboxes; it is related to the RPN layer. In my case, I got this error when the GT bbox was invalid, like [-1, -1, 100, 200], where the coordinates are negative.
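A small sketch of clipping and filtering such boxes before they reach the network, assuming (x1, y1, x2, y2) NumPy boxes and a known image size; the function name and layout are illustrative:

```python
# Clip ground-truth boxes to the image and drop boxes that collapse
# to zero area, so negative coordinates like [-1, -1, 100, 200]
# cannot reach the RPN. Names are illustrative.
import numpy as np

def clip_gt_boxes(boxes, height, width):
    boxes = np.asarray(boxes, dtype=np.float32).copy()
    boxes[:, 0] = np.clip(boxes[:, 0], 0, width - 1)   # x1
    boxes[:, 1] = np.clip(boxes[:, 1], 0, height - 1)  # y1
    boxes[:, 2] = np.clip(boxes[:, 2], 0, width - 1)   # x2
    boxes[:, 3] = np.clip(boxes[:, 3], 0, height - 1)  # y2
    keep = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
    return boxes[keep]
```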
@ChenXuelinCXL Hi, I met the same problem as yours, have you solved it?
@lolongcovas In my case you are right!
Have you tried reducing the learning rate? I sometimes have to set it as low as 0.00001.
Hi, I encountered the NaN issue with my faster_rcnn_resnet50 model. It turns out I was using the Adam optimizer, which led to the values going to NaN. I changed it back to SGD with momentum and weight decay (with which the original architecture was trained) and lowered my learning rate from 0.01 to 0.0001, and the results were better. Hope this helps.
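A sketch of that optimizer change in PyTorch; the model is a stand-in and the exact momentum and weight-decay values are illustrative, not taken from the comment:

```python
# Switch from Adam back to SGD with momentum and weight decay,
# and lower the learning rate, as described above.
import torch

model = torch.nn.Linear(10, 4)   # stand-in for the detection network
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=1e-4,            # dropped from 1e-2, as in the comment above
    momentum=0.9,
    weight_decay=5e-4,
)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)  # reported to diverge to NaN
```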
My training loss always becomes NaN after several hundred iterations. All parameters are default. My training dataset is usable with py-faster-rcnn, and I copied it to the faster-rcnn.pytorch directory.
My training command is: python trainval_net.py --dataset pascal_voc --net vgg16 --bs 1 --lr 0.001 --cuda
Is there any advice for this? Thanks.
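One way to localize where the failure starts is to stop at the first non-finite loss; a minimal sketch, assuming a scalar `loss` tensor computed each iteration of the training loop (the names are placeholders, not the repo's code):

```python
# Raise at the first iteration whose loss is NaN or Inf, so the
# offending batch can be inspected. 'loss' and 'step' are placeholders.
import torch

def assert_finite(loss, step):
    if not torch.isfinite(loss).all():
        raise RuntimeError('non-finite loss %r at iteration %d' % (loss.item(), step))
```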