At training the loss bbox_loss is always zero #266

cyberdecker opened this issue Jul 20, 2016 · 14 comments

@cyberdecker

I'm trying to train a ZF network on a custom dataset (following the instructions from here), using this command:

./tools/train_faster_rcnn_alt_opt.py --gpu 0 \
    --net_name custom --weights data/imagenet_models/ZF.v2.caffemodel \
    --imdb custom_train --cfg config.yml

In the RPN stage the loss looks fine, but when training reaches stage 1 of the Fast R-CNN model, the loss is NaN:

I0720 10:42:34.174381  3892 solver.cpp:228] Iteration 0, loss = nan
I0720 10:42:34.174423  3892 solver.cpp:244]     Train net output #0: bbox_loss = nan (* 1 = nan loss)
I0720 10:42:34.174432  3892 solver.cpp:244]     Train net output #1: cls_loss = 1.04477 (* 1 = 1.04477 loss)
I0720 10:42:34.174438  3892 sgd_solver.cpp:106] Iteration 0, lr = 0.0001
I0720 10:42:42.240824  3892 solver.cpp:228] Iteration 20, loss = nan
I0720 10:42:42.240862  3892 solver.cpp:244]     Train net output #0: bbox_loss = nan (* 1 = nan loss)
I0720 10:42:42.240870  3892 solver.cpp:244]     Train net output #1: cls_loss = 0.130728 (* 1 = 0.130728 loss)

What does this mean? Is the network learning anything, or is it not working at all?

@cyberdecker
Author

After looking into the NaN, I adjusted the learning rate and also switched to a smaller dataset to check whether the network is learning anything. At least I no longer get NaN values in the loss, but bbox_loss is still always zero:

I0721 15:41:54.542963  2262 sgd_solver.cpp:106] Iteration 2420, lr = 0.001
I0721 15:41:56.630101  2262 solver.cpp:228] Iteration 2440, loss = 0.139153
I0721 15:41:56.630146  2262 solver.cpp:244]     Train net output #0: bbox_loss = 0 (* 1 = 0 loss)
I0721 15:41:56.630153  2262 solver.cpp:244]     Train net output #1: cls_loss = 0.139153 (* 1 = 0.139153 loss)
I0721 15:41:56.630161  2262 sgd_solver.cpp:106] Iteration 2440, lr = 0.001
I0721 15:41:58.699415  2262 solver.cpp:228] Iteration 2460, loss = 0.13915
I0721 15:41:58.699458  2262 solver.cpp:244]     Train net output #0: bbox_loss = 0 (* 1 = 0 loss)
I0721 15:41:58.699466  2262 solver.cpp:244]     Train net output #1: cls_loss = 0.13915 (* 1 = 0.13915 loss)
I0721 15:41:58.699473  2262 sgd_solver.cpp:106] Iteration 2460, lr = 0.001
I0721 15:42:00.765265  2262 solver.cpp:228] Iteration 2480, loss = 0.139147
I0721 15:42:00.765313  2262 solver.cpp:244]     Train net output #0: bbox_loss = 0 (* 1 = 0 loss)
I0721 15:42:00.765321  2262 solver.cpp:244]     Train net output #1: cls_loss = 0.139147 (* 1 = 0.139147 loss)

The bbox loss should be different from zero, right?

@cyberdecker cyberdecker changed the title NAN value in loss at training fast rcnn stage At training the loss bbox_loss is always zero Jul 21, 2016
@maxteleg

Hi, I've run into the same issue. Did you figure out how to solve it? Could it have anything to do with the image dataset or the annotations? Thank you!

@cyberdecker
Author

Hi,
I padded the images, because for some reason only cropped images hit this issue; I think it is probably the box coordinates.
So you need to pad the images before training, as in the sketch below.
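
Roughly what I mean by padding, as a quick sketch (my own illustration, not code from the repo; the function name and the fixed border size are assumptions):

import numpy as np
from PIL import Image

def pad_image_and_boxes(img_path, boxes, pad=16):
    """Add a `pad`-pixel border on every side and shift the (x1, y1, x2, y2)
    annotations by the same offset so they still point at the objects."""
    img = Image.open(img_path)
    w, h = img.size
    padded = Image.new(img.mode, (w + 2 * pad, h + 2 * pad))
    padded.paste(img, (pad, pad))
    shifted = np.asarray(boxes, dtype=np.float64) + pad  # x and y both move by pad
    return padded, shifted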

@maxteleg

Thank you! I solved it. There were some negative bbox values in my dataset; I just removed those samples and everything works fine now. Something like the sketch below is enough to catch them.
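
In case it is useful to anyone else, this is roughly the kind of check I ran over my annotations before training (my own sketch; the function name and the exact validity rules are assumptions, not part of py-faster-rcnn):

import numpy as np

def keep_valid_boxes(boxes, width, height):
    """Drop (x1, y1, x2, y2) boxes that go negative, fall outside the image,
    or have non-positive width/height."""
    boxes = np.asarray(boxes, dtype=np.float64)
    ok = ((boxes[:, 0] >= 0) & (boxes[:, 1] >= 0) &
          (boxes[:, 2] < width) & (boxes[:, 3] < height) &
          (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1]))
    return boxes[ok]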

@Mato98

Mato98 commented Nov 29, 2016

Hi, what is meant by "pad the images"? During my training the values for bbox_loss and rpn_loss_bbox are always 0. I tried different datasets, but nothing changed. Do these values have to be different from 0?

[image attachment]

Results with this model are not satisfactory: very big boxes and low scores for the classes!

Any advice?

@jinyu121

jinyu121 commented Jan 8, 2017

Hi~ I have the same problem, or even worse.

Because I cannot use a GPU, I set up the environment following this blog and this blog, and ran ./experiments/scripts/faster_rcnn_alt_opt.sh 0 VGG16 pascal_voc (you can ignore the GPU_ID = 0).

This is the training log:

Solving...
I0108 19:52:36.975473  3821 solver.cpp:229] Iteration 0, loss = 1.35416
I0108 19:52:36.975520  3821 solver.cpp:245]     Train net output #0: rpn_cls_loss = 0.725054 (* 1 = 0.725054 loss)
I0108 19:52:36.975533  3821 solver.cpp:245]     Train net output #1: rpn_loss_bbox = 0.629103 (* 1 = 0.629103 loss)
I0108 19:52:36.975540  3821 sgd_solver.cpp:106] Iteration 0, lr = 0.001
I0108 20:02:41.254108  3821 solver.cpp:229] Iteration 20, loss = -nan
I0108 20:02:41.254150  3821 solver.cpp:245]     Train net output #0: rpn_cls_loss = -nan (* 1 = -nan loss)
I0108 20:02:41.254163  3821 solver.cpp:245]     Train net output #1: rpn_loss_bbox = -nan (* 1 = -nan loss)
I0108 20:02:41.254170  3821 sgd_solver.cpp:106] Iteration 20, lr = 0.001
I0108 20:12:35.014883  3821 solver.cpp:229] Iteration 40, loss = -nan
I0108 20:12:35.014927  3821 solver.cpp:245]     Train net output #0: rpn_cls_loss = -nan (* 1 = -nan loss)
I0108 20:12:35.014940  3821 solver.cpp:245]     Train net output #1: rpn_loss_bbox = -nan (* 1 = -nan loss)

So, how can I fix it? Could it be related to the files modified for CPU mode?

UPDATE: I changed the base_lr to 0.00001 (1e-5), and it worked.
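
In case anyone wants to script that change, here is a rough sketch using Caffe's Python protobuf bindings (the solver path below is only my assumption about which stage solver your run loads; editing the file by hand works just as well):

from caffe.proto import caffe_pb2
from google.protobuf import text_format

# Assumed path: point this at whichever stage solver your training script actually uses.
solver_path = 'models/pascal_voc/VGG16/faster_rcnn_alt_opt/stage1_rpn_solver60k80k.pt'

solver = caffe_pb2.SolverParameter()
with open(solver_path) as f:
    text_format.Merge(f.read(), solver)

solver.base_lr = 1e-5  # lowered from the default 0.001
with open(solver_path, 'w') as f:
    f.write(text_format.MessageToString(solver))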

@Soda-Wong

Soda-Wong commented Mar 23, 2017

Hi~
@jinyu121 @Max-intel @cyberdecker @Mato98
I am hitting the same problem too: my results and AP are zero. I use the VOC2007 image data, so I don't think "pad the images" is the issue. And why does changing base_lr to 0.00001 (1e-5) help?
Any advice? Thanks!

@jinyu121

If you are using CPU mode, please use this pull request.

@Soda-Wong

@jinyu121 thank you~
But I am using GPU mode, and after changing the base_lr to 1e-5 the results are no longer 0, but something like 0.0004.

@nmahesh01

@Mato98 @Soda-Wong Hi, I am facing the exact same issue with bbox_loss = 0. Even using an LR of 0.00005 doesn't change it! Any ideas or solutions?
I am using VOC2007 + VOC2012 training data.

@starxhong

Has anybody solved this problem? I hit the issue when I run R-FCN training with OHEM: loss_bbox = 0 (* 1 = 0 loss) from the very first iteration. However, when I run the code without OHEM everything is fine and I get a mAP of 0.78, so I'm sure it has nothing to do with my data annotations. What else can cause this?

@mukeshmithrakumar

Having the same issue; I've narrowed it down to the anchor boxes. The IoU between the anchor boxes and the ground truth is always zero, so every anchor gets labelled as negative and the bbox loss ends up zero (a quick check is sketched below). Still looking for a solution; I'd appreciate it if you guys found one. @VersionHX
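
This is the kind of quick diagnostic I used (my own helper, not the repo's bbox_overlaps; boxes are (x1, y1, x2, y2) and the example values are made up):

import numpy as np

def iou_matrix(anchors, gt_boxes):
    # IoU between every anchor and every ground-truth box.
    ious = np.zeros((len(anchors), len(gt_boxes)))
    for i, a in enumerate(anchors):
        for j, g in enumerate(gt_boxes):
            ix1, iy1 = max(a[0], g[0]), max(a[1], g[1])
            ix2, iy2 = min(a[2], g[2]), min(a[3], g[3])
            inter = max(ix2 - ix1 + 1, 0.0) * max(iy2 - iy1 + 1, 0.0)
            area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
            area_g = (g[2] - g[0] + 1) * (g[3] - g[1] + 1)
            ious[i, j] = inter / (area_a + area_g - inter)
    return ious

# If this maximum never reaches the foreground IoU threshold, every sampled RoI
# is background and the bbox regression loss stays exactly 0.
anchors = np.array([[0, 0, 15, 15], [8, 8, 23, 23]], dtype=np.float64)
gt_boxes = np.array([[10, 10, 30, 30]], dtype=np.float64)
print(iou_matrix(anchors, gt_boxes).max())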

@starxhong

@mukeshmithrakumar Yeah, I solved the problem with the solution here. It seems to be a version conflict with Numpy.
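
For anyone who can't follow that link, one quick sanity check (just my rough guess at where the behaviour changed, not the actual patch) is to confirm which Numpy the training process really imports:

import numpy as np
from distutils.version import LooseVersion

print(np.__version__)
if LooseVersion(np.__version__) >= LooseVersion('1.12'):
    print('Recent numpy; consider pinning an older release if bbox_loss stays at 0.')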

@mukeshmithrakumar

Thanks @VersionHX
