
Floating Point Exception #740

Open
wubaorong opened this issue Nov 23, 2017 · 18 comments

Comments

@wubaorong

When I train the network, it breaks down after printing this log:
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Floating Point Exception

@karaspd

karaspd commented Nov 30, 2017

@wubaorong Hi, I am getting the same error. Did you have a chance to solve it?

@wubaorong
Author

@karaspd I'm sorry, I still haven't found a solution.

@karaspd

karaspd commented Dec 5, 2017

@wubaorong Hi, I finally found the issue: in my case it was the train prototxt! (Strangely, I don't see any difference inside the prototxt after updating it.) I had tested with PASCAL VOC first and it worked, but I had issues with the new dataset. I checked all the annotation files and even modified the dataset-loading code, but in the end it was not about the data.

@wubaorong
Author

@karaspd Even when I train Faster R-CNN with the VOC2007 dataset, I get the same error. Do you know the reason?

@wubaorong
Author

@karaspd When I use train_net.py to train the model, this is the result:
Solving...
I1120 20:15:42.906991 3777 solver.cpp:229] Iteration 0, loss = 4.52421
I1120 20:15:42.907033 3777 solver.cpp:245] Train net output #0: bbox_loss = 0.024606 (* 1 = 0.024606 loss)
I1120 20:15:42.907042 3777 solver.cpp:245] Train net output #1: cls_loss = 3.55821 (* 1 = 3.55821 loss)
I1120 20:15:42.907047 3777 solver.cpp:245] Train net output #2: rpn_cls_loss = 0.683348 (* 1 = 0.683348 loss)
I1120 20:15:42.907052 3777 solver.cpp:245] Train net output #3: rpn_loss_bbox = 0.236713 (* 1 = 0.236713 loss)
I1120 20:15:42.907059 3777 sgd_solver.cpp:106] Iteration 0, lr = 0.001
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
It seems "dw, dh" become too big, but I still don't know why. Can you help me?
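For later readers: the overflow itself can be suppressed independently of the prototxt fix. Later forks of py-faster-rcnn cap the predicted log-scale deltas before exponentiation (the constant name BBOX_XFORM_CLIP follows that convention); this is a hedged sketch of the idea, not the repo's exact code:

```python
import numpy as np

# Hedged sketch: cap dw/dh at log(1000/16) before np.exp, so the
# "overflow encountered in exp" warning (and the NaNs that follow it
# into proposal_layer.py) cannot occur.
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def clip_deltas(dw, dh):
    # exp(log(1000/16)) = 62.5, well within float32 range,
    # whereas exp(200.0) overflows to inf.
    dw = np.minimum(dw, BBOX_XFORM_CLIP)
    dh = np.minimum(dh, BBOX_XFORM_CLIP)
    return dw, dh

dw = np.array([0.3, 200.0])   # 200.0 would overflow np.exp
dh = np.array([0.1, 150.0])
dw, dh = clip_deltas(dw, dh)
pred_w = np.exp(dw) * 16.0    # finite for every entry now
pred_h = np.exp(dh) * 16.0
```

A large delta now saturates at a 62.5x scale factor instead of producing inf/NaN boxes that crash the NMS step.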

@karaspd

karaspd commented Dec 7, 2017

@wubaorong May I see your train.prototxt file?
My issue was caused by the last layer in train.prototxt:
layer {
name: "accuracy"
type: "Accuracy"
bottom: "cls_score"
bottom: "labels"
top: "accuracy"
}
I removed this layer and now it is working.

@wubaorong
Author

@karaspd I want to train Faster R-CNN, so I didn't alter any file; I downloaded the train.prototxt file from GitHub directly. The following is my train.prototxt:
name: "ZF"
layer {
name: 'input-data'
type: 'Python'
top: 'data'
top: 'im_info'
top: 'gt_boxes'
python_param {
module: 'roi_data_layer.layer'
layer: 'RoIDataLayer'
param_str: "'num_classes': 21"
}
}

#========= conv1-conv5 ============

layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 96
kernel_size: 7
pad: 3
stride: 2
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "conv1"
top: "conv1"
}
layer {
name: "norm1"
type: "LRN"
bottom: "conv1"
top: "norm1"
lrn_param {
local_size: 3
alpha: 0.00005
beta: 0.75
norm_region: WITHIN_CHANNEL
engine: CAFFE
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "norm1"
top: "pool1"
pooling_param {
kernel_size: 3
stride: 2
pad: 1
pool: MAX
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 256
kernel_size: 5
pad: 2
stride: 2
}
}
layer {
name: "relu2"
type: "ReLU"
bottom: "conv2"
top: "conv2"
}
layer {
name: "norm2"
type: "LRN"
bottom: "conv2"
top: "norm2"
lrn_param {
local_size: 3
alpha: 0.00005
beta: 0.75
norm_region: WITHIN_CHANNEL
engine: CAFFE
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "norm2"
top: "pool2"
pooling_param {
kernel_size: 3
stride: 2
pad: 1
pool: MAX
}
}
layer {
name: "conv3"
type: "Convolution"
bottom: "pool2"
top: "conv3"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 384
kernel_size: 3
pad: 1
stride: 1
}
}
layer {
name: "relu3"
type: "ReLU"
bottom: "conv3"
top: "conv3"
}
layer {
name: "conv4"
type: "Convolution"
bottom: "conv3"
top: "conv4"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 384
kernel_size: 3
pad: 1
stride: 1
}
}
layer {
name: "relu4"
type: "ReLU"
bottom: "conv4"
top: "conv4"
}
layer {
name: "conv5"
type: "Convolution"
bottom: "conv4"
top: "conv5"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 256
kernel_size: 3
pad: 1
stride: 1
}
}
layer {
name: "relu5"
type: "ReLU"
bottom: "conv5"
top: "conv5"
}

#========= RPN ============

layer {
name: "rpn_conv/3x3"
type: "Convolution"
bottom: "conv5"
top: "rpn/output"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 256
kernel_size: 3 pad: 1 stride: 1
weight_filler { type: "gaussian" std: 0.01 }
bias_filler { type: "constant" value: 0 }
}
}
layer {
name: "rpn_relu/3x3"
type: "ReLU"
bottom: "rpn/output"
top: "rpn/output"
}

#layer {
#  name: "rpn_conv/3x3"
#  type: "Convolution"
#  bottom: "conv5"
#  top: "rpn_conv/3x3"
#  param { lr_mult: 1.0 }
#  param { lr_mult: 2.0 }
#  convolution_param {
#    num_output: 192
#    kernel_size: 3 pad: 1 stride: 1
#    weight_filler { type: "gaussian" std: 0.01 }
#    bias_filler { type: "constant" value: 0 }
#  }
#}
#layer {
#  name: "rpn_conv/5x5"
#  type: "Convolution"
#  bottom: "conv5"
#  top: "rpn_conv/5x5"
#  param { lr_mult: 1.0 }
#  param { lr_mult: 2.0 }
#  convolution_param {
#    num_output: 64
#    kernel_size: 5 pad: 2 stride: 1
#    weight_filler { type: "gaussian" std: 0.0036 }
#    bias_filler { type: "constant" value: 0 }
#  }
#}
#layer {
#  name: "rpn/output"
#  type: "Concat"
#  bottom: "rpn_conv/3x3"
#  bottom: "rpn_conv/5x5"
#  top: "rpn/output"
#}
#layer {
#  name: "rpn_relu/output"
#  type: "ReLU"
#  bottom: "rpn/output"
#  top: "rpn/output"
#}

layer {
name: "rpn_cls_score"
type: "Convolution"
bottom: "rpn/output"
top: "rpn_cls_score"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 18 # 2(bg/fg) * 9(anchors)
kernel_size: 1 pad: 0 stride: 1
weight_filler { type: "gaussian" std: 0.01 }
bias_filler { type: "constant" value: 0 }
}
}
layer {
name: "rpn_bbox_pred"
type: "Convolution"
bottom: "rpn/output"
top: "rpn_bbox_pred"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
convolution_param {
num_output: 36 # 4 * 9(anchors)
kernel_size: 1 pad: 0 stride: 1
weight_filler { type: "gaussian" std: 0.01 }
bias_filler { type: "constant" value: 0 }
}
}
layer {
bottom: "rpn_cls_score"
top: "rpn_cls_score_reshape"
name: "rpn_cls_score_reshape"
type: "Reshape"
reshape_param { shape { dim: 0 dim: 2 dim: -1 dim: 0 } }
}
layer {
name: 'rpn-data'
type: 'Python'
bottom: 'rpn_cls_score'
bottom: 'gt_boxes'
bottom: 'im_info'
bottom: 'data'
top: 'rpn_labels'
top: 'rpn_bbox_targets'
top: 'rpn_bbox_inside_weights'
top: 'rpn_bbox_outside_weights'
python_param {
module: 'rpn.anchor_target_layer'
layer: 'AnchorTargetLayer'
param_str: "'feat_stride': 16"
}
}
layer {
name: "rpn_loss_cls"
type: "SoftmaxWithLoss"
bottom: "rpn_cls_score_reshape"
bottom: "rpn_labels"
propagate_down: 1
propagate_down: 0
top: "rpn_cls_loss"
loss_weight: 1
loss_param {
ignore_label: -1
normalize: true
}
}
layer {
name: "rpn_loss_bbox"
type: "SmoothL1Loss"
bottom: "rpn_bbox_pred"
bottom: "rpn_bbox_targets"
bottom: 'rpn_bbox_inside_weights'
bottom: 'rpn_bbox_outside_weights'
top: "rpn_loss_bbox"
loss_weight: 1
smooth_l1_loss_param { sigma: 3.0 }
}

#========= RoI Proposal ============

layer {
name: "rpn_cls_prob"
type: "Softmax"
bottom: "rpn_cls_score_reshape"
top: "rpn_cls_prob"
}
layer {
name: 'rpn_cls_prob_reshape'
type: 'Reshape'
bottom: 'rpn_cls_prob'
top: 'rpn_cls_prob_reshape'
reshape_param { shape { dim: 0 dim: 18 dim: -1 dim: 0 } }
}
layer {
name: 'proposal'
type: 'Python'
bottom: 'rpn_cls_prob_reshape'
bottom: 'rpn_bbox_pred'
bottom: 'im_info'
top: 'rpn_rois'
top: 'rpn_scores'
python_param {
module: 'rpn.proposal_layer'
layer: 'ProposalLayer'
param_str: "'feat_stride': 16"
}
}
#layer {
#  name: 'debug-data'
#  type: 'Python'
#  bottom: 'data'
#  bottom: 'rpn_rois'
#  bottom: 'rpn_scores'
#  python_param {
#    module: 'rpn.debug_layer'
#    layer: 'RPNDebugLayer'
#  }
#}
layer {
name: 'roi-data'
type: 'Python'
bottom: 'rpn_rois'
bottom: 'gt_boxes'
top: 'rois'
top: 'labels'
top: 'bbox_targets'
top: 'bbox_inside_weights'
top: 'bbox_outside_weights'
python_param {
module: 'rpn.proposal_target_layer'
layer: 'ProposalTargetLayer'
param_str: "'num_classes': 21"
}
}

#========= RCNN ============

layer {
name: "roi_pool_conv5"
type: "ROIPooling"
bottom: "conv5"
bottom: "rois"
top: "roi_pool_conv5"
roi_pooling_param {
pooled_w: 6
pooled_h: 6
spatial_scale: 0.0625 # 1/16
}
}
layer {
name: "fc6"
type: "InnerProduct"
bottom: "roi_pool_conv5"
top: "fc6"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
inner_product_param {
num_output: 4096
}
}
layer {
name: "relu6"
type: "ReLU"
bottom: "fc6"
top: "fc6"
}
layer {
name: "drop6"
type: "Dropout"
bottom: "fc6"
top: "fc6"
dropout_param {
dropout_ratio: 0.5
scale_train: false
}
}
layer {
name: "fc7"
type: "InnerProduct"
bottom: "fc6"
top: "fc7"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
inner_product_param {
num_output: 4096
}
}
layer {
name: "relu7"
type: "ReLU"
bottom: "fc7"
top: "fc7"
}
layer {
name: "drop7"
type: "Dropout"
bottom: "fc7"
top: "fc7"
dropout_param {
dropout_ratio: 0.5
scale_train: false
}
}
layer {
name: "cls_score"
type: "InnerProduct"
bottom: "fc7"
top: "cls_score"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
inner_product_param {
num_output: 21
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "bbox_pred"
type: "InnerProduct"
bottom: "fc7"
top: "bbox_pred"
param { lr_mult: 1.0 }
param { lr_mult: 2.0 }
inner_product_param {
num_output: 84
weight_filler {
type: "gaussian"
std: 0.001
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "loss_cls"
type: "SoftmaxWithLoss"
bottom: "cls_score"
bottom: "labels"
propagate_down: 1
propagate_down: 0
top: "cls_loss"
loss_weight: 1
loss_param {
ignore_label: -1
normalize: true
}
}
layer {
name: "loss_bbox"
type: "SmoothL1Loss"
bottom: "bbox_pred"
bottom: "bbox_targets"
bottom: 'bbox_inside_weights'
bottom: 'bbox_outside_weights'
top: "bbox_loss"
loss_weight: 1
}

@karaspd

karaspd commented Dec 8, 2017

@wubaorong Are you trying to retrain the ZF model from pretrained parameters with PASCAL VOC? What command do you use to train?
Have you tried VGG as well?

@karaspd

karaspd commented Dec 8, 2017

@wubaorong Check this issue as well: #65. See if any of the solutions works in your case.

@wubaorong
Author

@karaspd Thank you very much; I have solved this problem with your help. Now I can train the net. But when I train Faster R-CNN end2end with the ZF net, no log file appears in the log folder, whereas with alt_opt training the log file is saved automatically. Do you know why?

@wubaorong
Author

@karaspd I have solved the log problem. I have a new question: if I want to continue training from the model that has already run 10,000 iterations, where should I make the change? Is it the faster_rcnn_end2end.sh file?
time ./tools/train_net.py --gpu ${GPU_ID} \
  --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt \
  --weights data/imagenet_models/${NET}.v2.caffemodel \
  --imdb ${TRAIN_IMDB} \
  --iters ${ITERS} \
  --cfg experiments/cfgs/faster_rcnn_end2end.yml \
  ${EXTRA_ARGS}

@karaspd

karaspd commented Dec 11, 2017

@wubaorong Yes, you need to change faster_rcnn_end2end.sh if you are training Faster R-CNN end-to-end. If you want to continue training from a snapshot you took earlier, you can use:
time ./tools/train_net.py --gpu ${GPU_ID} \
  --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver_top.prototxt \
  --snapshot zf_faster_rcnn_iter_100000.solverstate \
  --imdb ${TRAIN_IMDB} \
  --iters ${ITERS} \
  --cfg experiments/cfgs/faster_rcnn_end2end.yml \
  ${EXTRA_ARGS}

@wubaorong
Author

@karaspd I changed faster_rcnn_end2end.sh as you said, but I got an error:
Traceback (most recent call last):
File "./tools/train_net.py", line 114, in
imdb, roidb = combined_roidb(args.imdb_name)
File "./tools/train_net.py", line 73, in combined_roidb
roidbs = [get_roidb(s) for s in imdb_names.split('+')]
File "./tools/train_net.py", line 70, in get_roidb
roidb = get_training_roidb(imdb)
File "/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 118, in get_training_roidb
imdb.append_flipped_images()
File "/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 106, in append_flipped_images
boxes = self.roidb[i]['boxes'].copy()
File "/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 67, in roidb
self._roidb = self.roidb_handler()
File "/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/datasets/pascal_voc.py", line 132, in selective_search_roidb
ss_roidb = self._load_selective_search_roidb(gt_roidb)
File "/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/datasets/pascal_voc.py", line 166, in _load_selective_search_roidb
'Selective search data not found at: {}'.format(filename)
AssertionError: Selective search data not found at: /home/wu/faster_rcnn/py-faster-rcnn/data/selective_search_data/voc_2007_trainval.mat
Do you know how to solve this problem?
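If it helps later readers: the assertion fires because the roidb handler fell back to selective-search proposals. End-to-end training normally sets TRAIN.PROPOSAL_METHOD to 'gt' via experiments/cfgs/faster_rcnn_end2end.yml, so no precomputed .mat file is needed. A rough, hypothetical model of that dispatch (the real code lives in lib/datasets and lib/fast_rcnn/config.py):

```python
# Hypothetical sketch of how the dataset class picks its roidb handler.
# With the end2end cfg loaded, TRAIN.PROPOSAL_METHOD is 'gt' and the
# selective_search_data/*.mat file is never read.

def pick_roidb_handler(proposal_method):
    handlers = {
        'gt': 'gt_roidb',                              # ground-truth boxes only
        'selective_search': 'selective_search_roidb',  # needs the .mat proposals
    }
    return handlers[proposal_method]

handler = pick_roidb_handler('gt')  # what --cfg faster_rcnn_end2end.yml selects
```

So the practical check is that the --cfg flag actually points at the end2end yml; without it, the default proposal method goes looking for the selective-search file.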

@meetps

meetps commented Feb 19, 2018

Check my comment here

@zqdeepbluesky

@wubaorong @karaspd @meetshah1995 @Dectinc
Hi, when I train FPN on my own dataset, I get this error:

I0312 16:25:25.883342 2983 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
/home/zq/faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Floating point exception (core dumped)

I tried changing the lr from 0.001 to 0.0005, but it didn't work. I also changed RNG_SEED, which didn't work either.
I don't know how to solve it. Please help me, thanks so much!

@wubaorong
Author

@zqdeepbluesky You can try an even smaller lr, like 0.0001. I solved my problem by reducing the lr.
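For anyone unsure where that change goes: the learning rate lives in the solver prototxt, not the net prototxt. A minimal sketch with illustrative values (field names are standard Caffe SolverParameter fields; the actual file in this setup is models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt):

```
train_net: "models/pascal_voc/ZF/faster_rcnn_end2end/train.prototxt"
base_lr: 0.0001      # reduced from the default 0.001, as suggested above
lr_policy: "step"
gamma: 0.1
stepsize: 50000
momentum: 0.9
weight_decay: 0.0005
```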

@amlandas78

Has anyone solved the problem? I get the same error at iteration 5800 with a learning rate of 0.001, and at iteration 18800 with 0.0001. If someone has solved it, please help me.

@Jacob-Bian

Jacob-Bian commented May 10, 2018 via email
