
Error running the faster_rcnn_end2end.sh to train my own network #574

Open

sohamghoshmusigma opened this issue May 12, 2017 · 14 comments

sohamghoshmusigma commented May 12, 2017

Hi,

I'm facing problems running the training according to the instructions on the home page wiki.

The training starts and then it throws an out-of-memory error. To prevent that, I changed my batch size to 1, but it still throws an error.

The command I'm trying to run is **./experiments/scripts/faster_rcnn_end2end.sh 0 VGG16 pascal_voc**

I'm reproducing the last part of the error here:

I0512 12:39:29.638978 28435 net.cpp:228] input-data does not need backward computation.
I0512 12:39:29.638983 28435 net.cpp:270] This network produces output loss_bbox
I0512 12:39:29.638989 28435 net.cpp:270] This network produces output loss_cls
I0512 12:39:29.638996 28435 net.cpp:270] This network produces output rpn_cls_loss
I0512 12:39:29.639003 28435 net.cpp:270] This network produces output rpn_loss_bbox
I0512 12:39:29.639050 28435 net.cpp:283] Network initialization done.
I0512 12:39:29.639220 28435 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/VGG16.v2.caffemodel
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432430
I0512 12:39:30.745797 28435 net.cpp:816] Ignoring source layer pool5
I0512 12:39:30.856762 28435 net.cpp:816] Ignoring source layer fc8
I0512 12:39:30.856811 28435 net.cpp:816] Ignoring source layer prob
Solving...
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:166: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
fg_inds = npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:177: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
bg_inds = npr.choice(bg_inds, size=bg_rois_per_this_image, replace=False)
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:184: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
labels[fg_rois_per_this_image:] = 0
F0512 12:39:31.373610 28435 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
./experiments/scripts/faster_rcnn_end2end.sh: line 57: 28435 Aborted (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt --weights data/imagenet_models/${NET}.v2.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/faster_rcnn_end2end.yml ${EXTRA_ARGS}

Any help, suggestions, or follow-up would be highly appreciated.


djdam commented May 20, 2017

I think the "out of memory" part is self-explanatory? :) Training with the VGG16 model takes at least 6 GB of GPU memory. I'm not able to run it either, as I only have 4.5 GB. You can try training with the ZF model instead; it takes less memory. Otherwise, set up a GPU cloud server on AWS, which is what I did.

@sohamghoshmusigma

Thanks @djdam
Yes, indeed it was self-explanatory; I was wondering if there was some workaround for it. But I tried ZF like you suggested, and my training did start and ran until the RPN stage.

And then it threw this error:

Wrote RPN proposals to /home/py-faster-rcnn/output/faster_rcnn_alt_opt/voc_2007_trainval/zf_rpn_stage1_iter_80000_proposals.pkl

Stage 1 Fast R-CNN using RPN proposals, init from ImageNet model

Init model: data/imagenet_models/ZF.v2.caffemodel
RPN proposals: /home/py-faster-rcnn/output/faster_rcnn_alt_opt/voc_2007_trainval/zf_rpn_stage1_iter_80000_proposals.pkl
Using config:
{'DATA_DIR': '/home/py-faster-rcnn/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'faster_rcnn_alt_opt',
'GPU_ID': 0,
'MATLAB': 'matlab',
'MODELS_DIR': '/home/py-faster-rcnn/models/pascal_voc',
'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]),
'RNG_SEED': 3,
'ROOT_DIR': '/home/manojTest/py-faster-rcnn',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'NMS': 0.3,
'PROPOSAL_METHOD': 'selective_search',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': True,
'BATCH_SIZE': 128,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': False,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'HAS_RPN': False,
'IMS_PER_BATCH': 2,
'MAX_SIZE': 1000,
'PROPOSAL_METHOD': 'rpn',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 16,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_INFIX': 'stage1',
'SNAPSHOT_ITERS': 10000,
'USE_FLIPPED': True,
'USE_PREFETCH': False},
'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval for training
Set proposal method: rpn
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/py-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
loading /home/py-faster-rcnn/output/faster_rcnn_alt_opt/voc_2007_trainval/zf_rpn_stage1_iter_80000_proposals.pkl
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "./tools/train_faster_rcnn_alt_opt.py", line 189, in train_fast_rcnn
roidb, imdb = get_roidb(imdb_name, rpn_file=rpn_file)
File "./tools/train_faster_rcnn_alt_opt.py", line 67, in get_roidb
roidb = get_training_roidb(imdb)
File "/home/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 122, in get_training_roidb
imdb.append_flipped_images()
File "/home/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 111, in append_flipped_images
assert (boxes[:, 2] >= boxes[:, 0]).all()
AssertionError

Would you happen to have any idea about this? Does it mean I need to have flipped copies of my original images in the directory? I just arranged all the images of my dataset in the same file and folder structure as the COCO dataset; I wasn't aware of any flipped images.


djdam commented May 24, 2017

Hi

I think there are two ways to deal with your problem:

  1. You probably have an incorrect ground-truth box in your annotation files. Check that X2 > X1 and Y2 > Y1 for every box (see the sketch after this list).
  2. If you can't figure it out, you can also train without flipped images by setting the config parameter TRAIN.USE_FLIPPED to False.
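
A minimal sketch of such a check, assuming Pascal VOC-style XML annotations (the annotation path is hypothetical; point it at your own dataset). Note that the stock VOC loader in py-faster-rcnn subtracts 1 from the coordinates, so boxes with xmin or ymin equal to 0 can also trip the assert in append_flipped_images:

```python
# Hypothetical sanity check for VOC-style annotations: flag any box where
# xmax <= xmin or ymax <= ymin, and warn about 0-based coordinates, since
# the loader subtracts 1 from every coordinate it reads.
import glob
import xml.etree.ElementTree as ET

for xml_file in glob.glob('data/VOCdevkit2007/VOC2007/Annotations/*.xml'):
    for obj in ET.parse(xml_file).findall('object'):
        bb = obj.find('bndbox')
        x1, y1, x2, y2 = [float(bb.find(k).text)
                          for k in ('xmin', 'ymin', 'xmax', 'ymax')]
        if x2 <= x1 or y2 <= y1:
            print('Bad box in {}: {}'.format(xml_file, (x1, y1, x2, y2)))
        elif x1 < 1 or y1 < 1:
            print('0-based box in {}: {}'.format(xml_file, (x1, y1, x2, y2)))
```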

@sohamghoshmusigma

@djdam

You have been extremely helpful, thank you so much for the help you have provided on this topic.

I am going to check if all my annotations are correct.

Can you tell me one more thing: if I have to change TRAIN.USE_FLIPPED to False, there is no USE_FLIPPED under TRAIN in my faster_rcnn_end2end.yml, so should I just add a line there with USE_FLIPPED: False, or should I be changing the config.py file? Although config.py says not to edit it by hand.

Once again, thank you so much for your help on the matter :)


djdam commented May 24, 2017

You're welcome! You should edit the config file, never config.py. Your config file is in YAML format, so add the line USE_FLIPPED: False under the TRAIN section, as in the excerpt below.
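
For reference, a minimal sketch of what the relevant part of experiments/cfgs/faster_rcnn_end2end.yml would look like with that line added (keep whatever keys your copy already has; only the USE_FLIPPED line is new):

```yaml
# experiments/cfgs/faster_rcnn_end2end.yml (excerpt)
TRAIN:
  HAS_RPN: True        # already in the stock file
  USE_FLIPPED: False   # new line: train without horizontally-flipped images
```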

Let me know how it is going

@sohamghoshmusigma

Could you tell me one more thing, though: how do I restart the training from the point where the RPN proposals are already saved?
If I start now, it begins again from the part where it has to converge the losses, but I do have a file saved from the run up to the point where the RPN proposals were generated.


djdam commented May 24, 2017

I made my own custom scripts to do that; just copy the existing train.py and alter it.
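
For what it's worth, here is a rough sketch of the idea (not my actual script): load the saved stage-1 proposals, which is the piece a modified training script would reuse instead of re-running the RPN stage. It assumes Python 2, as used by py-faster-rcnn, and takes the path from the log above:

```python
# Load the stage-1 RPN proposals that were already written to disk, so a
# customized training script can start from the Fast R-CNN stage.
import cPickle

rpn_file = ('output/faster_rcnn_alt_opt/voc_2007_trainval/'
            'zf_rpn_stage1_iter_80000_proposals.pkl')

with open(rpn_file, 'rb') as f:
    proposals = cPickle.load(f)  # a list with one array of proposal boxes per image

print('Loaded proposals for {} images'.format(len(proposals)))
```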

@wubaorong

@djdam I have set USE_FLIPPED as suggested, but I still hit the error in bbox_transform.py: my dw, dh, dx and dy keep getting bigger.

@ujsyehao

@wubaorong Please post a screenshot of the error.

@wubaorong

@ujsyehao I didn't save the error screenshot, but here is the error output I copied:
I1120 20:15:42.906991 3777 solver.cpp:229] Iteration 0, loss = 4.52421
I1120 20:15:42.907033 3777 solver.cpp:245] Train net output #0: bbox_loss = 0.024606 (* 1 = 0.024606 loss)
I1120 20:15:42.907042 3777 solver.cpp:245] Train net output #1: cls_loss = 3.55821 (* 1 = 3.55821 loss)
I1120 20:15:42.907047 3777 solver.cpp:245] Train net output #2: rpn_cls_loss = 0.683348 (* 1 = 0.683348 loss)
I1120 20:15:42.907052 3777 solver.cpp:245] Train net output #3: rpn_loss_bbox = 0.236713 (* 1 = 0.236713 loss)
I1120 20:15:42.907059 3777 sgd_solver.cpp:106] Iteration 0, lr = 0.001
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Floating Point Exception
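
For context, the overflow happens because np.exp() is applied to unbounded regression deltas, so any fix boils down to keeping them bounded: either lower the learning rate (as suggested further down) or clamp the deltas before the exponential in bbox_transform.py. A minimal sketch of the clamping idea; the threshold value is borrowed from later detector codebases and is an assumption, not something defined in py-faster-rcnn:

```python
# Clamp dw/dh before exponentiation so pred_w/pred_h stay finite even when
# the network's regression outputs blow up.
import numpy as np

BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)  # assumed clip threshold (~4.14)

def safe_exp_scale(deltas, sizes):
    """np.exp(clamped delta) * size, mirroring the pred_w / pred_h lines."""
    return np.exp(np.minimum(deltas, BBOX_XFORM_CLIP)) * sizes

# Toy example with a runaway delta that would otherwise overflow np.exp():
widths = np.array([16.0, 32.0])
dw = np.array([0.2, 500.0])
print(safe_exp_scale(dw, widths))  # finite values; the second is capped at 2000
```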


tongpinmo commented Dec 21, 2017

@wubaorong Floating point exception: have you ever solved this problem? The training data I used is fast_rcnn_models.

@wubaorong

@tongpinmo I solved this problem by decreasing the learning rate in solver.prototxt.
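
For reference, the earlier log shows the run starting at lr = 0.001, which matches the base_lr in the stock solver.prototxt; the change is just lowering that value, for example:

```
# solver.prototxt of whichever model you train
# (e.g. models/pascal_voc/ZF/faster_rcnn_end2end/solver.prototxt)
base_lr: 0.0001   # lowered from the stock 0.001; the exact value is a judgment call
```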


amlandas78 commented May 10, 2018

@sohamghoshmusigma Have you solved the problem? Did you find any errors in the annotation files?

@jiaxiaoharry

> @tongpinmo I solved this problem by decreasing the learning rate in solver.prototxt.

It solved my problem. Thank you!
