
Error running the faster_rcnn_end2end.sh to train my own network #574

Open

sohamghoshmusigma opened this issue May 12, 2017 · 14 comments

sohamghoshmusigma commented May 12, 2017

Hi,

I'm facing problems running the training according to the instructions on the home page wiki.

The training starts and then it throws an out-of-memory error. To prevent that, I changed my batch size to 1, but it still throws an error.

The command I'm trying to run is **./experiments/scripts/faster_rcnn_end2end.sh 0 VGG16 pascal_voc**

I'm reproducing the last part of the error here:

I0512 12:39:29.638978 28435 net.cpp:228] input-data does not need backward computation.
I0512 12:39:29.638983 28435 net.cpp:270] This network produces output loss_bbox
I0512 12:39:29.638989 28435 net.cpp:270] This network produces output loss_cls
I0512 12:39:29.638996 28435 net.cpp:270] This network produces output rpn_cls_loss
I0512 12:39:29.639003 28435 net.cpp:270] This network produces output rpn_loss_bbox
I0512 12:39:29.639050 28435 net.cpp:283] Network initialization done.
I0512 12:39:29.639220 28435 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/VGG16.v2.caffemodel
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432430
I0512 12:39:30.745797 28435 net.cpp:816] Ignoring source layer pool5
I0512 12:39:30.856762 28435 net.cpp:816] Ignoring source layer fc8
I0512 12:39:30.856811 28435 net.cpp:816] Ignoring source layer prob
Solving...
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:166: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
fg_inds = npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:177: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
bg_inds = npr.choice(bg_inds, size=bg_rois_per_this_image, replace=False)
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:184: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
labels[fg_rois_per_this_image:] = 0
F0512 12:39:31.373610 28435 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
./experiments/scripts/faster_rcnn_end2end.sh: line 57: 28435 Aborted (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt --weights data/imagenet_models/${NET}.v2.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/faster_rcnn_end2end.yml ${EXTRA_ARGS}

Any help, suggestions, or follow-up would be highly appreciated.


djdam commented May 20, 2017

I think the "out of memory" part is self-explanatory? :) Training with the VGG16 model takes at least 6 GB of GPU memory. I'm not able to run it either, as I only have 4.5 GB. You can try training with the ZF model instead; it takes less memory. Otherwise, set up a GPU cloud server on AWS, which is what I did.

@sohamghoshmusigma

Thanks @djdam
Yes, indeed it was self-explanatory; I was wondering if there was some workaround for it. But I tried ZF like you suggested, and my training did start and ran until the RPN stage.

And then it threw this error:

Wrote RPN proposals to /home/py-faster-rcnn/output/faster_rcnn_alt_opt/voc_2007_trainval/zf_rpn_stage1_iter_80000_proposals.pkl

Stage 1 Fast R-CNN using RPN proposals, init from ImageNet model

Init model: data/imagenet_models/ZF.v2.caffemodel
RPN proposals: /home/py-faster-rcnn/output/faster_rcnn_alt_opt/voc_2007_trainval/zf_rpn_stage1_iter_80000_proposals.pkl
Using config:
{'DATA_DIR': '/home/py-faster-rcnn/data',
'DEDUP_BOXES': 0.0625,
'EPS': 1e-14,
'EXP_DIR': 'faster_rcnn_alt_opt',
'GPU_ID': 0,
'MATLAB': 'matlab',
'MODELS_DIR': '/home/py-faster-rcnn/models/pascal_voc',
'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]),
'RNG_SEED': 3,
'ROOT_DIR': '/home/manojTest/py-faster-rcnn',
'TEST': {'BBOX_REG': True,
'HAS_RPN': True,
'MAX_SIZE': 1000,
'NMS': 0.3,
'PROPOSAL_METHOD': 'selective_search',
'RPN_MIN_SIZE': 16,
'RPN_NMS_THRESH': 0.7,
'RPN_POST_NMS_TOP_N': 300,
'RPN_PRE_NMS_TOP_N': 6000,
'SCALES': [600],
'SVM': False},
'TRAIN': {'ASPECT_GROUPING': True,
'BATCH_SIZE': 128,
'BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'BBOX_NORMALIZE_MEANS': [0.0, 0.0, 0.0, 0.0],
'BBOX_NORMALIZE_STDS': [0.1, 0.1, 0.2, 0.2],
'BBOX_NORMALIZE_TARGETS': True,
'BBOX_NORMALIZE_TARGETS_PRECOMPUTED': False,
'BBOX_REG': True,
'BBOX_THRESH': 0.5,
'BG_THRESH_HI': 0.5,
'BG_THRESH_LO': 0.0,
'FG_FRACTION': 0.25,
'FG_THRESH': 0.5,
'HAS_RPN': False,
'IMS_PER_BATCH': 2,
'MAX_SIZE': 1000,
'PROPOSAL_METHOD': 'rpn',
'RPN_BATCHSIZE': 256,
'RPN_BBOX_INSIDE_WEIGHTS': [1.0, 1.0, 1.0, 1.0],
'RPN_CLOBBER_POSITIVES': False,
'RPN_FG_FRACTION': 0.5,
'RPN_MIN_SIZE': 16,
'RPN_NEGATIVE_OVERLAP': 0.3,
'RPN_NMS_THRESH': 0.7,
'RPN_POSITIVE_OVERLAP': 0.7,
'RPN_POSITIVE_WEIGHT': -1.0,
'RPN_POST_NMS_TOP_N': 2000,
'RPN_PRE_NMS_TOP_N': 12000,
'SCALES': [600],
'SNAPSHOT_INFIX': 'stage1',
'SNAPSHOT_ITERS': 10000,
'USE_FLIPPED': True,
'USE_PREFETCH': False},
'USE_GPU_NMS': True}
Loaded dataset voc_2007_trainval for training
Set proposal method: rpn
Appending horizontally-flipped training examples...
voc_2007_trainval gt roidb loaded from /home/py-faster-rcnn/data/cache/voc_2007_trainval_gt_roidb.pkl
loading /home/py-faster-rcnn/output/faster_rcnn_alt_opt/voc_2007_trainval/zf_rpn_stage1_iter_80000_proposals.pkl
Process Process-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "./tools/train_faster_rcnn_alt_opt.py", line 189, in train_fast_rcnn
roidb, imdb = get_roidb(imdb_name, rpn_file=rpn_file)
File "./tools/train_faster_rcnn_alt_opt.py", line 67, in get_roidb
roidb = get_training_roidb(imdb)
File "/home/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 122, in get_training_roidb
imdb.append_flipped_images()
File "/home/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 111, in append_flipped_images
assert (boxes[:, 2] >= boxes[:, 0]).all()
AssertionError

Would you happen to have any idea about this? Does it mean I need to have flipped copies of my original images in the directory? I just arranged all the images of my dataset in the same file and folder structure as the COCO dataset; I wasn't aware of any flipped images.


djdam commented May 24, 2017

Hi

I think there are two ways to deal with your problem:

  1. You probably have an incorrect ground-truth box in your annotation files. Check that X2 > X1 and Y2 > Y1 for every box (see the sketch after this list).
  2. If you can't figure it out, you can also train without flipped images by setting the config parameter TRAIN.USE_FLIPPED to False.
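
A minimal sketch of such a check, assuming Pascal VOC-style XML annotations (the annotation path is hypothetical; point it at your own dataset). Note that the stock VOC loader in py-faster-rcnn subtracts 1 from the coordinates, so boxes with xmin or ymin equal to 0 can also trip the assert in append_flipped_images:

```python
# Hypothetical sanity check for VOC-style annotations: flag any box where
# xmax <= xmin or ymax <= ymin, and warn about 0-based coordinates, since
# the loader subtracts 1 from every coordinate it reads.
import glob
import xml.etree.ElementTree as ET

for xml_file in glob.glob('data/VOCdevkit2007/VOC2007/Annotations/*.xml'):
    for obj in ET.parse(xml_file).findall('object'):
        bb = obj.find('bndbox')
        x1, y1, x2, y2 = [float(bb.find(k).text)
                          for k in ('xmin', 'ymin', 'xmax', 'ymax')]
        if x2 <= x1 or y2 <= y1:
            print('Bad box in {}: {}'.format(xml_file, (x1, y1, x2, y2)))
        elif x1 < 1 or y1 < 1:
            print('0-based box in {}: {}'.format(xml_file, (x1, y1, x2, y2)))
```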

@sohamghoshmusigma

@djdam

You have been extremely helpful, thank you so much for the help you have provided on this topic.

I am going to check if all my annotations are correct.

Can you tell me one more thing: if I have to change TRAIN.USE_FLIPPED to False, there is no USE_FLIPPED under TRAIN in my faster_rcnn_end2end.yml, so should I just add a line there with USE_FLIPPED: False, or should I be changing the config.py file? Although config.py says not to edit it by hand.

Once again, thank you so much for your help on the matter :)


djdam commented May 24, 2017

You're welcome! You should edit the config file, never config.py. Your config file is in YAML format, so add the line USE_FLIPPED: False under the TRAIN section, as in the excerpt below.
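
For reference, a minimal sketch of what the relevant part of experiments/cfgs/faster_rcnn_end2end.yml would look like with that line added (keep whatever keys your copy already has; only the USE_FLIPPED line is new):

```yaml
# experiments/cfgs/faster_rcnn_end2end.yml (excerpt)
TRAIN:
  HAS_RPN: True        # already in the stock file
  USE_FLIPPED: False   # new line: train without horizontally-flipped images
```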

Let me know how it is going

@sohamghoshmusigma

Could you tell me one more thing, though: how do I restart the training from the point where the RPN proposals are already saved?
If I start now, it begins again from the part where it has to converge the losses, but I do have a file saved from the run up to the point where the RPN proposals were generated.


djdam commented May 24, 2017

I made my own custom scripts to do that; just copy the existing train.py and alter it.
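
For what it's worth, here is a rough sketch of the idea (not my actual script): load the saved stage-1 proposals, which is the piece a modified training script would reuse instead of re-running the RPN stage. It assumes Python 2, as used by py-faster-rcnn, and takes the path from the log above:

```python
# Load the stage-1 RPN proposals that were already written to disk, so a
# customized training script can start from the Fast R-CNN stage.
import cPickle

rpn_file = ('output/faster_rcnn_alt_opt/voc_2007_trainval/'
            'zf_rpn_stage1_iter_80000_proposals.pkl')

with open(rpn_file, 'rb') as f:
    proposals = cPickle.load(f)  # a list with one array of proposal boxes per image

print('Loaded proposals for {} images'.format(len(proposals)))
```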

@wubaorong

@djdam I have set USE_FLIPPED as suggested, but I still hit the error in bbox_transform.py: my dw, dh, dx and dy keep getting bigger.

@ujsyehao

@wubaorong Please post a screenshot of the error.

@wubaorong

@ujsyehao I didn't save the error screenshot, but here is the error output I copied:
I1120 20:15:42.906991 3777 solver.cpp:229] Iteration 0, loss = 4.52421
I1120 20:15:42.907033 3777 solver.cpp:245] Train net output #0: bbox_loss = 0.024606 (* 1 = 0.024606 loss)
I1120 20:15:42.907042 3777 solver.cpp:245] Train net output #1: cls_loss = 3.55821 (* 1 = 3.55821 loss)
I1120 20:15:42.907047 3777 solver.cpp:245] Train net output #2: rpn_cls_loss = 0.683348 (* 1 = 0.683348 loss)
I1120 20:15:42.907052 3777 solver.cpp:245] Train net output #3: rpn_loss_bbox = 0.236713 (* 1 = 0.236713 loss)
I1120 20:15:42.907059 3777 sgd_solver.cpp:106] Iteration 0, lr = 0.001
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/wu/faster_rcnn/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Floating Point Exception
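
For context, the overflow happens because np.exp() is applied to unbounded regression deltas, so any fix boils down to keeping them bounded: either lower the learning rate (as suggested further down) or clamp the deltas before the exponential in bbox_transform.py. A minimal sketch of the clamping idea; the threshold value is borrowed from later detector codebases and is an assumption, not something defined in py-faster-rcnn:

```python
# Clamp dw/dh before exponentiation so pred_w/pred_h stay finite even when
# the network's regression outputs blow up.
import numpy as np

BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)  # assumed clip threshold (~4.14)

def safe_exp_scale(deltas, sizes):
    """np.exp(clamped delta) * size, mirroring the pred_w / pred_h lines."""
    return np.exp(np.minimum(deltas, BBOX_XFORM_CLIP)) * sizes

# Toy example with a runaway delta that would otherwise overflow np.exp():
widths = np.array([16.0, 32.0])
dw = np.array([0.2, 500.0])
print(safe_exp_scale(dw, widths))  # finite values; the second is capped at 2000
```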


tongpinmo commented Dec 21, 2017

@wubaorong Floating point exception: have you ever solved this problem? The training data I used is fast_rcnn_models.

@wubaorong

@tongpinmo I solved this problem by decreasing the learning rate in solver.prototxt.
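
For reference, the earlier log shows the run starting at lr = 0.001, which matches the base_lr in the stock solver.prototxt; the change is just lowering that value, for example:

```
# solver.prototxt of whichever model you train
# (e.g. models/pascal_voc/ZF/faster_rcnn_end2end/solver.prototxt)
base_lr: 0.0001   # lowered from the stock 0.001; the exact value is a judgment call
```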


amlandas78 commented May 10, 2018

@sohamghoshmusigma Have you solved the problem? Did you find any errors in the annotation files?

@jiaxiaoharry

> @tongpinmo I solved this problem by decreasing the learning rate in solver.prototxt.

It solved my problem. Thank you!
