Error running the faster_rcnn_end2end.sh to train my own network #574
I think the "out of memory" part is self-explanatory? :) Training with the VGG16 network takes at least 6 GB of GPU memory. I am not able to run it either, as I have 4.5 GB. You can try training with the ZF model, it takes less memory. Otherwise, go to AWS and set up a GPU cloud server, I did that as well.
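For example, the same training script can be pointed at ZF instead of VGG16 (just a sketch; the script takes the GPU id, network name, and dataset, as in the command later in this thread):

```sh
# ZF needs substantially less GPU memory than VGG16 (illustrative, not a guarantee)
./experiments/scripts/faster_rcnn_end2end.sh 0 ZF pascal_voc
```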
Thanks @djdam. And then it threw this error:

Wrote RPN proposals to /home/py-faster-rcnn/output/faster_rcnn_alt_opt/voc_2007_trainval/zf_rpn_stage1_iter_80000_proposals.pkl
Init model: data/imagenet_models/ZF.v2.caffemodel

Would you happen to have any idea about this? Does this mean I have to have the flipped images of my original images in the directory? I've just given all the images of my dataset the same file and folder structure as the COCO dataset. I wasn't aware of any flipped images.
Hi, I think there are two ways to deal with your problem:
1. Check that all the annotations in your dataset are correct.
2. Set TRAIN.USE_FLIPPED to False, so that training does not generate flipped copies of your images.
You have been extremely helpful, thank you so much for the help you have provided on this topic. I am going to check if all my annotations are correct. Can you tell me one more thing: in case I have to change TRAIN.USE_FLIPPED to False, in my faster_rcnn_end2end.yml there is no USE_FLIPPED under TRAIN, so should I just add a line there with USE_FLIPPED = False, or should I be changing the config.py file? Although the config.py file says not to edit it by hand. Once again, thank you so much for your help on the matter :)
You're welcome! You should edit the config file, never config.py. Your config file is in yml format, so you should add it like USE_FLIPPED: False under the TRAIN section. Let me know how it goes.
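For example, a minimal sketch of the TRAIN block in experiments/cfgs/faster_rcnn_end2end.yml after the change (the other keys shown are illustrative; keep whatever is already in your file):

```yaml
TRAIN:
  HAS_RPN: True          # existing keys stay as they are
  IMS_PER_BATCH: 1
  USE_FLIPPED: False     # skip the horizontally flipped copies of the training images
```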
Could you tell me one more thing though: how do I restart the training from the point where the RPN proposals are already saved?
I made my own custom scripts to do that: just copy the existing train.py and alter it.
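For instance, the first thing such a custom script would need to do is load the saved proposals back; a minimal sketch, assuming Python 2 and cPickle as py-faster-rcnn uses (path taken from the log earlier in this thread):

```python
import cPickle as pickle  # py-faster-rcnn runs on Python 2

# Path from the log above; adjust to your own output directory.
proposals_path = ('output/faster_rcnn_alt_opt/voc_2007_trainval/'
                  'zf_rpn_stage1_iter_80000_proposals.pkl')

with open(proposals_path, 'rb') as f:
    rpn_proposals = pickle.load(f)  # per-image proposal boxes saved by stage 1

print('Loaded proposals for {} images'.format(len(rpn_proposals)))
```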
@djdam I have set USE_FLIPPED as suggested, but I still get the error in bbox_transform.py.
@wubaorong Please post the error screenshot.
@ujsyehao I didn't save the error screenshot, but I copied the error information.
@wubaorong I get a floating point exception. Have you ever solved that problem? The training data I used is fast_rcnn_models.
@tongpinmo I solved this problem by decreasing the learning rate in solver.prototxt.
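For reference, that change amounts to lowering base_lr in the relevant solver.prototxt; a sketch with illustrative numbers (the shipped solvers typically start at 0.001, and the file path depends on which model you train):

```
# e.g. models/pascal_voc/VGG16/faster_rcnn_end2end/solver.prototxt -- path illustrative
base_lr: 0.0001   # lowered from 0.001; leave the other solver settings unchanged
```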
@sohamghoshmusigma Have you solved the problem? Did you find any errors in the annotation files? |
It solved my problem. Thank you!!!
Hi,
I'm facing problems running the training according to the instructions on the home page wiki.
The training starts and then it throws an out-of-memory error. To prevent that, I changed my batch size to 1. Even then it throws the error.
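(For clarity, by batch size I mean the images-per-minibatch setting in the training config; a minimal sketch, assuming the stock py-faster-rcnn key names:)

```yaml
TRAIN:
  IMS_PER_BATCH: 1   # images per minibatch
  BATCH_SIZE: 128    # RoIs sampled per image (stock default, shown for context)
```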
The command I'm trying to run is **./experiments/scripts/faster_rcnn_end2end.sh 0 VGG16 pascal_voc**
I'm reproducing the last part of the error here :
I0512 12:39:29.638978 28435 net.cpp:228] input-data does not need backward computation.
I0512 12:39:29.638983 28435 net.cpp:270] This network produces output loss_bbox
I0512 12:39:29.638989 28435 net.cpp:270] This network produces output loss_cls
I0512 12:39:29.638996 28435 net.cpp:270] This network produces output rpn_cls_loss
I0512 12:39:29.639003 28435 net.cpp:270] This network produces output rpn_loss_bbox
I0512 12:39:29.639050 28435 net.cpp:283] Network initialization done.
I0512 12:39:29.639220 28435 solver.cpp:60] Solver scaffolding done.
Loading pretrained model weights from data/imagenet_models/VGG16.v2.caffemodel
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:505] Reading dangerously large protocol message. If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432430
I0512 12:39:30.745797 28435 net.cpp:816] Ignoring source layer pool5
I0512 12:39:30.856762 28435 net.cpp:816] Ignoring source layer fc8
I0512 12:39:30.856811 28435 net.cpp:816] Ignoring source layer prob
Solving...
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:166: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
fg_inds = npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:177: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
bg_inds = npr.choice(bg_inds, size=bg_rois_per_this_image, replace=False)
$/Spring_2017/manojTest/py-faster-rcnn/tools/../lib/rpn/proposal_target_layer.py:184: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
labels[fg_rois_per_this_image:] = 0
F0512 12:39:31.373610 28435 syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
./experiments/scripts/faster_rcnn_end2end.sh: line 57: 28435 Aborted (core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt --weights data/imagenet_models/${NET}.v2.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/faster_rcnn_end2end.yml ${EXTRA_ARGS}
Any help, suggestion, or follow-up would be highly appreciated.