How to train Faster R-CNN on my own dataset ? #243

Open
JohnnyY8 opened this issue Jul 7, 2016 · 62 comments

Comments

@JohnnyY8

JohnnyY8 commented Jul 7, 2016

Hi everyone:
I want to train Faster R-CNN on my own dataset. Because Faster R-CNN does not use the selective search method, I commented out the selective search code. However, there are still some errors about roidb, and so on.
Can anybody help me? I am not quite sure what I should do to train Faster R-CNN. It is a little complicated for me.
Thanks so much!

@ednarb29

ednarb29 commented Jul 7, 2016

@JohnnyY8

Hi, I did the same thing. First, you should work through the code to see which functions are called where, and you should try demo.py. After that, the README has a section called "Beyond the demo" that explains the basic procedure.

Additionally, you should search the issues in this repo. There are quite a lot of similar issues that ask the same question.

Furthermore, here is really good documentation on how to train on your own dataset. This helped me a lot.

Finally, I'll sum up the main steps for you:

  1. Copy the structure of the PASCAL VOC dataset into FRCN_ROOT/data/, create a symbolic link, and lay out your data the same way as the PASCAL VOC data. That is the best way to avoid large code changes in the following steps.
  2. Create FRCN_ROOT/lib/datasets/<your_dataset>.py and <your_dataset>_eval.py, modeled on pascal_voc.py and voc_eval.py.
  3. Update FRCN_ROOT/lib/datasets/factory.py by adding a new entry for your own dataset.
  4. Adapt the models under FRCN_ROOT/models/ by copying and changing an existing one such as pascal_voc. Note that you have to take care of the paths within the solver and the number of classes in the train and test prototxt. I recommend starting with the ZF model and the end2end algorithm. The alt_opt algorithm is more complex and a better fit once you have more experience.
  5. Create a config file under FRCN_ROOT/experiments/cfgs, also by copying and updating an existing one.
  6. Create or update an experiment script under FRCN_ROOT/experiments/scripts by adapting it to your dataset.
  7. Start training and testing by running the experiment script created in the previous step.
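
Step 3 can be sketched roughly as follows. This mirrors the name-to-constructor registry idea used by factory.py, but `MyDataset`, the `mydataset_<split>` names, and the split list are hypothetical placeholders, and the stand-in class replaces a real imdb subclass only so that the snippet runs on its own.

```python
# Minimal sketch of a dataset registry in the style of lib/datasets/factory.py.
# MyDataset is a hypothetical stand-in for your own imdb subclass.

class MyDataset(object):
    def __init__(self, image_set):
        self.image_set = image_set
        self.name = 'mydataset_' + image_set


_sets = {}

# One entry per image set file, e.g. ImageSets/Main/train.txt and val.txt.
for split in ['train', 'val']:
    name = 'mydataset_{}'.format(split)
    # Bind split as a default argument so each lambda keeps its own value.
    _sets[name] = (lambda split=split: MyDataset(split))


def get_imdb(name):
    """Look up a dataset by the name passed to --imdb."""
    if name not in _sets:
        raise KeyError('Unknown dataset: {}'.format(name))
    return _sets[name]()


imdb = get_imdb('mydataset_train')
print(imdb.name)
```

With a registry like this, the string given to `--imdb` (for example `mydataset_train`) selects both the dataset class and the split.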

These are just the main steps I figured out while working with the framework. It will take some time to get into it, and several problems will come up when you use the framework with your own dataset. Most of them are already addressed in other issues in this repo.

It might also be very helpful to use a Python IDE that supports debugging.

Hope that helps. =)

@JohnnyY8
Author

JohnnyY8 commented Jul 7, 2016

Hi @ednarb29, thank you sincerely for your answer. I will try it now. Hope I can do it.
In addition, the VID dataset has a lot of frames, more than one million. I am not quite sure whether the code will create a cache file for the VID dataset. Will loading the frames take a long time every time?
Thank you again!

@ednarb29

ednarb29 commented Jul 7, 2016

You can easily check that: the file should be under FRCN_ROOT/data/cache/

Of course, if this file is huge, I guess even loading the cache file will take some time. Maybe you should debug that. A naive approach is to delete the cache file and start training again, so you can compare the time needed to create the dataset with the time needed to load the cache file.
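
A small sketch for inspecting the cache and timing a load, assuming the cache is a pickle file stored under the cache directory. The function names are made up for illustration and are not part of the framework.

```python
import os
import pickle
import time


def list_cache_files(cache_dir):
    """Return (filename, size_in_bytes) pairs for the files in cache_dir."""
    entries = []
    for fname in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, fname)
        if os.path.isfile(path):
            entries.append((fname, os.path.getsize(path)))
    return entries


def time_cache_load(cache_file):
    """Return (seconds, loaded_object) for loading one pickled cache file."""
    start = time.time()
    with open(cache_file, 'rb') as f:
        roidb = pickle.load(f)
    return time.time() - start, roidb
```

Comparing the measured load time with the time of a run that starts without a cache file shows whether the cache actually helps for a dataset of that size.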

@JohnnyY8
Author

JohnnyY8 commented Jul 7, 2016

Hi @ednarb29, I have tried the method you described. There are some errors about selective_search that I can't handle, like the following.
image
In my opinion, Faster R-CNN doesn't use selective search, so I would prefer to comment out the selective search code, such as "self.selective_search_roidb". But maybe that is not the right way to solve it. Could you please give me some suggestions?

@tiepnh

tiepnh commented Jul 8, 2016

@JohnnyY8: Can you paste the configuration information that is printed on the terminal here? I guess your configuration file still selects selective search as the proposal method.

@JohnnyY8
Author

JohnnyY8 commented Jul 8, 2016

@tiepnh Hi! You are right. Following the tutorial "https://github.com/deboc/py-faster-rcnn/tree/master/help", I used the command ($ echo 'MODELS_DIR: "$PY_FASTER_RCNN/models"' >> config.yml) to generate config.yml. But if I change it to "experiments/cfgs/faster_rcnn_end2end.yml", it looks OK.
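
A minimal sketch of the difference between the two config files, using plain dicts that mimic the nesting of a cfg YAML file. It assumes that `selective_search` is the proposal method used when the config does not set `TRAIN.PROPOSAL_METHOD`; the helper name is hypothetical.

```python
# Sketch: check whether a config would make training look for
# selective search proposals. The dicts stand in for parsed YAML files.

def uses_selective_search(cfg):
    """True if training would try to load selective search proposals."""
    train = cfg.get('TRAIN', {})
    # Assumption: 'selective_search' is the default when the key is absent.
    method = train.get('PROPOSAL_METHOD', 'selective_search')
    return method == 'selective_search'


# A config that only sets MODELS_DIR, like the generated config.yml.
tutorial_cfg = {'MODELS_DIR': '$PY_FASTER_RCNN/models'}
# A config that enables the RPN and uses ground-truth boxes.
end2end_cfg = {'TRAIN': {'HAS_RPN': True, 'PROPOSAL_METHOD': 'gt'}}

print(uses_selective_search(tutorial_cfg))  # True
print(uses_selective_search(end2end_cfg))   # False
```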

@JohnnyY8
Author

JohnnyY8 commented Jul 8, 2016

@tiepnh @ednarb29 I can start training now, and it looks close to right. I will check it on the validation set after training finishes. Thanks for your help, guys!!!
Another question is about factory.py, shown below. What does the split mean? If there are ["train", "val", "test"], what are they used for? train is for training, but what are val and test for?
image

@tiepnh

tiepnh commented Jul 8, 2016

@JohnnyY8: This array points to your image set files. In the code you pasted, there is no image set file for testing, or the same image set is used for both training and testing.
Example for pascal_voc:
The script file calls this command for training:
time ./tools/train_net.py --gpu ${GPU_ID} \
  --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt \
  --weights data/prdcv_models/${NET}.v2.caffemodel \
  --imdb ${TRAIN_IMDB} \
  --iters ${ITERS} \
  --cfg experiments/cfgs/faster_rcnn_end2end.yml \
  ${EXTRA_ARGS}
TRAIN_IMDB is "voc_2007_trainval", so it loads all images listed in the image set file ".....trainval.txt".
For testing, it uses TEST_IMDB="voc_2007_test", which loads the images listed in the image set file "....test.txt" to test the trained network.
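
How a split name maps to a list of images can be sketched like this, assuming a VOC-style layout with one image id per line in `ImageSets/Main/<split>.txt`. The function name and the directory layout are assumptions for illustration.

```python
import os


def load_image_set_index(data_path, image_set):
    """Read ImageSets/Main/<image_set>.txt and return the listed image ids."""
    image_set_file = os.path.join(data_path, 'ImageSets', 'Main',
                                  image_set + '.txt')
    if not os.path.exists(image_set_file):
        raise IOError('Path does not exist: {}'.format(image_set_file))
    with open(image_set_file) as f:
        # Skip blank lines so an empty trailing line is not treated as an id.
        return [line.strip() for line in f if line.strip()]
```

Calling it with `'trainval'` or `'test'` returns the ids from the corresponding file, so the split is nothing more than the name of the text file that lists the images.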

@JohnnyY8
Author

JohnnyY8 commented Jul 8, 2016

@tiepnh Cool! Your answer is very useful and clear! Thanks so much!
That means the ground truth of the PASCAL VOC 2007 test set is under the "Annotations" folder, right? Otherwise, it couldn't compute mAP after training finishes.
But I do not have the ground truth of the VID test set, so I use TEST_IMDB="VID_val". Does that mean it will test on the validation set?

@JohnnyY8
Author

JohnnyY8 commented Jul 9, 2016

@tiepnh Hi!
I use command to start training:

  • sudo ./tools/train_net.py --gpu 0 --iters 100000 --weights data/imagenet_models/ZF.v2.caffemodel --imdb VID_train --cfg ./experiments/cfgs/faster_rcnn_end2end.yml --solver models/pascal_voc/ZF/faster_rcnn_end2end/solver.prototxt

but still got following errors:
Traceback (most recent call last):
  File "./tools/train_net.py", line 112, in <module>
    max_iters=args.max_iters)
  File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 155, in train_net
    roidb = filter_roidb(roidb)
  File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 145, in filter_roidb
    filtered_roidb = [entry for entry in roidb if is_valid(entry)]
  File "/usr/local/caffes/xlw/faster-rcnn-third/tools/../lib/fast_rcnn/train.py", line 134, in is_valid
    overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'

Is there something wrong ?

@tiepnh

tiepnh commented Jul 11, 2016

@JohnnyY8:

> That means the ground truth of PASCAL VOC 2007 test set is under "Annotations" folder, right?

For both the test set and the train set, the PASCAL VOC ground truth is under Annotations.

TEST_IMDB just points to the set of images used for testing. So if you use the same image set for TRAIN_IMDB and TEST_IMDB, it will train and test the network on the same data.
Secondly, you have to write your own test function. See this tutorial: https://github.com/deboc/py-faster-rcnn/tree/master/lib/datasets

The "max_overlaps" error suggests that your data has no foreground or background ROIs. So please check the .py file that reads your dataset again.
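
A rough sketch of the validity check named in the traceback, written with plain lists instead of NumPy arrays so it runs on its own. The threshold values are assumed defaults and may differ in your cfg file; `entries_missing_overlaps` is a hypothetical helper for finding entries that lack the field entirely.

```python
# Sketch of the validity filter applied to roidb entries before training.
# Threshold values are assumed defaults; your cfg file may override them.
FG_THRESH = 0.5
BG_THRESH_HI = 0.5
BG_THRESH_LO = 0.1


def is_valid(entry):
    """An entry is usable if it has at least one foreground or background RoI."""
    overlaps = entry['max_overlaps']  # raises KeyError if the field is missing
    has_fg = any(o >= FG_THRESH for o in overlaps)
    has_bg = any(BG_THRESH_LO <= o < BG_THRESH_HI for o in overlaps)
    return has_fg or has_bg


def entries_missing_overlaps(roidb):
    """Indices of entries that would trigger KeyError: 'max_overlaps'."""
    return [i for i, entry in enumerate(roidb) if 'max_overlaps' not in entry]


roidb = [
    {'max_overlaps': [1.0, 1.0]},   # ground-truth boxes only: valid
    {'max_overlaps': [0.05]},       # neither foreground nor background
    {'boxes': [[0, 0, 10, 10]]},    # entry without the field
]
print(entries_missing_overlaps(roidb))   # [2]
print([is_valid(e) for e in roidb[:2]])  # [True, False]
```

Note the distinction: an entry with no foreground or background RoI is filtered out silently, while an entry that lacks the `max_overlaps` field raises the KeyError.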

@JohnnyY8
Author

@tiepnh Thank you so much! You are so nice.
I have found some bugs and restarted training.
Let's wait for the results.
Really, thanks for your help!

@JohnnyY8
Author

@tiepnh @ednarb29 Hi!
I restarted training, but a strange problem occurred. I printed some of the paths in train.txt, like this:
image
When I look at the information printed in the terminal, I notice that the data has been loaded many times! My teammate and I are pretty sure it has gone through the whole training set at least once. But this output shows it starting from 0000 again.
image
Could you please help me? We have been loading training data for more than 20 hours.
Thank you so much!

@ednarb29

First, I would suggest you start training and testing with a very small dataset (100 images and 1k iterations), so that you can debug training and testing quickly.

Does the problem occur during creation of the data set or during training?
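
One way to get such a small dataset is to write a shortened image set file and point a debug split at it. This is only a sketch: the function name and file names are hypothetical, and it assumes one image id per line.

```python
def write_debug_image_set(src_file, dst_file, max_images=100):
    """Copy the first max_images ids from src_file into dst_file."""
    with open(src_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    subset = ids[:max_images]
    with open(dst_file, 'w') as f:
        f.write('\n'.join(subset) + '\n')
    # Return the number of ids written so the caller can sanity-check it.
    return len(subset)
```

For example, `write_debug_image_set('train.txt', 'train_small.txt')` would produce a 100-image list that a `train_small` split could use, combined with a low `--iters` value.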

@JohnnyY8
Author

@ednarb29 I am not quite sure. Several times before, loading the data took about 2~4 hours (also with repeated loading). But this time is stranger. We did not change any code, we just restarted training. The data loading time is very long!

@JohnnyY8
Author

@ednarb29 Is the data loaded just once after you start training?

@ednarb29

I am not sure about that, because this kind of problem did not occur for me... When I had problems loading the dataset, I just removed the cache file, and that solved the problem in most cases, because changes to the original dataset are not propagated to the cache file. Sorry dude.
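
Removing a stale cache file can be scripted. The sketch below assumes the cache file is named `<imdb name>_gt_roidb.pkl`, following the PASCAL VOC dataset class; check your own dataset class for the actual name. The helper itself is hypothetical.

```python
import os


def remove_roidb_cache(cache_dir, imdb_name):
    """Delete <imdb_name>_gt_roidb.pkl from cache_dir; return True if removed."""
    # Assumed naming convention; adjust if your dataset class differs.
    cache_file = os.path.join(cache_dir, imdb_name + '_gt_roidb.pkl')
    if os.path.isfile(cache_file):
        os.remove(cache_file)
        return True
    return False
```

Running something like `remove_roidb_cache('data/cache', 'VID_train')` after every change to the annotations or image lists forces the roidb to be rebuilt on the next run.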

@deboc

deboc commented Jul 13, 2016

Hi @JohnnyY8,
I completely agree with ednarb29's idea: you should test with a (very) small dataset first.
Moreover, I'm pretty sure it's a bad idea to print anything for each data input. That may be the cause of the enormous additional loading time you saw.

@JohnnyY8
Author

@ednarb29 No need to be sorry, I should thank you!
I will remove the cache file and restart training! Thanks a lot for your help!

@JohnnyY8
Author

@deboc That is right. I will try it. Thank you!
Will printing anything really cause such a huge loading time?

@deboc

deboc commented Jul 13, 2016

I just bet it's not negligible.
You were saying the loading time rose from 4h to 20h, right? What did you change besides adding this print?

@JohnnyY8
Author

@deboc Oh, I see. We only added the print code. That is why it is so strange to us.

@ednarb29

ednarb29 commented Jul 13, 2016

Did removing the print command speed up the process?

And did removing the cache file and building the database again solve your problem with the KeyError: 'max_overlaps'?

@JohnnyY8
Author

@ednarb29 I haven't tried removing the print command. I really want to follow the process, and I guess the time it costs is negligible.
And removing the cache file works: my training now gets into the iterations again. Thanks a lot!

@ednarb29

Cool, so if it works fine you can close the issue? =)

@JohnnyY8
Author

@ednarb29 Sure, thank you very much!

@GeorgiAngelov

GeorgiAngelov commented Jul 25, 2016

@deboc, I have a quick question. I get the following error when I execute this command:

Command:
./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel --imdb inria_train --cfg config.yml

Error:

.....
I0725 04:10:00.437233  3494 net.cpp:816] Ignoring source layer conv4_3
I0725 04:10:00.437252  3494 net.cpp:816] Ignoring source layer relu4_3
I0725 04:10:00.437268  3494 net.cpp:816] Ignoring source layer pool4
I0725 04:10:00.437296  3494 net.cpp:816] Ignoring source layer conv5_1
I0725 04:10:00.437314  3494 net.cpp:816] Ignoring source layer relu5_1
I0725 04:10:00.437331  3494 net.cpp:816] Ignoring source layer conv5_2
I0725 04:10:00.437350  3494 net.cpp:816] Ignoring source layer relu5_2
I0725 04:10:00.437366  3494 net.cpp:816] Ignoring source layer conv5_3
I0725 04:10:00.437384  3494 net.cpp:816] Ignoring source layer relu5_3
I0725 04:10:00.437397  3494 net.cpp:816] Ignoring source layer conv5_3_relu5_3_0_split
I0725 04:10:00.437405  3494 net.cpp:816] Ignoring source layer roi_pool5
F0725 04:10:00.737687  3494 net.cpp:829] Cannot copy param 0 weights from layer 'fc6'; shape mismatch.  Source param shape is 4096 25088 (102760448); target param shape is 4096 18432 (75497472). To learn this layer's parameters from scratch rather than copying from a saved net, rename the layer.
*** Check failure stack trace: ***

From what I read, there is a mismatch between the parameter size in the saved model and the size the network has been set up to expect. The one cause I can imagine is that I am using the faster-rcnn VGG16 model (data/faster_rcnn_models/VGG16_faster_rcnn_final.caffemodel). Is it possible to use this model instead of the one you mentioned (data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel)?

P.S. Thank you for that awesome tutorial !

@deboc

deboc commented Jul 25, 2016

Hi GeorgiAngelov,
I see you are using a final faster-rcnn caffemodel as the pretrained network, but those don't have any fc6 layer, hence your issue.
The classical way for another dataset would be to use a pretrained caffe classifier for your data, and use its train.prototxt to build a faster-rcnn model.
So I suggest you investigate which classifier was used in your pretrained model, and provide that caffemodel (e.g. VGG_CNN_M_1024.v2.caffemodel) instead of the faster-rcnn one in the weights option.
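
The two fc6 shapes in the log can be checked with a little arithmetic. This sketch assumes that both networks feed fc6 from a 512-channel RoI-pooled feature map. Under that assumption the saved weights expect a 7x7 pooled map and the target network a 6x6 one, which would be consistent with the weights file and the network definition coming from different base networks.

```python
# Sketch: factor the two fc6 input sizes from the log into channels * h * w.
# Assumption: both networks feed fc6 from a 512-channel RoI-pooled feature map.
CHANNELS = 512


def pooled_side(fc6_inputs, channels=CHANNELS):
    """Return the square pooling side s with channels * s * s == fc6_inputs."""
    area = fc6_inputs // channels
    side = int(round(area ** 0.5))
    if channels * side * side != fc6_inputs:
        raise ValueError('not a square pooled map for this channel count')
    return side


print(pooled_side(25088))  # 7 (source weights)
print(pooled_side(18432))  # 6 (target network)

# The element counts printed in the log follow from the 4096 fc6 outputs.
print(4096 * 25088, 4096 * 18432)  # 102760448 75497472
```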

@vaklyuenkov

vaklyuenkov commented Nov 23, 2016

inds = np.reshape(inds, (-1, 2)): because the second dimension of the reshape is 2, you should use only an even number of images in the dataset.
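
A pure-Python stand-in for that reshape shows why an odd image count fails. `pair_indices` is a hypothetical helper, not framework code; it only imitates the behavior of `np.reshape(inds, (-1, 2))`, which raises a ValueError when the number of elements is not divisible by 2.

```python
def pair_indices(inds):
    """Group indices into rows of two, like np.reshape(inds, (-1, 2))."""
    if len(inds) % 2 != 0:
        # np.reshape raises ValueError here: the size is not divisible by 2.
        raise ValueError(
            'cannot reshape {} indices into rows of 2'.format(len(inds)))
    return [inds[i:i + 2] for i in range(0, len(inds), 2)]


print(pair_indices([0, 1, 2, 3]))  # [[0, 1], [2, 3]]
try:
    pair_indices([0, 1, 2])
except ValueError as err:
    print('odd image count:', err)
```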

@dantp-ai

dantp-ai commented Dec 1, 2016

@GeorgiAngelov The tutorial of @deboc uses the image_net model VGG_CNN_M_1024.v2.caffemodel. You can get it by following the steps here https://github.com/deboc/py-faster-rcnn#download-pre-trained-imagenet-models.

@arasharchor

arasharchor commented Dec 14, 2016

@ednarb29

> first I would suggest you to start training and testing with a very little data set (100 images and 1k iterations), that you can debug the training and testing quite fast.
>
> Does the problem occur during creation of the data set or during training?

Thanks I had the same problem:

overlaps = entry['max_overlaps']
KeyError: 'max_overlaps'

I deleted the cache file and it is now running.

@MyVanitar

@ednarb29

What tool should I use to create imdb files?

@ArturoDeza

ArturoDeza commented Feb 16, 2017

@ednarb29 , removing cache file fixed problem for me regarding the max_overlaps

@MyVanitar

@ArturoDeza
What tool/code have you used to make imdb file for training?

@ArturoDeza

ArturoDeza commented Feb 16, 2017

@VanitarNordic , I don't think there's a quick recipe for that. I've been following this setup:
https://github.com/smallcorgi/Faster-RCNN_TF
You will have to modify some lines of code in factory.py, copy the pascal_voc.py file to your my_dataset.py file, and modify the lines of code regarding the number of training classes. Besides that, you also need to annotate all your images with .xml files.
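
The class-related lines can be sketched as below. The class tuple is a hypothetical example, and the two derived sizes assume a model whose classification layer has one output per class and whose box regression layer has four outputs per class, as in the pascal_voc model definitions.

```python
# Sketch: the class list of a custom dataset and the sizes derived from it.
# The class names are hypothetical; '__background__' is listed first.
classes = ('__background__', 'person')

num_classes = len(classes)
# Map each class name to its integer label.
class_to_ind = dict(zip(classes, range(num_classes)))

cls_score_outputs = num_classes        # one score per class
bbox_pred_outputs = 4 * num_classes    # four box coordinates per class

print(class_to_ind['person'])                              # 1
print(num_classes, cls_score_outputs, bbox_pred_outputs)   # 2 2 8
```

Whenever the class tuple changes, every place in the model definition that encodes the class count has to change with it.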

@MyVanitar

@ArturoDeza
Thanks, I actually have annotated files, but I'm stuck on imdb creation :-(

@ArturoDeza

@VanitarNordic What is the error you've been getting? You should create a new issue with the error you get when you run the end2end training script, that way we can be more helpful.

@MyVanitar

@ArturoDeza
No error, but I don't understand how the model gets trained on a custom dataset, because end-to-end training does not have a dataset input parameter.

@roshanpati

Hi!
I am getting the following error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "./tools/train_faster_rcnn_alt_opt.py", line 129, in train_rpn
    max_iters=max_iters)
  File "/home/siplab/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 160, in train_net
    model_paths = sw.train_model(max_iters)
  File "/home/siplab/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 101, in train_model
    self.solver.step(1)
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 144, in forward
    blobs = self._get_next_minibatch()
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 63, in _get_next_minibatch
    return get_minibatch(minibatch_db, self._num_classes)
  File "/home/siplab/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 22, in get_minibatch
    assert(cfg.TRAIN.BATCH_SIZE % num_images == 0),
ZeroDivisionError: integer division or modulo by zero

Can anyone help me with that?

@medhani

medhani commented Jun 9, 2017

I'm using the INRIA Person dataset. After running the command below

./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml

I got an error:
  File "./tools/train_faster_rcnn_alt_opt.py", line 62
    print 'Loaded dataset {:s} for training'.format(imdb.name)
    ^
SyntaxError: invalid syntax

Can you please let me know the reason behind this error?
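
A SyntaxError on a `print '...'` line is what a Python 3 interpreter reports for a Python 2 style print statement, so one likely cause (an assumption, since the post does not show the interpreter version) is that the script was started with Python 3. The check below is a small sketch with a hypothetical helper name.

```python
import sys


def parses(source):
    """Return True if this interpreter can compile the given source line."""
    try:
        compile(source, '<snippet>', 'exec')
        return True
    except SyntaxError:
        return False


py2_line = "print 'Loaded dataset {:s} for training'.format('demo')"
py3_line = "print('Loaded dataset {:s} for training'.format('demo'))"

print('interpreter major version:', sys.version_info[0])
print('print statement parses:', parses(py2_line))
print('print function parses:', parses(py3_line))
```

If the first `parses` call reports False, the interpreter is Python 3, and the script either needs a Python 2 interpreter or a port of its print statements.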

@medhani

medhani commented Jun 16, 2017

Do you have any solutions for this error?
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "tools/train_faster_rcnn_alt_opt.py", line 129, in train_rpn
    max_iters=max_iters)
  File "/home/medhani/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 160, in train_net
    model_paths = sw.train_model(max_iters)
  File "/home/medhani/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 101, in train_model
    self.solver.step(1)
  File "/home/medhani/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 144, in forward
    blobs = self._get_next_minibatch()
  File "/home/medhani/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 63, in _get_next_minibatch
    return get_minibatch(minibatch_db, self._num_classes)
  File "/home/medhani/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 27, in get_minibatch
    assert(cfg.TRAIN.BATCH_SIZE % num_images == 0),
ZeroDivisionError: integer division or modulo by zero

Thanks

@Seanmatthews

@medhani It's not finding any images, which means either the path to your images is wrong, or there are no images listed in your image set text file.
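
Both causes can be checked before training starts. In this sketch the function names, the `.jpg` extension, and the flat image directory are assumptions for illustration; `check_batch_size` shows how an empty minibatch turns into the ZeroDivisionError from the traceback.

```python
import os


def missing_images(image_dir, image_set_file, ext='.jpg'):
    """Return (listed_ids, ids whose image file is missing from image_dir)."""
    with open(image_set_file) as f:
        ids = [line.strip() for line in f if line.strip()]
    missing = [i for i in ids
               if not os.path.isfile(os.path.join(image_dir, i + ext))]
    return ids, missing


def check_batch_size(batch_size, num_images):
    """Fail with a readable message instead of ZeroDivisionError."""
    if num_images == 0:
        # batch_size % 0 is what raises ZeroDivisionError in the traceback.
        raise ValueError('no images in the minibatch: check the image set '
                         'file and the image paths')
    return batch_size % num_images == 0
```

An empty `ids` list points at the image set text file, while a non-empty `missing` list points at the image paths.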

@medhani

medhani commented Jun 16, 2017

Thanks Sean, I feel like there is a problem with my annotation file.
screenshot from 2017-06-16 15 47 12

I'm training my network for spider detection. The annotation files are in .xml format. Is this the correct structure for the .xml file?

@medhani

medhani commented Jun 18, 2017

@Roskgp96 Have you been able to find a solution for the error below?
line 27, in get_minibatch
assert(cfg.TRAIN.BATCH_SIZE % num_images == 0),
ZeroDivisionError: integer division or modulo by zero

@ivalab

ivalab commented Jun 29, 2017

I used another modification of Faster R-CNN in TF, which saves the permutation into snapshots. In my case, I traced the code and found that I was using an OLD permutation loaded with my snapshot. That means that if you changed the number of testing or training images, you may index outside the permutation array, get back zero indices, and then load nothing from roidb. A simple solution is to delete all snapshots, or to modify the permutation in your train_val.py after it is loaded. Hope it helps.
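
The second fix can be sketched as a guard that keeps a restored permutation only when it still matches the current roidb size. Both function names are hypothetical, and how the permutation is stored in a snapshot depends on the port being used.

```python
import random


def valid_permutation(perm, num_entries):
    """True if perm is a reordering of range(num_entries)."""
    return sorted(perm) == list(range(num_entries))


def restore_or_reset(saved_perm, num_entries, seed=None):
    """Keep a saved permutation only if it still matches the roidb size."""
    if saved_perm is not None and valid_permutation(saved_perm, num_entries):
        return list(saved_perm)
    # The saved permutation is missing or stale: build a fresh one.
    rng = random.Random(seed)
    perm = list(range(num_entries))
    rng.shuffle(perm)
    return perm
```

With a guard like this, a snapshot taken with 5 images and resumed with 3 gets a fresh permutation instead of indexing past the end of the roidb.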

@jzyztzn

jzyztzn commented Jul 11, 2017

@ivalab Thanks. After I deleted all the .pyc files under "$FRCN/lib/", it trains well without the ZeroDivisionError. @medhani Have you solved the problem? You could also try this method.
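
Deleting the compiled files can be done with a short script. This is a sketch with a hypothetical function name; point it at the lib directory of your checkout.

```python
import os


def remove_pyc_files(root):
    """Delete every .pyc file below root and return the removed paths."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            if fname.endswith('.pyc'):
                path = os.path.join(dirpath, fname)
                os.remove(path)
                removed.append(path)
    return removed
```

Python recompiles the modules on the next import, so removing these files only costs a slightly slower first start.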

@madhu-kt

madhu-kt commented Sep 15, 2017

@deboc Apologies for digging up an old discussion topic, but you mentioned that we have the option to reuse a pre-trained model that already classifies our objects OR train our own model from scratch. Would that put any restrictions on how we train our faster R-CNN? Would the joint approximation (end-2-end) approach be better than the alternate training method?

@fireden

fireden commented Dec 6, 2017

Hi,
I'm trying to train the net on a dataset I created myself, using video with a microphone. I believe I did everything as ednarb29 wrote (starting from the model I got from training on VOC2007), but the results are really surprising:

  1. Testing a picture from my dataset gives me the proper region and class=microphone (the only class, plus background, that I left during training) with probability 1.0.
  2. Testing a picture not from my dataset gives me nothing. I think that can be explained by my dataset being good enough and too small (hundreds of pictures of one mic).
  3. What really surprised me is that any picture from the VOC dataset gives me bounding boxes around the VOC objects, with the microphone label and a lower probability.
    What have I done wrong?

@mantou22

Excuse me. After training my own model, I ran demo.py with it to detect objects in images. When the image is very large (5000 x 3000 pixels), the output is completely white, including the image itself. If the image is not too large, there is no problem. What is the reason?

@JohnnyY8
Author

JohnnyY8 commented Oct 4, 2018

@mantou22 sorry, I do not understand "the results were all white"?

@tjzjp

tjzjp commented Oct 14, 2018

> I'm using INRIA Person data set. After running below command
>
> ./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml
>
> I got a error
> File "./tools/train_faster_rcnn_alt_opt.py", line 62
> print 'Loaded dataset {:s} for training'.format(imdb.name)
> ^
> SyntaxError: invalid syntax
>
> Can you please let me know reason behind this error

Have you fixed it?
I met the same problem.

@frk1993

frk1993 commented Dec 8, 2018

> > I'm using INRIA Person data set. After running below command
> > ./tools/train_faster_rcnn_alt_opt.py --gpu 0 --net_name INRIA_Person --weights data/imagenet_models/VGG_CNN_M_1024.v2.caffemodel --imdb inria_train --cfg config.yml
> > I got a error
> > File "./tools/train_faster_rcnn_alt_opt.py", line 62
> > print 'Loaded dataset {:s} for training'.format(imdb.name)
> > ^
> > SyntaxError: invalid syntax
> > Can you please let me know reason behind this error
>
> have you fixed it?
> I met the same problem

Hey, I have the same problem. Have you fixed it?
