coco training problem #85
Comments
Please describe your environment clearly, including the CUDA version and the Caffe version (make sure you have read the README and use the Caffe version we suggest), and whether this situation is reproducible. Also make sure your images are not corrupted.
My Linux version is 3.10.0-327.x86_64 ([email protected]) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)) #1 SMP Tue Dec 29 19:54:05 CST 2015. I use CUDA 7.5 and the Caffe version the README suggests. When I run ./experiments/scripts/rfcn_end2end_ohem.sh 1 ResNet-101 coco, the output looks like the log below.
It quickly gets into the situation shown above. Can you please help me fix this? Thanks!
This situation is reproducible. I just followed the README: downloaded the code, compiled it, downloaded the COCO 2014 train and val data, did not change anything, and simply ran it, and the problem happens. I don't know where it goes wrong.
Can you help me? I have been confused for many days!
Maybe it's the numpy version. In the minibatch sampling code, if you use a numpy version higher than 1.10.1 and you use astype to convert a float to an int, you may get an illegal value (e.g. NaN or a negative number). This can leave you with 0 positive samples and may cause the problem you encounter.
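To make the failure mode concrete, here is a minimal standalone sketch (not code from the repository; the exact value produced by the cast depends on the platform and numpy version):

```python
import numpy as np

# As noted in the next comment, rois_per_image is np.inf when training with OHEM.
rois_per_image = np.inf
fg_fraction = 0.25

# Casting an infinite float to an integer is undefined behaviour; recent numpy
# versions do it silently (or with only a RuntimeWarning) and typically return
# a huge negative number instead of raising an error.
fg_rois_per_image = np.round(fg_fraction * rois_per_image).astype(np.int64)
print(fg_rois_per_image)  # e.g. -9223372036854775808
```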
@2017hack fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image).astype(np.int) works well for normal numbers, but when training with OHEM, rois_per_image is np.inf, so if you apply that cast, fg_rois_per_image becomes a negative number. What you should do is guard it with an if statement, like: if rois_per_image == np.inf: (see the sketch below). By the way, line 26 of lib/roi_data_layer/minibatch.py has the same problem, but it only matters if you use Selective Search.
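A minimal sketch of that guard, wrapped in a helper for readability (the function name and surrounding code are illustrative, not the repository's actual code):

```python
import numpy as np

def fg_rois_budget(rois_per_image, fg_fraction):
    """Foreground-ROI budget for one image, safe for OHEM training.

    With OHEM, rois_per_image is np.inf (all ROIs are kept so the hard ones
    can be mined later); casting that to an integer silently produces a
    negative number, so the cast is skipped in that case.
    """
    if rois_per_image == np.inf:
        return np.inf  # keep the foreground budget unbounded as well
    return int(np.round(fg_fraction * rois_per_image))
```

If the downstream sampling code clamps with something like min(fg_rois_per_image, number_of_foreground_candidates), the np.inf budget then simply means "use all available foreground candidates" in the OHEM case, while the plain integer path is unchanged for non-OHEM training.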
@sherryxie1 Thank you very much! It perfectly solved my problem. My numpy version is 1.14.1, higher than 1.10.1, so when I trained Faster R-CNN I hit the "TypeError: 'numpy.float64' object cannot be interpreted as an index" problem. Following the solution given in py-faster-rcnn/issues/481, I used astype in a few places to convert float to int and successfully trained Faster R-CNN, as well as rfcn_end2end without OHEM. However, when it came to rfcn_end2end_ohem I got the bbox_loss = 0 problem. Thank you for your solution!
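For reference, a tiny reproduction of that TypeError and the int() cast described above (a standalone sketch assuming numpy >= 1.12, not code from the repository; the variable names are illustrative):

```python
import numpy as np
import numpy.random as npr

fg_rois_per_this_image = np.round(0.25 * 128)  # 32.0, a numpy float64
fg_inds = np.arange(100)

# On newer numpy, passing a float where an integer count is expected raises:
#   TypeError: 'numpy.float64' object cannot be interpreted as an index
# npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)

# The fix discussed in py-faster-rcnn/issues/481 is an explicit cast to int:
sampled = npr.choice(fg_inds, size=int(fg_rois_per_this_image), replace=False)
print(sampled.shape)  # (32,)
```

The catch, as explained above, is that such a blanket cast breaks again under OHEM, where the value being cast is np.inf rather than a finite float.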
@VersionHX |
@Huangswust182 Yes, I did! @sherryxie1's solution works well on my machine~
@VersionHX |
There doesn't seem to be a minibatch.py file in the TensorFlow package to make these changes in. Where do I change it to int()?
When I train using the following command:
./experiments/scripts/rfcn_end2end_ohem.sh 1 ResNet-101 coco
it prints the following:
I0918 14:51:27.386899 22785 net.cpp:775] Ignoring source layer prob
Solving...
I0918 14:51:28.109838 22785 solver.cpp:228] Iteration 0, loss = 5.13607
I0918 14:51:28.109882 22785 solver.cpp:244] Train net output #0: accuarcy = 0
I0918 14:51:28.109892 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:51:28.109900 22785 solver.cpp:244] Train net output #2: loss_cls = 4.4268 (* 1 = 4.4268 loss)
I0918 14:51:28.109906 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.698125 (* 1 = 0.698125 loss)
I0918 14:51:28.109913 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.0111453 (* 1 = 0.0111453 loss)
I0918 14:51:28.109922 22785 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
I0918 14:52:03.514792 22785 solver.cpp:228] Iteration 100, loss = 2.51846
I0918 14:52:03.514839 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:52:03.514849 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:52:03.514855 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:52:03.514861 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.559937 (* 1 = 0.559937 loss)
I0918 14:52:03.514868 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 1.95852 (* 1 = 1.95852 loss)
I0918 14:52:03.514873 22785 sgd_solver.cpp:106] Iteration 100, lr = 0.0005
I0918 14:52:39.039602 22785 solver.cpp:228] Iteration 200, loss = 0.823245
I0918 14:52:39.039641 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:52:39.039651 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:52:39.039657 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:52:39.039664 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.218862 (* 1 = 0.218862 loss)
I0918 14:52:39.039669 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.604383 (* 1 = 0.604383 loss)
I0918 14:52:39.039674 22785 sgd_solver.cpp:106] Iteration 200, lr = 0.0005
I0918 14:53:14.707566 22785 solver.cpp:228] Iteration 300, loss = 0.940254
I0918 14:53:14.707613 22785 solver.cpp:244] Train net output #0: accuarcy = 1
I0918 14:53:14.707623 22785 solver.cpp:244] Train net output #1: loss_bbox = 0 (* 1 = 0 loss)
I0918 14:53:14.707629 22785 solver.cpp:244] Train net output #2: loss_cls = 0 (* 1 = 0 loss)
I0918 14:53:14.707635 22785 solver.cpp:244] Train net output #3: rpn_cls_loss = 0.338957 (* 1 = 0.338957 loss)
I0918 14:53:14.707641 22785 solver.cpp:244] Train net output #4: rpn_loss_bbox = 0.601298 (* 1 = 0.601298 loss)
loss_bbox is 0 from the start and loss_cls also drops to 0 after the first iterations, while accuracy jumps to 1. I don't know why this happens. Please help!