Finetuning out-of-memory and lack of output #682

Closed · htzheng opened this issue Jul 13, 2014 · 12 comments
@htzheng commented Jul 13, 2014

Hi, I plan to apply the pretrained ImageNet model to a 2-class classification task, so I need to modify the fc8 layer and then finetune the network. I am following shelhamer's suggestion in #186.

Here is what I do:

  1. prepare the input data
  2. change the data layer and the fc8 layer in imagenet_train.prototxt and imagenet_val.prototxt (see the prototxt sketch below)
  3. type finetune_net imagenet_solver.prototxt caffe_reference_imagenet_model in the terminal (caffe_reference_imagenet_model is the 244MB pretrained model file)

After that, the terminal does not respond for a long time and produces no output.

I'm new to Caffe; could someone tell me how to finetune an existing model? Thanks for any reply!
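
For context, the fc8 edit in step 2 usually amounts to giving the layer a new name, so that the 1000-way pretrained weights are not copied into it, and setting num_output to the new number of classes. A minimal sketch, assuming a CaffeNet-style prototxt; the name fc8_2class is my own placeholder, and the newer layer syntax is shown (older Caffe releases write layers { type: INNER_PRODUCT } instead):

layer {
  name: "fc8_2class"        # renamed so the pretrained fc8 weights are not copied in
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8_2class"
  inner_product_param {
    num_output: 2           # two classes instead of the original 1000
  }
}

Any loss or accuracy layer that used fc8 as a bottom should point at fc8_2class as well.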

@htzheng (Author) commented Jul 13, 2014

Update: in step (3), finetune_net imagenet_solver.prototxt caffe_reference_imagenet_model shows this:

F0713 21:28:27.059324 3532 syncedmem.cpp:47] Check failed: error == cudaSuccess (2 vs. 0) out of memory
*** Check failure stack trace: ***
@ 0x7fb1ce8da9fd google::LogMessage::Fail()
@ 0x7fb1ce8dc89d google::LogMessage::SendToLog()
@ 0x7fb1ce8da5ec google::LogMessage::Flush()
@ 0x7fb1ce8dd1be google::LogMessageFatal::~LogMessageFatal()
@ 0x447284 caffe::SyncedMemory::mutable_gpu_data()
@ 0x43c9d2 caffe::Blob<>::mutable_gpu_diff()
@ 0x4934f2 caffe::InnerProductLayer<>::Backward_gpu()
@ 0x42e403 caffe::Net<>::Backward()
@ 0x445bd7 caffe::Solver<>::Solve()
@ 0x40a0c8 main
@ 0x7fb1cbf56ec5 (unknown)
@ 0x40be37 (unknown)
Aborted (core dumped)

Is there any clue?

@sguada (Contributor) commented Jul 13, 2014

The message is clear: you are out of memory on your GPU card, so you will need to reduce the batch_size.

Sergio

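For anyone else hitting this: batch_size is set in the data layers of the train/val prototxts (imagenet_train.prototxt / imagenet_val.prototxt here), not in the solver. A rough sketch of the relevant layer in the newer prototxt syntax, assuming an LMDB-backed data layer; the source path and the value 32 are placeholders to adjust for your setup:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  data_param {
    source: "path/to/your_train_lmdb"   # placeholder path
    backend: LMDB
    batch_size: 32                      # lower this until the net fits in GPU memory
  }
}

Activation memory grows roughly linearly with batch_size, so keep halving it until the out-of-memory check stops firing.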

@htzheng (Author) commented Jul 14, 2014

@sguada Thank you! I changed to a smaller batch_size and the problem is solved.

There is a tiny issue with finetune_net.bin on Ubuntu: LOG(INFO) does not print anything to the terminal. I changed LOG(INFO) to std::cout, and the finetuning code works well.

@shelhamer (Member)

Happy that you figured it out. For logging, do

GLOG_logtostderr=1 finetune_net imagenet_solver.prototxt caffe_reference_imagenet_model

shelhamer changed the title from "Fine tuning issue" to "Finetuning out-of-memory and lack of output" on Jul 14, 2014
@Prasanna1991

I am trying to implement the DeepFace model. The memory required for data is 49724060 for the test net (with the batch size finally down to 1) and 49724060 for the train net (again with batch size 1), which makes the total memory required around 94 MB. But I am still getting the 'out of memory' error. My GPU is an NVIDIA GeForce GT 650M. Checking the GPU status with nvidia-smi -q, I can see that the total memory (FB) is 2047 MiB and the free memory is 1646 MiB. Can anyone point out what I am missing?

@ToruHironaka

I got the same GPU error when I trained the ImageNet example. Reducing the batch size from 256 to 4 in train_val.prototxt fixed my GPU memory shortage, but I then had second thoughts and changed solver_mode from GPU to CPU, because my GPU has only 512MB of memory (it is an old MacBook Pro). CPU mode seems to train without the problem. I understand GPU mode is meant for massively parallel computing, but I only have a few PCs and laptops and I am just learning Caffe for my studies. Should I always use GPU mode? GPU mode accelerates computation, but that may not matter much while I am still learning. Still, I will have to build a new PC for training with Caffe, so please advise me on what type of GPU is good enough for it. I am thinking of buying an NVIDIA GeForce GTX 960 with 2GB of memory; I heard a GPU with 3GB or more of memory is sufficient for Caffe.
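
For reference, the CPU/GPU switch mentioned above lives in the solver prototxt rather than in train_val.prototxt. A minimal sketch, assuming a standard solver file; the net path is a placeholder and the usual learning-rate, momentum, and snapshot settings are omitted:

net: "models/your_model/train_val.prototxt"   # placeholder path to the train/val prototxt
solver_mode: CPU                              # or GPU; CPU avoids the GPU out-of-memory abort at the cost of speed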

@cervantes-loves-ai

When I try to install fast-rcnn, I get an error like this. How do I solve it?

Loaded network /home/rvlab/Music/fast-rcnn/data/fast_rcnn_models/vgg16_fast_rcnn_iter_40000.caffemodel

Demo for data/demo/000004.jpg
F0718 22:09:35.547049 13693 syncedmem.cpp:51] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
Aborted (core dumped)

@monajalal

I get this error when running the following:
jalal@klein:~/computer_vision/py-faster-rcnn$ ./tools/demo.py

I0823 20:13:31.522610 40008 layer_factory.hpp:77] Creating layer relu5_1
I0823 20:13:31.522629 40008 net.cpp:106] Creating Layer relu5_1
I0823 20:13:31.522644 40008 net.cpp:454] relu5_1 <- conv5_1
I0823 20:13:31.522662 40008 net.cpp:397] relu5_1 -> conv5_1 (in-place)
I0823 20:13:31.522843 40008 net.cpp:150] Setting up relu5_1
I0823 20:13:31.522869 40008 net.cpp:157] Top shape: 1 512 14 14 (100352)
I0823 20:13:31.522883 40008 net.cpp:165] Memory required for data: 112795648
I0823 20:13:31.522891 40008 layer_factory.hpp:77] Creating layer conv5_2
I0823 20:13:31.522902 40008 net.cpp:106] Creating Layer conv5_2
I0823 20:13:31.522909 40008 net.cpp:454] conv5_2 <- conv5_1
I0823 20:13:31.522922 40008 net.cpp:411] conv5_2 -> conv5_2
I0823 20:13:31.529803 40008 net.cpp:150] Setting up conv5_2
I0823 20:13:31.529841 40008 net.cpp:157] Top shape: 1 512 14 14 (100352)
I0823 20:13:31.529849 40008 net.cpp:165] Memory required for data: 113197056
I0823 20:13:31.529868 40008 layer_factory.hpp:77] Creating layer relu5_2
I0823 20:13:31.529887 40008 net.cpp:106] Creating Layer relu5_2
I0823 20:13:31.529903 40008 net.cpp:454] relu5_2 <- conv5_2
I0823 20:13:31.529920 40008 net.cpp:397] relu5_2 -> conv5_2 (in-place)
F0823 20:13:31.530177 40008 cudnn_relu_layer.cpp:13] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0)  CUDNN_STATUS_INTERNAL_ERROR
*** Check failure stack trace: ***
Aborted (core dumped)

@sguada How should I reduce the batch size? In which file? Can you show an example?

@jodusan commented Oct 11, 2016

@monajalal Have you figured out which file it was? :D

@onkarganjewar

@Dulex123 @monajalal If you're using py-faster-rcnn, you can change the batch_size in lib/fast_rcnn/config.py.

Also, I would suggest you take a look at this issue. Hope this helps!

@wsz912 commented Mar 30, 2017

Using a smaller batch_size will work.

@satyakesav

I changed my batch_size from 128 to 32, but it still fails. Do we need to rebuild after changing the config file?
