Incompatibilities in BatchNorm. #276

mathmanu · 2016-11-23T11:17:31Z

BatchNorm in NVIDIA/caffe is not compatible with BatchNorm in BVLC/caffe.

There is also no compatibility b/w engine:CAFFE and engine:CUDNN BatchNorm in NVIDIA/caffe itself. (Blob shapes are different).

Kindly fix these issues - so that we can use pre-trained models for fine tuning.

Please refer to:
NVIDIA/DIGITS#629
and
BVLC#3919
as well where similar issues are discussed.

I have some suggestions to fix these issues:

Rename the NVIDIA/caffe's BatchNorm to BatchNormScale, since it now includes Scaling as well.
Put a check/exit in CUDNN BatchNormScale reshape function, if the top and bottom blobs are same - so that the user will get a warning.
Fix the inconsistency in blob shape between engine:CAFFE and engine:CUDNN
Currenty I have to specify so many parameters in the new BatchNorm layer. Thi is un-necessary.

layer {
  name: "bn_conv1"
  bottom: "conv1"
  top: "conv1"
  type: "BatchNorm"
  param { #scale
    lr_mult: 1
    decay_mult: 1
  }
  param { #shift/bias
    lr_mult: 1
    decay_mult: 1
  } 
  param { #global mean
    lr_mult: 0
    decay_mult: 0
  }
  param { #global var
    lr_mult: 0
    decay_mult: 0
  }

  batch_norm_param {
    scale_filler {
    type: "constant"
    value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    engine: CUDNN
  }
}

(4a). In BatchNormScale, If you change the oder of the blobs to: gloabl_mean, and global_variance, scale, bias, global_counter, then I don't have to specify 4 param fields for lr_mult and decay_mult - but only 2.

(4b). If the definition of scale and bias fields in BatchNormParameter is changed to:
optional float scale_filler = 5 [default = 1];
optional float bias_filler = 6 [default = 0];
Then I don't have to specify these also in the prototxt.

Keep the original BatchNorm from BVLC/caffe as it is, un-touched - so that compatibility to BVLC/caffe is not affected and old BVLC/caffe models can be used for fine tuning. If possible, provide a CUDNN version of this original BatchNorm without scaling as well, so that it can be accelerated.

The text was updated successfully, but these errors were encountered:

RSly · 2017-02-06T20:17:06Z

hi,
there are some valuable suggestions here.

I wonder if anything has improved with regards to batch normalization in NVCaffe?

thanks

lukeyeager · 2017-02-06T20:54:43Z

/cc @borisgin

borisgin · 2017-02-06T22:07:26Z

I wrote new BatchNorm which is

backward compatible both with previous nvcaffe-0.15 and with BVLC caffe .prototxt and snapshots (can read both)
supports BN in-place (the same input and output layers)
no need to specify lr_mult... in .prototxt.
supports cudnn 6

It will be released with new nvcaffe

RSly · 2017-02-07T09:36:11Z

Thanks @borisgin
Is it possible to already test it on a development branch?
Also with all the different versions, it would be nice to create and maintain a BN layer definition template :)

mathmanu · 2017-03-24T07:01:20Z

Thanks @borisgin for the update.

Will there be reverse compatibility as well i.e. can the NVIDIA/caffe CUDNN trained models be used for fine tuning in BVLC/caffe CPU and GPU modes? This is because there are several frameworks (such as FasterRCNN, SSD etc) built around BVLC/caffe, and I am wondering whether they will understand this new model trained in NVIDIA/caffe.

If complete compatibility is not there, isn't better to give this layer a new name, such as BatchNormScale, and make sure that the layers BatchNorm and BatchNormScale co-exist. That way we will be able to check the type and write a utility to do the reverse conversion, if needed.

borisgin · 2017-03-25T15:51:00Z

You can use old BVLC prototxt and models with NV-caffe, so you can train old models on new nv-caffe. you can't load new nvcaffe models into BVLC caffe since BVLC caffe does not have fused BN and scale layer. having different name for the layer would not help, since for this you should have such layer in BVLC caffe. it would be much simpler just to replace old BVLC BN layer with new one from Nvcaffe.

fslzj · 2017-04-25T16:00:45Z

The gpu_diff() of blobs[2] and blobs[3] should be always 0. It is necessary set the 3rd and 4th param (blobs_[2] and blobs[3]) as param{ lr_mult:0 decay_mult:0}?
In addition, what does the blobs_[4] represent? It seems not be never used.

szm-R · 2017-10-07T19:45:12Z

Hi, @borisgin did you merged your batch norm layer with nvcaffe? because I can't still use models trained with bvlc on NV and it's really annoying :(

borisgin · 2017-10-07T20:29:24Z

Yes, it merged, Can you send a link on the Model which you use, please?

szm-R · 2017-10-08T07:47:35Z

The version of NVcaffe I'm using is 0.15.14 and this is one of the models I've tested which results in this error:
F1008 11:16:40.578579 5197 net.cpp:797] Check failed: target_blobs.size() == source_layer.blobs_size() (5 vs. 3) Incompatible number of blobs for layer conv1/7x7_s2/bn

borisgin · 2017-10-08T20:37:33Z

This is very old branch. Did you try the latest branch (caffe-0.16)?

…

On Sun, Oct 8, 2017 at 12:47 AM, szm2015 ***@***.***> wrote: The version of NVcaffe I'm using is 0.15.14 and this <https://github.com/lim0606/caffe-googlenet-bn> is one of the models I've tested which results in this error: F1008 11:16:40.578579 5197 net.cpp:797] Check failed: target_blobs.size() == source_layer.blobs_size() (5 vs. 3) Incompatible number of blobs for layer conv1/7x7_s2/bn — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#276 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AHMWqX72tMqaH9ke6H0NygqOJX1TCx63ks5sqH4agaJpZM4K6dqw> .

szm-R · 2017-10-09T09:55:35Z

Well the first and most important reason I'm using this branch is that according to BuildCaffe.md DIGITS is currently compatible with Caffe 0.15 and as I use DIGITS it seems like I don't have any other choice (do I?). The second reason is that I have also tried to build NVcaffe 0.16.4, but it gives me this error:

/usr/include/c++/5/bits/hashtable.h(1526): error: no instance of overloaded function "std::forward" matches the argument list

drnikolaev · 2017-10-09T10:08:03Z

Hi @szm2015 - what Ubuntu and what GCC do you use? What particular command do you run to build NVCaffe?

szm-R · 2017-10-09T12:47:13Z

I'm using Ubuntu 16.04.3 LTS and gcc 5.4 and after cloning the NVcaffe I use the following commands to build it:
mkdir build && cd build
cmake ..
make

Here's the build summary:

-- * Caffe Configuration Summary *
-- General:
-- Version : 0.16.4
-- Git : v0.16.4-2-gcdb3d9a
-- System : Linux
-- C++ compiler : /usr/bin/c++
-- Release CXX flags : -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Debug CXX flags : -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized
-- Build type : Release

-- BUILD_SHARED_LIBS : ON
-- BUILD_python : ON
-- BUILD_matlab : OFF
-- BUILD_docs : ON
-- CPU_ONLY : OFF
-- USE_OPENCV : ON
-- USE_LEVELDB : ON
-- USE_LMDB : ON
-- ALLOW_LMDB_NOLOCK : OFF
-- TEST_FP16 : OFF

-- Dependencies:
-- BLAS : Yes (Atlas)
-- Boost : Yes (ver. 1.58)
-- glog : Yes
-- gflags : Yes
-- protobuf : Yes (ver. 3.4.0)
-- lmdb : Yes (ver. 0.9.17)
-- LevelDB : Yes (ver. 1.18)
-- Snappy : Yes (ver. 1.1.3)
-- OpenCV : Yes (ver. 3.2.0)
-- CUDA : Yes (ver. 8.0)

-- NVIDIA CUDA:
-- Target GPU(s) : Auto
-- GPU arch(s) : sm_50
-- cuDNN : Yes (ver. 6.0)
-- NCCL : Not found
-- NVML : /usr/lib/nvidia-375/libnvidia-ml.so

-- Python:
-- Interpreter : /usr/bin/python2.7 (ver. 2.7.12)
-- Libraries : /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
-- NumPy : /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.11.0)

-- Documentaion:
-- Doxygen : No
-- config_file :

-- Install:
-- Install path : /home/szm/Work/Caffe/nv-caffe/build/install

-- Configuring done
-- Generating done
-- Build files have been written to: /home/szm/Work/Caffe/nv-caffe/build

cliffwoolley · 2017-10-09T14:14:39Z

DIGITS 5 and 6 do work with NVCaffe 0.16, FWIW. Sounds like that document needs updating. On Oct 9, 2017 5:47 AM, "szm2015" <[email protected]> wrote: I'm using Ubuntu 16.04.3 LTS and gcc 5.4 and after cloning the NVcaffe I use the following commands to build it: mkdir build && cd build cmake .. make Here's the build summary: -- ******************* Caffe Configuration Summary ******************* -- General: -- Version : 0.16.4 -- Git : v0.16.4-2-gcdb3d9a -- System : Linux -- C++ compiler : /usr/bin/c++ -- Release CXX flags : -O3 -DNDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized -- Debug CXX flags : -g -DDEBUG -fPIC -Wall -std=c++11 -Wno-sign-compare -Wno-uninitialized -- Build type : Release -- BUILD_SHARED_LIBS : ON -- BUILD_python : ON -- BUILD_matlab : OFF -- BUILD_docs : ON -- CPU_ONLY : OFF -- USE_OPENCV : ON -- USE_LEVELDB : ON -- USE_LMDB : ON -- ALLOW_LMDB_NOLOCK : OFF -- TEST_FP16 : OFF -- Dependencies: -- BLAS : Yes (Atlas) -- Boost : Yes (ver. 1.58) -- glog : Yes -- gflags : Yes -- protobuf : Yes (ver. 3.4.0) -- lmdb : Yes (ver. 0.9.17) -- LevelDB : Yes (ver. 1.18) -- Snappy : Yes (ver. 1.1.3) -- OpenCV : Yes (ver. 3.2.0) -- CUDA : Yes (ver. 8.0) -- NVIDIA CUDA: -- Target GPU(s) : Auto -- GPU arch(s) : sm_50 -- cuDNN : Yes (ver. 6.0) -- NCCL : Not found -- NVML : /usr/lib/nvidia-375/libnvidia-ml.so -- Python: -- Interpreter : /usr/bin/python2.7 (ver. 2.7.12) -- Libraries : /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12) -- NumPy : /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.11.0) -- Documentaion: -- Doxygen : No -- config_file : -- Install: -- Install path : /home/szm/Work/Caffe/nv-caffe/build/install

…

-- Configuring done -- Generating done -- Build files have been written to: /home/szm/Work/Caffe/nv-caffe/build — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#276 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AJO93lTDGIgOlVf4_KQPpS1QN6ldAy42ks5sqhXUgaJpZM4K6dqw> .

szm-R · 2017-10-09T18:28:33Z

If it's as @cliffwoolley says, then I will be really grateful if someone helps me with the problem with NVcaffe 16, Because apart from batch norm layer problem, it's been a couple of days that I have been unable to train an object detection model with DIGITS, the error is the same as the one mentioned in BVLC#1833. arthurlobo who has asked the question seems to have been able to resolve it by switching to NVcaffe 16.4. The thing is that I had to reinstall my whole OS recently and so I had to install everything from scratch, before that everything worked fine with the same versions of NVcaffe and probably DIGITS (I'm sure that it was version 6 but I'm not sure about the exact version).

drnikolaev · 2017-10-10T04:23:06Z

Hi @szm2015
This error: /usr/include/c++/5/bits/hashtable.h(1526): error: no instance of overloaded function "std::forward" matches the argument list - could you paste it along with the rest including the make invocation? Also, may I ask you to try CUDA 9?

szm-R · 2017-10-10T05:11:28Z

Hi @drnikolaev , you mean the whole error? I'm not sure what you mean by the "make invocation". This is the complete error list that stops make:

/usr/include/c++/5/bits/hashtable.h(1526): error: no instance of overloaded function "std::forward" matches the argument list
argument types are: (int)
detected during:
instantiation of "std::pair<std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::iterator, __nv_bool> std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::_M_emplace(std::true_type, _Args &&) [with _Key=int, _Value=std::pair<const int, boost::shared_ptrcaffe::GPUMemory::Workspace>, _Alloc=std::allocator<std::pair<const int, boost::shared_ptrcaffe::GPUMemory::Workspace>>, _ExtractKey=std::__detail::_Select1st, _Equal=std::equal_to, _H1=std::hash, _H2=std::__detail::_Mod_range_hashing, _Hash=std::__detail::_Default_ranged_hash, _RehashPolicy=std::__detail::_Prime_rehash_policy, _Traits=std::__umap_traits, _Args=<int &, boost::shared_ptrcaffe::GPUMemory::Workspace>]"
(726): here
instantiation of "std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::__ireturn_type std::_Hashtable<_Key, _Value, _Alloc, _ExtractKey, _Equal, _H1, _H2, _Hash, _RehashPolicy, _Traits>::emplace(_Args &&...) [with _Key=int, _Value=std::pair<const int, boost::shared_ptrcaffe::GPUMemory::Workspace>, _Alloc=std::allocator<std::pair<const int, boost::shared_ptrcaffe::GPUMemory::Workspace>>, _ExtractKey=std::__detail::_Select1st, _Equal=std::equal_to, _H1=std::hash, _H2=std::__detail::_Mod_range_hashing, _Hash=std::__detail::_Default_ranged_hash, _RehashPolicy=std::__detail::_Prime_rehash_policy, _Traits=std::__umap_traits, _Args=<int &, boost::shared_ptrcaffe::GPUMemory::Workspace>]"
/usr/include/c++/5/bits/unordered_map.h(380): here
instantiation of "std::pair<std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::iterator, __nv_bool> std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::emplace(_Args &&...) [with _Key=int, _Tp=boost::shared_ptrcaffe::GPUMemory::Workspace, _Hash=std::hash, _Pred=std::equal_to, _Alloc=std::allocator<std::pair<const int, boost::shared_ptrcaffe::GPUMemory::Workspace>>, _Args=<int &, boost::shared_ptrcaffe::GPUMemory::Workspace>]"
/home/szm/Work/Caffe/nv-caffe/include/caffe/layers/cudnn_conv_layer.hpp(35): here
instantiation of "T &caffe::map_ptr(int, caffe::PtrMap &, caffe::MutexVec &) [with T=caffe::GPUMemory::Workspace]"
/home/szm/Work/Caffe/nv-caffe/src/caffe/layers/cudnn_conv_layer.cu(15): here
instantiation of "void caffe::CuDNNConvolutionLayer<Ftype, Btype>::Forward_gpu(const std::vector<caffe::Blob *, std::allocator<caffe::Blob *>> &, const std::vector<caffe::Blob *, std::allocator<caffe::Blob *>> &) [with Ftype=float, Btype=float]"
/home/szm/Work/Caffe/nv-caffe/src/caffe/layers/cudnn_conv_layer.cu(207): here

1 error detected in the compilation of "/tmp/tmpxft_0000400d_00000000-7_cudnn_conv_layer.cpp1.ii".
CMake Error at cuda_compile_generated_cudnn_conv_layer.cu.o.cmake:262 (message):
Error generating file
/home/szm/Work/Caffe/nv-caffe/build/src/caffe/CMakeFiles/cuda_compile.dir/layers/./cuda_compile_generated_cudnn_conv_layer.cu.o

src/caffe/CMakeFiles/caffe.dir/build.make:147: recipe for target 'src/caffe/CMakeFiles/cuda_compile.dir/layers/cuda_compile_generated_cudnn_conv_layer.cu.o' failed
make[2]: *** [src/caffe/CMakeFiles/cuda_compile.dir/layers/cuda_compile_generated_cudnn_conv_layer.cu.o] Error 1
CMakeFiles/Makefile2:272: recipe for target 'src/caffe/CMakeFiles/caffe.dir/all' failed
make[1]: *** [src/caffe/CMakeFiles/caffe.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

About Cuda 9, I will try it as soon as I can and report the result.

szm-R · 2017-10-10T05:39:54Z

Hi everyone,
As the error with building NVcaffe 16 was with cuDNN so I thought that I might try a CPU only version to see how that works. This CPU only version built without an obvious error, but I still can't train an object detection model with it. The error has changed a bit but it's still something about the clustering layer. Here are the last lines of caffe_output.log:

I1010 09:00:52.221909 5509 layer_factory.hpp:136] Creating layer 'cluster' of type 'Python'
I1010 09:00:52.221911 5509 layer_factory.hpp:148] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
I1010 09:00:52.221922 5509 layer_factory.cpp:325] Importing Python module 'caffe.layers.detectnet.clustering'
*** Aborted at 1507613452 (unix time) try "date -d @1507613452" if you are using GNU date ***
PC: @ 0x7fde31aaf873 std::_Hashtable<>::clear()
*** SIGSEGV (@0x9) received by PID 5509 (TID 0x7fde87f2f7c0) from PID 9; stack trace: ***
@ 0x7fde85ae24b0 (unknown)
@ 0x7fde31aaf873 std::_Hashtable<>::clear()
@ 0x7fde31aa1346 google::protobuf::DescriptorPool::FindFileByName()
@ 0x7fde31a7fac8 google::protobuf::python::cdescriptor_pool::AddSerializedFile()
@ 0x7fde8671a7d0 PyEval_EvalFrameEx
@ 0x7fde8684301c PyEval_EvalCodeEx
@ 0x7fde867993dd (unknown)
@ 0x7fde8676c1e3 PyObject_Call
@ 0x7fde8678cae5 (unknown)
@ 0x7fde86723123 (unknown)
@ 0x7fde8676c1e3 PyObject_Call
@ 0x7fde8671713c PyEval_EvalFrameEx
@ 0x7fde8684301c PyEval_EvalCodeEx
@ 0x7fde86711b89 PyEval_EvalCode
@ 0x7fde867a61b4 PyImport_ExecCodeModuleEx
@ 0x7fde867a6b8f (unknown)
@ 0x7fde867a8300 (unknown)
@ 0x7fde867a85c8 (unknown)
@ 0x7fde867a96db PyImport_ImportModuleLevel
@ 0x7fde86720698 (unknown)
@ 0x7fde8676c1e3 PyObject_Call
@ 0x7fde86842447 PyEval_CallObjectWithKeywords
@ 0x7fde867155c6 PyEval_EvalFrameEx
@ 0x7fde8684301c PyEval_EvalCodeEx
@ 0x7fde86711b89 PyEval_EvalCode
@ 0x7fde867a61b4 PyImport_ExecCodeModuleEx
@ 0x7fde867a6b8f (unknown)
@ 0x7fde867a8300 (unknown)
@ 0x7fde867a85c8 (unknown)
@ 0x7fde867a96db PyImport_ImportModuleLevel
@ 0x7fde86720698 (unknown)
@ 0x7fde8676c1e3 PyObject_Call

drnikolaev · 2017-10-10T07:37:45Z

Let's step back to GPU for a moment... :) I see the problem. Fix is coming soon.

drnikolaev · 2017-10-10T08:42:39Z

@szm2015
This is a branch to verify the fix: https://github.com/drnikolaev/caffe/tree/caffe-0.16

szm-R · 2017-10-13T18:25:47Z

Hi @drnikolaev
I tried to build the Caffe in the link you mentioned but it gives me this error (along with so many others!):

/home/szm/Work/Caffe/nv-caffe_0.16.4_testVersion/src/caffe/layers/cudnn_conv_layer.cpp: In member function ‘void caffe::CuDNNConvolutionLayer<Ftype, Btype>::FindExConvAlgo(const std::vectorcaffe::Blob*&, const std::vectorcaffe::Blob*&)’:
/home/szm/Work/Caffe/nv-caffe_0.16.4_testVersion/src/caffe/layers/cudnn_conv_layer.cpp:870:29: error: ‘i’ was not declared in this scope
<< " " << fwd_algo_[i]
^
/home/szm/Work/Caffe/nv-caffe_0.16.4_testVersion/src/caffe/layers/cudnn_conv_layer.cpp: At global scope:
/home/szm/Work/Caffe/nv-caffe_0.16.4_testVersion/src/caffe/layers/cudnn_conv_layer.cpp:906:27: error: expected initializer before ‘<’ token
bool CuDNNConvolutionLayer<Ftype, Btype>::IsBottomDescChanged(

As a side point, I was at last able to get NVcaffe_0.15.14 work with DIGITS (the mysterious error regarding the clustering layer is gone). What I did was uninstalling everything and reinstalling from scratch. But I still need to get NVcaffe_0.16.4 to work so that I will be able to use BVCLcaffe trained BN models in DIGITS as well.

drnikolaev · 2017-10-13T19:24:56Z

It's already fixed, please pull again

szm-R · 2017-10-14T18:06:17Z

I did and was able to build it successfully but I have trouble getting it to work. When I try to deploy a model with BVLCcaffe type BN layers (via a cpp code I have written and use for deploying Caffe models) the program crashes when trying to load the model, more specifically at this line:
CNet.reset(new caffe::Net(deploy_file, caffe::TEST));

whereas the same model can run without a problem using BVLCcaffe (the link to the model I'm testing)

I tested the same code with a bvlc-googlenet and it works just fine.

drnikolaev · 2017-10-14T18:29:12Z

Hi @szm2015 , crash stack might help here but before getting there please consider adjusting this test to your net:

python/caffe/test/test_classification.py

szm-R · 2017-10-15T05:42:44Z

HI @drnikolaev , I tested the code via the following command:

python test_classification.py /home/gpuserver/Moosavi/Test/Models/googlenet_bn_stepsize_6400_iter_1200000/googlenet_bn_stepsize_6400_iter_1200000.caffemodel /home/gpuserver/Moosavi/Test/Models/googlenet_bn_stepsize_6400_iter_1200000/deploy.prototxt /home/gpuserver/Moosavi/Test/Input/ImageNet_val/FullPathFileList.txt --mean_file /home/gpuserver/Moosavi/Test/Models/googlenet_bn_stepsize_6400_iter_1200000/mean.binaryproto --labels_file /home/gpuserver/Moosavi/Test/Input/ImageNet_val/Labels2015.txt use_gpu

but nothing happens it just gives the prompt back after merely a second. I don't know exactly how may I get the stack trace but I think it's something like this, the list of functions called before the crash (when debugging with Qt):
1 std::map<int, cub::CachingDeviceAllocator::TotalBytes>::operator[](int const&) 0x7ffff7379864
2 caffe::GPUMemory::Manager::GetInfo(unsigned long *, unsigned long *, bool) 0x7ffff7376116
3 caffe::GPUMemory::Workspace::safe_reserve(unsigned long, int) 0x7ffff7377275
4 caffe::CuDNNConvolutionLayer<float, float>::AllocateWorkspace(unsigned long) 0x7ffff754aa80
5 caffe::CuDNNConvolutionLayer<float, float>::Reshape(std::vector<caffe::Blob *> const&, std::vector<caffe::Blob *> const&) 0x7ffff755c430
6 caffe::Net::Init(caffe::NetParameter const&) 0x7ffff761bf81
7 caffe::Net::Net(std::string const&, caffe::Phase, unsigned long, caffe::Flag *, caffe::Flag *, caffe::Net const *) 0x7ffff761e2f6
8 CaffeClassifier::LoadModel CaffeClassifier.cpp 27 0x41668c
9 CaffeClassifier::CaffeClassifier CaffeClassifier.cpp 16 0x416565
10 main main.cpp 9 0x4113e2

drnikolaev · 2017-10-16T00:49:26Z

@szm2015
Please add this code to the very beginning of your main.cpp

vector<int> gpus(number_of_your_gpus);
gpus[0] = 0;
gpus[1] = 1;
//etc. for each gpu...

caffe::GPUMemory::Scope gpu_memory_scope(gpus);

As of the python script - you need to adjust it to your model.

szm-R · 2017-10-16T06:07:36Z

I did and the model is now running without a problem. Thank you!

…

On Mon, Oct 16, 2017 at 4:19 AM, Sergei Nikolaev ***@***.***> wrote: @szm2015 <https://github.com/szm2015> Please add this code to the very beginning of your main.cpp vector<int> gpus(number_of_your_gpus); gpus[0] = 0; gpus[1] = 1; //etc. for each gpu... caffe::GPUMemory::Scope gpu_memory_scope(gpus); As of the python script - you need to adjust it to your model. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#276 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/APaJX6fHyVshqBpKNT5B9cPfqw0H9ftyks5ssqgZgaJpZM4K6dqw> .

szm-R · 2017-10-16T10:04:12Z

Well, I tested a little more and there still seems to be two problems. The code runs without crashing but the forward time required to process a 244x244 image is twice the one for the same model and the same image size running on BVLCcaffe. The second and most important problem is that the results are always the same for all of the images, the same top-5 predictions, whereas the one running on BVLCcaffe achieves about 91% top-5 accuracy.

drnikolaev · 2017-10-16T17:16:46Z

May I have your code to try?

szm-R · 2017-10-18T07:59:09Z

Hi everyone, sorry for the delay! I attached a modified version of my code (the original one is part of a bigger project written in Qt and has some irrelevant parts).
CaffeClassifcation.txt

I build this code using Caffe makefile (by copying the code to tool directory). It's executed by the following command:
./CaffeClassifcation /path/to/googlenet_bn_stepsize_6400_iter_1200000/deploy.prototxt /path/to/googlenet_bn_stepsize_6400_iter_1200000/googlenet_bn_stepsize_6400_iter_1200000.caffemodel Pixel /path/to/googlenet_bn_stepsize_6400_iter_1200000/mean.binaryproto /path/to/ImageNet_val/ FileList.txt Labels2015.txt

(I know it's very messy!!! but I just wanted to organize a code for testing!)
It's essentially the cpp_classification provided in Caffe with little changes. I tested the code on bvlc_googlenet and it gives me the correct accuracy, but on the googlenet_bn network, similar top 5 classes are predicted for all images, whereas the same model gives a top-1 accuracy of about 72% using BVLCcaffe.

The timing differences I mentioned in my previous comment is not specific to this version of NVcaffe. As a recurring pattern, I've seen that models run faster on BVLCcaffe but consume more GPU memory than the ones running on NVcaffe.

drnikolaev · 2017-10-18T08:13:12Z

but on the googlenet_bn network, similar top 5 classes are predicted for all images, whereas the same model gives a top-1 accuracy of about 72% using BVLCcaffe.

I'm not sure I follow this. Could you paste output of both runs (bvlc vs nv)?

szm-R · 2017-10-18T08:59:25Z

Well, this is the BVLCcaffe version of the code (there are some minor differences like declaring a template type for caffe::Net and things like that):
CaffeClassifcation.txt

Using the BVLCcaffe code, the output of running googlenet_bn model is this (for 30 first images):
CaffeClassification_BVLC.txt

But using the NVcaffe version gives this output (again for the first 30 images):
CaffeClassification_NV.txt

drnikolaev · 2018-01-01T09:40:43Z

Please check v0.16.5 and reopen the issue if the problem still exists.

RSly mentioned this issue Feb 6, 2017

What is the correct BatchNorm layer for digits? NVIDIA/DIGITS#1433

Closed

d-lareg mentioned this issue Apr 25, 2017

Cannot copy Batchnorm weights, shape mismatch. #307

Closed

flx42 mentioned this issue May 8, 2017

Problem when using model from DIGITS NVIDIA/gpu-rest-engine#11

Closed

drnikolaev closed this as completed May 16, 2017

drnikolaev mentioned this issue Jun 15, 2017

caffe-0.16 converges slower and produces lower accuracy (compared to caffe-0.15) #347

Closed

drnikolaev reopened this Oct 16, 2017

drnikolaev closed this as completed Jan 1, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incompatibilities in BatchNorm. #276

Incompatibilities in BatchNorm. #276

mathmanu commented Nov 23, 2016 •

edited by lukeyeager

Loading

RSly commented Feb 6, 2017

lukeyeager commented Feb 6, 2017

borisgin commented Feb 6, 2017 •

edited by lukeyeager

Loading

RSly commented Feb 7, 2017 •

edited

Loading

mathmanu commented Mar 24, 2017 •

edited

Loading

borisgin commented Mar 25, 2017

fslzj commented Apr 25, 2017

szm-R commented Oct 7, 2017

borisgin commented Oct 7, 2017

szm-R commented Oct 8, 2017

borisgin commented Oct 8, 2017 via email

szm-R commented Oct 9, 2017

drnikolaev commented Oct 9, 2017

szm-R commented Oct 9, 2017

cliffwoolley commented Oct 9, 2017 via email

szm-R commented Oct 9, 2017

drnikolaev commented Oct 10, 2017

szm-R commented Oct 10, 2017

szm-R commented Oct 10, 2017

drnikolaev commented Oct 10, 2017

drnikolaev commented Oct 10, 2017

szm-R commented Oct 13, 2017

drnikolaev commented Oct 13, 2017

szm-R commented Oct 14, 2017

drnikolaev commented Oct 14, 2017

szm-R commented Oct 15, 2017

drnikolaev commented Oct 16, 2017

szm-R commented Oct 16, 2017 via email

szm-R commented Oct 16, 2017

drnikolaev commented Oct 16, 2017

szm-R commented Oct 18, 2017

drnikolaev commented Oct 18, 2017

szm-R commented Oct 18, 2017

drnikolaev commented Jan 1, 2018

Incompatibilities in BatchNorm. #276

Incompatibilities in BatchNorm. #276

Comments

mathmanu commented Nov 23, 2016 • edited by lukeyeager Loading

RSly commented Feb 6, 2017

lukeyeager commented Feb 6, 2017

borisgin commented Feb 6, 2017 • edited by lukeyeager Loading

RSly commented Feb 7, 2017 • edited Loading

mathmanu commented Mar 24, 2017 • edited Loading

borisgin commented Mar 25, 2017

fslzj commented Apr 25, 2017

szm-R commented Oct 7, 2017

borisgin commented Oct 7, 2017

szm-R commented Oct 8, 2017

borisgin commented Oct 8, 2017 via email

szm-R commented Oct 9, 2017

drnikolaev commented Oct 9, 2017

szm-R commented Oct 9, 2017

-- BUILD_SHARED_LIBS : ON -- BUILD_python : ON -- BUILD_matlab : OFF -- BUILD_docs : ON -- CPU_ONLY : OFF -- USE_OPENCV : ON -- USE_LEVELDB : ON -- USE_LMDB : ON -- ALLOW_LMDB_NOLOCK : OFF -- TEST_FP16 : OFF

-- Dependencies: -- BLAS : Yes (Atlas) -- Boost : Yes (ver. 1.58) -- glog : Yes -- gflags : Yes -- protobuf : Yes (ver. 3.4.0) -- lmdb : Yes (ver. 0.9.17) -- LevelDB : Yes (ver. 1.18) -- Snappy : Yes (ver. 1.1.3) -- OpenCV : Yes (ver. 3.2.0) -- CUDA : Yes (ver. 8.0)

-- NVIDIA CUDA: -- Target GPU(s) : Auto -- GPU arch(s) : sm_50 -- cuDNN : Yes (ver. 6.0) -- NCCL : Not found -- NVML : /usr/lib/nvidia-375/libnvidia-ml.so

-- Python: -- Interpreter : /usr/bin/python2.7 (ver. 2.7.12) -- Libraries : /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12) -- NumPy : /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.11.0)

-- Documentaion: -- Doxygen : No -- config_file :

-- Install: -- Install path : /home/szm/Work/Caffe/nv-caffe/build/install

cliffwoolley commented Oct 9, 2017 via email

szm-R commented Oct 9, 2017

drnikolaev commented Oct 10, 2017

szm-R commented Oct 10, 2017

szm-R commented Oct 10, 2017

drnikolaev commented Oct 10, 2017

drnikolaev commented Oct 10, 2017

szm-R commented Oct 13, 2017

drnikolaev commented Oct 13, 2017

szm-R commented Oct 14, 2017

drnikolaev commented Oct 14, 2017

szm-R commented Oct 15, 2017

drnikolaev commented Oct 16, 2017

szm-R commented Oct 16, 2017 via email

szm-R commented Oct 16, 2017

drnikolaev commented Oct 16, 2017

szm-R commented Oct 18, 2017

drnikolaev commented Oct 18, 2017

szm-R commented Oct 18, 2017

drnikolaev commented Jan 1, 2018

mathmanu commented Nov 23, 2016 •

edited by lukeyeager

Loading

borisgin commented Feb 6, 2017 •

edited by lukeyeager

Loading

RSly commented Feb 7, 2017 •

edited

Loading

mathmanu commented Mar 24, 2017 •

edited

Loading

-- BUILD_SHARED_LIBS : ON
-- BUILD_python : ON
-- BUILD_matlab : OFF
-- BUILD_docs : ON
-- CPU_ONLY : OFF
-- USE_OPENCV : ON
-- USE_LEVELDB : ON
-- USE_LMDB : ON
-- ALLOW_LMDB_NOLOCK : OFF
-- TEST_FP16 : OFF

-- Dependencies:
-- BLAS : Yes (Atlas)
-- Boost : Yes (ver. 1.58)
-- glog : Yes
-- gflags : Yes
-- protobuf : Yes (ver. 3.4.0)
-- lmdb : Yes (ver. 0.9.17)
-- LevelDB : Yes (ver. 1.18)
-- Snappy : Yes (ver. 1.1.3)
-- OpenCV : Yes (ver. 3.2.0)
-- CUDA : Yes (ver. 8.0)

-- NVIDIA CUDA:
-- Target GPU(s) : Auto
-- GPU arch(s) : sm_50
-- cuDNN : Yes (ver. 6.0)
-- NCCL : Not found
-- NVML : /usr/lib/nvidia-375/libnvidia-ml.so

-- Python:
-- Interpreter : /usr/bin/python2.7 (ver. 2.7.12)
-- Libraries : /usr/lib/x86_64-linux-gnu/libpython2.7.so (ver 2.7.12)
-- NumPy : /usr/lib/python2.7/dist-packages/numpy/core/include (ver 1.11.0)

-- Documentaion:
-- Doxygen : No
-- config_file :

-- Install:
-- Install path : /home/szm/Work/Caffe/nv-caffe/build/install