Remove Intel MKL dependency #16
Comments
Thanks for your work on this! I have not had a chance to look at this in detail yet, but I can say this is not redundant with current efforts. I'll check back when I've had a closer look, but I look forward to seeing this as a pull request once it's polished. |
Intel MKL cannot be used on some kinds of Linux. |
In src/caffe/util/math_functions.cpp line 289
No, the boost:: and std::uniform_real interval is [a, b), while Intel MKL's is [a, b]. Besides, boost::uniform_real is deprecated in favor of uniform_real_distribution. How about this workaround:
using boost::variate_generator;
using boost::mt19937;
using boost::random::uniform_real_distribution;
Caffe::random_generator_t &generator = Caffe::vsl_stream();
Dtype epsilon = 1e-5;  // or 1e-4, 1e-6; different values may cause some tests to fail or pass
variate_generator<mt19937, uniform_real_distribution<Dtype> > rng(
    generator, uniform_real_distribution<Dtype>(a, b + epsilon));
do {
  r[i] = rng();
} while (r[i] > b); |
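For reference, a self-contained sketch of that workaround, assuming a boost::mt19937 engine (the rng_uniform_closed name and free-standing signature are illustrative, not Caffe's actual API):

// Fill r[0..n) with samples from the closed interval [a, b] using boost::random,
// mimicking MKL's inclusive upper bound.
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>
#include <boost/random/variate_generator.hpp>

template <typename Dtype>
void rng_uniform_closed(const int n, const Dtype a, const Dtype b,
                        Dtype* r, boost::mt19937& gen) {
  // Sample from [a, b + epsilon) and reject anything above b, which leaves
  // samples drawn from the closed interval [a, b].
  const Dtype epsilon = static_cast<Dtype>(1e-5);
  boost::variate_generator<boost::mt19937&,
      boost::random::uniform_real_distribution<Dtype> > rng(
          gen, boost::random::uniform_real_distribution<Dtype>(a, b + epsilon));
  for (int i = 0; i < n; ++i) {
    do {
      r[i] = rng();
    } while (r[i] > b);
  }
}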
Great to see this moving, and glad that you found/understood the source of http://stackoverflow.com/questions/16224446/stduniform-real-distribution-inclusive-range. I am not a git guru (I am more of a hg guy); in which branch is this being developed?
|
This is good progress. Thanks for the commit @rodrigob and the debugging @kloudkl! Let's develop this port in the boost-eigen branch I have just pushed. I have included the initial commit by @rodrigob. To continue development, please make commits in your fork, then pull request to this branch. I will review and merge the requests. Please rebase any work on the latest bvlc/caffe boost-eigen before requesting a pull; I'd rather keep the history clean from merge noise. |
Is the plan to completely get rid of MKL? |
You can change the makefile include and library paths to make it work.
|
Please note that on Debian systems, selecting the BLAS implementation is done via the alternatives system (update-alternatives). Such a decision is certainly not meant to be made during the runtime of an application. |
The ideal case for integration is that the performance of the MKL and boost-eigen implementations is comparable and boost-eigen is made the default. If the MKL vs. boost/eigen differences can be insulated cleanly enough, it would be nice to offer both by a build switch. We need benchmarking to move forward, and comparisons by anyone with both MKL and boost/eigen would be welcome. @Yangqing @jeffdonahue should comparing train/test of the imagenet model do it, or are there more comparisons to be done? |
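If the backends do get insulated behind a build switch, a hypothetical sketch of what that could look like for a single wrapper (the USE_MKL macro and the caffe_vadd name are assumptions for illustration, not Caffe's actual symbols):

// Compile-time backend selection: MKL's VML call on one side, an Eigen
// expression on the other. Callers only ever see caffe_vadd.
#ifdef USE_MKL
#include <mkl.h>

inline void caffe_vadd(const int n, const float* a, const float* b, float* y) {
  vsAdd(n, a, b, y);  // MKL vector math: y[i] = a[i] + b[i]
}
#else
#include <Eigen/Core>

inline void caffe_vadd(const int n, const float* a, const float* b, float* y) {
  // Eigen::Map wraps the raw pointers without copying.
  Eigen::Map<const Eigen::VectorXf> va(a, n), vb(b, n);
  Eigen::Map<Eigen::VectorXf> vy(y, n);
  vy = va + vb;
}
#endif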
CPU is too slow to train on such a large dataset as ImageNet. The most likely use case is to first train on a GPU and then deploy the model on devices without a GPU. Besides benchmarking the runtime of a complete pipeline, microbenchmarking the math functions and profiling to find the hotspots in the code are also helpful. |
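On microbenchmarking, a minimal std::chrono-based sketch of how a single routine could be timed in isolation (the hand-written axpy loop below is just a stand-in for whichever BLAS-backed function is being measured):

// Time many repetitions of one math kernel and report the mean per call.
#include <chrono>
#include <iostream>
#include <vector>

int main() {
  const int n = 1 << 20;
  const int repeats = 100;
  std::vector<float> x(n, 1.0f), y(n, 2.0f);
  const float alpha = 0.5f;

  const auto start = std::chrono::steady_clock::now();
  for (int rep = 0; rep < repeats; ++rep) {
    for (int i = 0; i < n; ++i) {
      y[i] += alpha * x[i];  // axpy-style kernel under test
    }
  }
  const auto end = std::chrono::steady_clock::now();
  const double ms = std::chrono::duration<double, std::milli>(end - start).count();
  std::cout << "mean time per call: " << ms / repeats << " ms" << std::endl;
  return 0;
}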
Agreed, real training of ImageNet or any contemporary architecture and dataset is infeasible on CPU; sorry my suggestion was not more precise. I think benchmarking training minibatches or epochs is still indicative of performance. I second microbenchmarking too, as a further detail. If the speed of the full pipeline is close enough, that suffices. |
I have just benchmarked on the MNIST dataset using the heads of both the boost-eigen branch and master. The three experiments used CPU mode with boost-eigen, CPU mode with MKL, and GPU mode respectively. The CPU is an Intel® Core™ i7-3770 @ 3.40GHz × 8 and the GPU is an NVIDIA GTX 560 Ti. Note that the CPU code under-utilized the available cores, using only a single thread.
After training for 10000 iterations, the final learning rate, training loss, testing accuracy (Test score 0), and testing loss (Test score 1) of boost-eigen and MKL were all exactly the same. The training time with boost-eigen was 26m25.259s and with MKL 26m43.919s; considering the fluctuations in data IO costs, there was no significant performance difference. The results were a little surprising, so you may want to double check them on your own machine. The GTX 560 Ti took 85.5% less time than the faster CPU mode (boost-eigen) to train a slightly better model in terms of training loss, testing accuracy, and testing loss, with the training runs also including the testing iterations.
This benchmark demonstrates that there is no need to keep depending on a proprietary library that brings no benefit but excess code and a redundant maintenance burden. It is time to merge this branch directly into master.
[Benchmark logs for CPU boost-eigen, CPU MKL, and GPU were attached here; the data is not preserved in this copy.]
|
It would be good to have a benchmark with larger networks such as ImageNet. Yangqing
|
Would it help to replace some of the code with parallel for loops? Eigen does not exploit the several cores present in most workstations except for matrix-matrix multiplication. For example, the ReLU layer (or any simple activation function) does an independent operation for every neuron; it can be made fast using #pragma omp parallel for. |
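A minimal sketch of that suggestion for an element-wise ReLU forward pass (illustrative only, not Caffe's actual layer code; it assumes the build adds -fopenmp):

#include <algorithm>

template <typename Dtype>
void relu_forward_cpu(const int count, const Dtype* bottom, Dtype* top) {
  // Each element is independent, so the loop parallelizes trivially.
  #pragma omp parallel for
  for (int i = 0; i < count; ++i) {
    top[i] = std::max(bottom[i], Dtype(0));
  }
}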
@aravindhm, I had the same idea just after observing that the training on CPU is single-threaded, and experimented with parallelizing via OpenMP. But the test accuracy turned out to stay at the random-guess level. Then I realized that there was a conflict between OpenMP and BLAS, and the correct solution is to take advantage of a multi-threaded BLAS library such as OpenBLAS. See my reference from #79 above. |
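For anyone reproducing this, OpenBLAS's thread count can be controlled with the OPENBLAS_NUM_THREADS environment variable or at runtime; a minimal sketch using the OpenBLAS-specific call:

// openblas_set_num_threads is provided by OpenBLAS; declared directly here to
// avoid depending on a particular cblas.h variant.
extern "C" void openblas_set_num_threads(int num_threads);

int main() {
  openblas_set_num_threads(8);  // e.g. use all 8 hardware threads of the i7-3770
  // ... run the CPU training / benchmark here ...
  return 0;
}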
The updated benchmark exploiting multi-threaded OpenBLAS showed a great speed-up: training on a multi-core CPU can be as fast as, or even faster than, training on a GPU. Now it becomes more realistic to benchmark with a larger-scale dataset.
[Updated benchmark logs for CPU boost-eigen and CPU MKL were attached here; the data is not preserved in this copy.]
|
A proposal has been made at #97 - please kindly discuss there. Closing this to reduce duplicates. |
update from upstream
It is mentioned in the install instructions that this is a work in progress.
While at ICCV I quickly implemented a branch where I replaced the matrix operations with Eigen3 calls and the random generators with Boost::random generators.
I hope this is not redundant with ongoing work on private branches.
The branch can be found at
https://github.com/rodrigob/caffe
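As a rough illustration of what replacing a BLAS/MKL matrix multiplication with Eigen3 can look like (the eigen_sgemm name, the row-major layout, and the no-transpose restriction are assumptions made for this sketch, not the actual code in the branch):

#include <Eigen/Core>

// Computes C = alpha * A * B + beta * C for row-major, non-transposed inputs,
// analogous to cblas_sgemm with CblasNoTrans for both operands.
inline void eigen_sgemm(const int M, const int N, const int K,
                        const float alpha, const float* A, const float* B,
                        const float beta, float* C) {
  typedef Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>
      MatXfRow;
  Eigen::Map<const MatXfRow> mA(A, M, K), mB(B, K, N);
  Eigen::Map<MatXfRow> mC(C, M, N);
  mC = alpha * (mA * mB) + beta * mC;  // the product is evaluated into a temporary
}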
I got things to compile; however, I noticed that some tests fail (thanks for creating a non-trivial set of unit tests!).
I have not been able to compile a version with MKL to compare against, but I can only assume that the tests should not fail.
The current failures are:
[ FAILED ] FlattenLayerTest/1.TestCPUGradient, where TypeParam = double
[ FAILED ] StochasticPoolingLayerTest/0.TestGradientGPU, where TypeParam = float
[ FAILED ] StochasticPoolingLayerTest/1.TestGradientGPU, where TypeParam = double
[ FAILED ] MultinomialLogisticLossLayerTest/1.TestGradientCPU, where TypeParam = double
which all sound nasty (gradient computation errors in neural networks are a big no-no).
I will spend some time inspecting to see what goes wrong there, but any suggestion/comment/idea is welcome.