
Remove Intel MKL dependency #16

Closed
rodrigob opened this issue Dec 20, 2013 · 17 comments

@rodrigob
Contributor

It is mentioned in the install instructions that this is work in progress.
While at ICCV I quickly implemented a branch where I replace the matrix operations with Eigen3 calls and the random generators with Boost::random generators.
I hope this is not redundant with ongoing work on private branches.

The branch can be found at
https://github.com/rodrigob/caffe

I got things to compile; however, I noticed that some tests fail (thanks for creating a non-trivial set of unit tests!).
I have not been able to compile a version with MKL to compare against, but I can only assume that the tests should not fail.

The current failures are:

[ FAILED ] FlattenLayerTest/1.TestCPUGradient, where TypeParam = double
[ FAILED ] StochasticPoolingLayerTest/0.TestGradientGPU, where TypeParam = float
[ FAILED ] StochasticPoolingLayerTest/1.TestGradientGPU, where TypeParam = double
[ FAILED ] MultinomialLogisticLossLayerTest/1.TestGradientCPU, where TypeParam = double

which all sound nasty (gradient computation errors in a neural network are a big no-no).

I will spend some time investigating what goes wrong there, but any suggestion/comment/idea is welcome.

@shelhamer
Member

Thanks for your work on this! I have not had a chance to look at this in detail yet, but I can say this is not redundant with current efforts. I'll check back when I've had a closer look, but I look forward to seeing this as a pull request once it's polished.

@lifeiteng

Intel MKL cannot be used on some kinds of Linux.
Looking forward to more work on this.

@kloudkl
Contributor

kloudkl commented Jan 8, 2014

In src/caffe/util/math_functions.cpp, line 289:

// FIXME check if boundaries are handled in the same way ?
boost::uniform_real random_distribution(a, b);

No, the interval of boost:: and std::uniform_real is [a, b), while Intel MKL's is [a, b]. Besides, boost::uniform_real is deprecated in favor of uniform_real_distribution. How about this workaround:

using boost::variate_generator;
using boost::mt19937;
using boost::random::uniform_real_distribution;
Caffe::random_generator_t &generator = Caffe::vsl_stream();
// Widen the upper bound a little so that b itself can be drawn.
Dtype epsilon = 1e-5;  // or 1e-4, 1e-6; different values may cause some tests to fail or pass
variate_generator<mt19937, uniform_real_distribution<Dtype> > rng(
    generator, uniform_real_distribution<Dtype>(a, b + epsilon));
// Reject any sample that lands in (b, b + epsilon].
do {
  r[i] = rng();
} while (r[i] > b);

@rodrigob
Contributor Author

rodrigob commented Jan 8, 2014

Great to see this moving, and glad that you found and understood the source of the problem.
Stack Overflow indicates that a less hacky way of fixing this is to use std::nextafter, or for better compatibility, boost::math::nextafter:

http://www.boost.org/doc/libs/1_53_0/libs/math/doc/sf_and_dist/html/math_toolkit/utils/next_float/nextafter.html

http://stackoverflow.com/questions/16224446/stduniform-real-distribution-inclusive-range
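
For illustration, here is a minimal sketch of that approach (the function name and the way the generator is passed in are just placeholders for whatever Caffe exposes, not the actual Caffe API):

#include <limits>
#include <boost/math/special_functions/next.hpp>   // boost::math::nextafter
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>
#include <boost/random/variate_generator.hpp>

// Fill r[0..n) with samples from the closed interval [a, b], matching MKL.
template <typename Dtype>
void rng_uniform_closed(const int n, const Dtype a, const Dtype b,
                        Dtype* r, boost::mt19937& generator) {
  // Nudge b up to the next representable value so that the half-open
  // [a, b') interval of uniform_real_distribution effectively includes b.
  const Dtype b_closed =
      boost::math::nextafter<Dtype>(b, std::numeric_limits<Dtype>::max());
  boost::random::uniform_real_distribution<Dtype> dist(a, b_closed);
  boost::variate_generator<boost::mt19937&,
      boost::random::uniform_real_distribution<Dtype> > rng(generator, dist);
  for (int i = 0; i < n; ++i) {
    r[i] = rng();
  }
}

This avoids both the arbitrary epsilon and the rejection loop.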

I am not a git guru (I am more of an hg guy): in which branch is @openwzdh (https://github.com/openwzdh) working?
Should I switch to that branch to try to help out, or import it into my own?

@shelhamer
Member

This is good progress. Thanks for the commit @rodrigob and debugging @kloudkl!

Let's develop this port in the boost-eigen branch I have just pushed. I have included the initial commit by @rodrigob.

To continue development, please make commits in your fork then pull request to this branch. I will review and merge the requests.

Please rebase any work on the latest bvlc/caffe boost-eigen before requesting a pull; I'd rather keep the history clean of merge noise.

@tdomhan
Contributor

tdomhan commented Jan 22, 2014

Is the plan to completely get rid of MKL?
Just as a suggestion: it would be nice to be able to switch between different BLAS libraries, e.g. a BLASFactory that spits out whatever BLAS library is available on the system.
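
To sketch what I mean (the header name, the USE_MKL macro and the function name here are purely illustrative, not anything in the tree):

// blas_switch.hpp -- names and the USE_MKL macro are illustrative only.
#ifdef USE_MKL
#include <mkl.h>      // MKL exposes the standard CBLAS interface
#else
extern "C" {
#include <cblas.h>    // ATLAS / OpenBLAS / reference CBLAS
}
#endif

// Single-precision GEMM, row-major, no transposes: C = alpha * A * B + beta * C,
// with A of size M x K, B of size K x N and C of size M x N.
inline void blas_sgemm(const int M, const int N, const int K,
                       const float alpha, const float* A, const float* B,
                       const float beta, float* C) {
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, M, N, K,
              alpha, A, K, B, N, beta, C, N);
}

The rest of the code would only ever call the wrapper, and the Makefile would decide which library actually gets linked in.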

@lifeiteng

You can change the Makefile include and library settings to make it work with a different BLAS.

@rodrigob
Contributor Author

Please note that on Debian systems, selecting the BLAS implementation is done via
sudo update-alternatives --config libblas.so

Such a decision is certainly not meant to be made at application runtime.

http://www.stat.cmu.edu/~nmv/2013/07/09/for-faster-r-use-openblas-instead-better-than-atlas-trivial-to-switch-to-on-ubuntu/

@shelhamer
Member

The ideal case for integration is that the performance of the MKL and boost-eigen implementations is comparable and boost-eigen is made the default. If the MKL vs. boost/eigen differences can be insulated cleanly enough, it would be nice to offer both via a build switch.

We need benchmarking to move forward, and comparisons by anyone with both MKL and boost/eigen would be welcome. @Yangqing @jeffdonahue: would comparing train/test of the ImageNet model do it, or are there more comparisons to be done?

@kloudkl
Contributor

kloudkl commented Jan 26, 2014

The CPU is too slow to train a dataset as large as ImageNet. The most likely use case is to train on a GPU first and then deploy the model on devices without a GPU. Besides benchmarking the runtime of a complete pipeline, microbenchmarking the math functions and profiling to find the hotspots would also be helpful.
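
For instance, a microbenchmark of a single routine can be as small as this (a rough sketch using plain CBLAS and C++11 <chrono>, nothing Caffe-specific; build it once against MKL and once against OpenBLAS/ATLAS to compare):

#include <chrono>
#include <iostream>
#include <vector>

extern "C" {
#include <cblas.h>
}

int main() {
  const int n = 1024;       // square matrix size; adjust to taste
  const int repeats = 20;
  std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);

  // One warm-up call so lazy library initialization does not skew the timing.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
              1.0f, A.data(), n, B.data(), n, 0.0f, C.data(), n);

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0f, A.data(), n, B.data(), n, 0.0f, C.data(), n);
  }
  const auto end = std::chrono::steady_clock::now();
  const double ms =
      std::chrono::duration<double, std::milli>(end - start).count() / repeats;
  std::cout << n << "x" << n << " sgemm: " << ms << " ms per call" << std::endl;
  return 0;
}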

@shelhamer
Member

Agreed, real training of ImageNet or any contemporary architecture and dataset is infeasible on the CPU; sorry my suggestion was not more precise. I think benchmarking training minibatches or epochs is still indicative of performance. I second microbenchmarking too, as a further detail. If the speed of the full pipeline is close enough, that suffices.

@kloudkl
Contributor

kloudkl commented Feb 7, 2014

I have just benchmarked on the MNIST dataset using the heads of both the boost-eigen branch and master. The three experiments used CPU mode with boost-eigen, CPU mode with MKL, and GPU mode respectively. The CPU is an Intel® Core™ i7-3770 @ 3.40GHz × 8 and the GPU is an NVIDIA GTX 560 Ti. Note that the CPU code under-utilized the available cores, using only a single thread.

After training 10000 iterations, the final learning rate, training loss, testing accuracy (Test score 0) and testing loss (Test score 1) of boost-eigen and MKL were exactly the same. The training time of boost-eigen was 26m25.259s and that of MKL was 26m43.919s. Considering the fluctuations of data IO costs, there was no significant performance difference. The results were a little surprising, so you may want to double-check them on your own machine.

On the GTX 560 Ti, training took 85.5% less time than the faster of the two CPU modes (boost-eigen) and produced a slightly better model in terms of training loss, testing accuracy and testing loss.

Because the training runs also included the testing iterations, this benchmark demonstrates that there is no need to keep depending on a proprietary library that brings no benefit, only excess code and a redundant maintenance burden. It is time to merge this branch directly into master.

cd data
time ./train_mnist.sh

CPU boost-eigen

I0207 12:54:18.161139 14107 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 12:54:18.163564 14107 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 12:54:18.166762 14107 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 12:54:18.169086 14107 solver.cpp:66] Iteration 10000, loss = 0.0033857
I0207 12:54:18.169108 14107 solver.cpp:84] Testing net
I0207 12:54:25.810292 14107 solver.cpp:111] Test score #0: 0.9909
I0207 12:54:25.810333 14107 solver.cpp:111] Test score #1: 0.0285976
I0207 12:54:25.811945 14107 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 12:54:25.815465 14107 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 12:54:25.818124 14107 solver.cpp:78] Optimization Done.
I0207 12:54:25.818137 14107 train_net.cpp:34] Optimization Done.

real    26m25.259s
user    26m26.499s
sys 0m0.724s

CPU MKL

I0207 13:34:29.381631  4691 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 13:34:29.384047  4691 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:34:29.387784  4691 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:34:29.390490  4691 solver.cpp:66] Iteration 10000, loss = 0.0033857
I0207 13:34:29.390512  4691 solver.cpp:84] Testing net
I0207 13:34:37.038708  4691 solver.cpp:111] Test score #0: 0.9909
I0207 13:34:37.038748  4691 solver.cpp:111] Test score #1: 0.0285976
I0207 13:34:37.040276  4691 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:34:37.043890  4691 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:34:37.046598  4691 solver.cpp:78] Optimization Done.
I0207 13:34:37.046612  4691 train_net.cpp:34] Optimization Done.

real    26m43.919s
user    26m45.056s
sys 0m0.768s

GPU

I0207 13:40:54.950667 24846 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 13:40:54.962781 24846 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:40:54.967131 24846 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:40:54.970029 24846 solver.cpp:66] Iteration 10000, loss = 0.00247615
I0207 13:40:54.970067 24846 solver.cpp:84] Testing net
I0207 13:40:56.242010 24846 solver.cpp:111] Test score #0: 0.991
I0207 13:40:56.242048 24846 solver.cpp:111] Test score #1: 0.0284187
I0207 13:40:56.242781 24846 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:40:56.246444 24846 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:40:56.249151 24846 solver.cpp:78] Optimization Done.
I0207 13:40:56.249166 24846 train_net.cpp:34] Optimization Done.

real    3m50.187s
user    3m3.219s
sys 0m50.039s

@Yangqing
Member

Yangqing commented Feb 7, 2014

It would be good to have a benchmark with larger networks such as ImageNet, as MNIST might be too small to show a significant difference on any platform. That being said, I believe boost-eigen gives comparable performance to MKL, and we should in general move to open-source libraries in the long run.

On Thu, Feb 6, 2014 at 10:22 PM, kloudkl [email protected] wrote:

I have just benchmarked on the MNIST dataset using both the heads of the
boost-eigen branch and the master. The three experiments used CPU mode with
boost-eigen, CPU mode with MKL and GPU mode respectively. The CPU is Intel(R)
Core(tm) i7-3770 CPU @ 3.40GHz × 8 and the GPU is NVIDIA GTX 560 Ti. But the
CPU code under-utilized the available cores using only a single thread.

After training 10000 iterations, the final learning rate, training loss,
testing accuracy (Test score 0) and testing loss (Test score 1) of
boost-eigen and MKL were all exactly the same. The training time of
boost-eigen was 26m25.259s and that of MKL was 26m43.919s. Considering the
fluctuations of data IO costs, there was actually no significant
performance difference. The results were a little surprising. So you may
want to double check it on your own machine.

On GTX 560 Ti, it took 85.5% less time than the faster CPU mode with
boost-eigen to train a slightly better model in terms of training loss,
testing accuracy and testing loss.

Because the training processes also included testing iterations, this
benchmark demonstrate that there is no need to further depend on a
proprietary library which brings no benefit but excess codes and redundant
maintenance burdens.

cd data
time ./train_mnist.sh

CPU boost-eigen

I0207 12:54:18.161139 14107 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 12:54:18.163564 14107 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 12:54:18.166762 14107 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 12:54:18.169086 14107 solver.cpp:66] Iteration 10000, loss = 0.0033857
I0207 12:54:18.169108 14107 solver.cpp:84] Testing net
I0207 12:54:25.810292 14107 solver.cpp:111] Test score #0: 0.9909
I0207 12:54:25.810333 14107 solver.cpp:111] Test score #1: 0.0285976
I0207 12:54:25.811945 14107 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 12:54:25.815465 14107 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 12:54:25.818124 14107 solver.cpp:78] Optimization Done.
I0207 12:54:25.818137 14107 train_net.cpp:34] Optimization Done.

real 26m25.259s
user 26m26.499s
sys 0m0.724s

CPU MKL

I0207 13:34:29.381631 4691 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 13:34:29.384047 4691 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:34:29.387784 4691 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:34:29.390490 4691 solver.cpp:66] Iteration 10000, loss = 0.0033857
I0207 13:34:29.390512 4691 solver.cpp:84] Testing net
I0207 13:34:37.038708 4691 solver.cpp:111] Test score #0: 0.9909
I0207 13:34:37.038748 4691 solver.cpp:111] Test score #1: 0.0285976
I0207 13:34:37.040276 4691 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:34:37.043890 4691 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:34:37.046598 4691 solver.cpp:78] Optimization Done.
I0207 13:34:37.046612 4691 train_net.cpp:34] Optimization Done.

real 26m43.919s
user 26m45.056s
sys 0m0.768s

GPU

I0207 13:40:54.950667 24846 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 13:40:54.962781 24846 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:40:54.967131 24846 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:40:54.970029 24846 solver.cpp:66] Iteration 10000, loss = 0.00247615
I0207 13:40:54.970067 24846 solver.cpp:84] Testing net
I0207 13:40:56.242010 24846 solver.cpp:111] Test score #0: 0.991
I0207 13:40:56.242048 24846 solver.cpp:111] Test score #1: 0.0284187
I0207 13:40:56.242781 24846 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 13:40:56.246444 24846 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 13:40:56.249151 24846 solver.cpp:78] Optimization Done.
I0207 13:40:56.249166 24846 train_net.cpp:34] Optimization Done.

real 3m50.187s
user 3m3.219s
sys 0m50.039s

Reply to this email directly or view it on GitHubhttps://github.com//issues/16#issuecomment-34407255
.

@aravindhm

Would it help to replace some of the code with parallel for loops? Eigen does not exploit the multiple cores present in most workstations except for matrix-matrix multiplication. For example, the ReLU layer (or any simple activation function) performs an independent operation for every neuron; it could be made faster using #pragma omp parallel for.
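
For instance, something along these lines (a standalone sketch compiled with -fopenmp, not a patch against the actual layer code):

#include <algorithm>

// Element-wise ReLU over a flat buffer. Every iteration is independent,
// so the loop can be split across cores with a single OpenMP pragma.
void relu_forward(const float* bottom, float* top, const int count) {
#pragma omp parallel for
  for (int i = 0; i < count; ++i) {
    top[i] = std::max(bottom[i], 0.0f);
  }
}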

@kloudkl
Contributor

kloudkl commented Feb 7, 2014

@aravindhm, I had the same idea after observing that training on the CPU is single-threaded, and experimented with parallelizing using OpenMP. But the test accuracy stayed at the random-guess level. Then I realized that there was a conflict between OpenMP and BLAS, and that the correct solution is to take advantage of a multi-threaded BLAS library such as OpenBLAS. See my reference to #79 above.

@kloudkl
Contributor

kloudkl commented Feb 7, 2014

The updated benchmark exploiting multi-threaded OpenBLAS showed such a great speed-up that training on a multi-core CPU can be as fast as, or even faster than, training on a GPU. Now it becomes more realistic to benchmark with a larger-scale dataset.

cd data
OPENBLAS_NUM_THREADS=8 OMP_NUM_THREADS=8 time ./train_mnist.sh

CPU boost-eigen

I0207 18:41:32.068876  8664 solver.cpp:204] Iteration 10000, lr = 0.00594604
I0207 18:41:32.071004  8664 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 18:41:32.074946  8664 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 18:41:32.078304  8664 solver.cpp:66] Iteration 10000, loss = 0.00375376
I0207 18:41:32.078330  8664 solver.cpp:84] Testing net
I0207 18:41:33.663113  8664 solver.cpp:111] Test score #0: 0.9911
I0207 18:41:33.663157  8664 solver.cpp:111] Test score #1: 0.0282938
I0207 18:41:33.664984  8664 solver.cpp:126] Snapshotting to lenet_iter_10000
I0207 18:41:33.668848  8664 solver.cpp:133] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 18:41:33.671816  8664 solver.cpp:78] Optimization Done.
I0207 18:41:33.671834  8664 train_net.cpp:34] Optimization Done.
768.49user 538.33system 5:27.77elapsed 398%CPU

CPU MKL

I0207 19:00:01.696180 27157 solver.cpp:207] Iteration 10000, lr = 0.00594604
I0207 19:00:01.696760 27157 solver.cpp:65] Iteration 10000, loss = 0.00308708
I0207 19:00:01.696787 27157 solver.cpp:87] Testing net
I0207 19:00:02.968822 27157 solver.cpp:114] Test score #0: 0.9905
I0207 19:00:02.968865 27157 solver.cpp:114] Test score #1: 0.0284175
I0207 19:00:02.970607 27157 solver.cpp:129] Snapshotting to lenet_iter_10000
I0207 19:00:02.974674 27157 solver.cpp:136] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 19:00:02.979106 27157 solver.cpp:129] Snapshotting to lenet_iter_10000
I0207 19:00:02.984369 27157 solver.cpp:136] Snapshotting solver state to lenet_iter_10000.solverstate
I0207 19:00:02.990788 27157 solver.cpp:81] Optimization Done.
I0207 19:00:02.990809 27157 train_net.cpp:34] Optimization Done.
1121.49user 18.71system 4:45.62elapsed 399%CPU

@Yangqing
Member

A proposal has been made at #97 - please kindly discuss there. Closing this to reduce duplicates.

aidangomez added commits to alejandro-isaza/caffe that referenced this issue on Aug 19 and Sep 1, 2015