Add steps to install multi-threaded OpenBLAS on Ubuntu #80

Closed
wants to merge 1 commit into from

Conversation

kloudkl (Contributor) commented Feb 7, 2014

Multi-threaded OpenBLAS makes a huge performance difference. The benchmarks with and without it in the comments on #16 demonstrated a speed-up of more than 5x for boost-eigen and MKL on a machine with 4 Hyper-Threading CPU cores (8 hardware threads).

This fixes #79.
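For reference, a minimal sketch of the kind of steps this PR adds (not the PR's actual diff; the package name, alternatives group, and build flags are assumptions that vary by Ubuntu release):

```sh
# Option A: Ubuntu's packaged OpenBLAS
sudo apt-get install libopenblas-dev
# Select OpenBLAS as the system BLAS (interactive menu):
sudo update-alternatives --config libblas.so.3

# Option B: build from source with threading enabled
git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make USE_THREAD=1 NUM_THREADS=8   # set NUM_THREADS to your hardware thread count
sudo make PREFIX=/usr/local install
```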

Yangqing (Member) commented Feb 7, 2014

Are you sure that when using boost-eigen you are compiling with multi-threading enabled? boost-eigen comes with a multithreaded gemm out of the box, which would probably account for most of the gain you are observing.
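(For context: Eigen parallelizes its matrix products only when compiled with OpenMP, and the thread count then follows OMP_NUM_THREADS or Eigen::setNbThreads. A sketch, where bench_gemm.cpp is a hypothetical benchmark source and the Eigen include path is assumed:)

```sh
# Eigen's gemm is multithreaded only if the code is built with OpenMP:
g++ -O3 -fopenmp -I/usr/include/eigen3 bench_gemm.cpp -o bench_gemm
# The thread count is then picked up from the environment at runtime:
OMP_NUM_THREADS=8 ./bench_gemm
```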


kloudkl (Contributor, Author) commented Feb 7, 2014

To make it clear whether OpenBLAS or Eigen contributed to the performance improvements in the boost-eigen branch, three groups of benchmark experiments with different compilation flags were conducted using the lenet*.prototxt files. In all experiments, max iter is set to 3,000 and solver_mode is set to 0 in lenet_solver.prototxt.

| cf_id | compilation flags |
| --- | --- |
| 1 | `-latlas -lcblas -fopenmp` |
| 2 | `-lopenblas` |
| 3 | `-lopenblas -fopenmp` |

To check the effect of the number of threads, three combinations of runtime environment variables were tested.

| rev_id | runtime environment variables |
| --- | --- |
| 1 | (none) |
| 2 | `OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4` |
| 3 | `OPENBLAS_NUM_THREADS=8 OMP_NUM_THREADS=8` |

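(As an illustration, a rev_id 2 run would be launched roughly as follows; the training script path is an assumption, since the exact invocation is not shown in the thread:)

```sh
# rev_id 2: cap both OpenBLAS's and OpenMP's thread pools at 4, then time the run
OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 time ./examples/mnist/train_lenet.sh
```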

The results (times in seconds):

| cf_id | rev_id | real | user | system |
| --- | --- | --- | --- | --- |
| 1 | 1 | 500.638 | 500.559 | 0.328 |
| 1 | 2 | 501.15 | 501.37 | 0.26 |
| 2 | 1 | 99.787 | 230.694 | 166.238 |
| 2 | 2 | 99.42 | 228.74 | 166.25 |
| 2 | 3 | 100.56 | 232.78 | 166.66 |
| 3 | 1 | 99.915 | 231.802 | 165.206 |
| 3 | 2 | 99.34 | 229.79 | 165.15 |
| 3 | 3 | 99.73 | 232.86 | 163.89 |

Comparing the results for compilation flags 1 and 3, it is evident that multi-threaded OpenBLAS runs about 5 times faster than plain ATLAS. The near-identical performance of compilation flags 2 and 3 shows that enabling OpenMP for Eigen does not help at all in this setting.

Yangqing (Member) commented Feb 7, 2014

I still do not think you are using the multithreaded version of eigen3. Given benchmarks such as

https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba

it would be extremely unlikely that eigen itself is bad at multithreading. Could you double-check with a gemm call that your eigen version is using multiple threads (e.g. by looking at top)?

Again, using lenet is not a good idea for benchmarking; use net_speed_test instead, which fits real-world use cases better.
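(One concrete form of the suggested check, as a sketch; the Eigen include path and the matrix size are assumptions:)

```sh
# Build a tiny Eigen gemm and watch its core usage while it runs.
cat > /tmp/gemm_check.cpp <<'EOF'
#include <Eigen/Dense>
#include <iostream>
int main() {
  std::cout << "Eigen reports " << Eigen::nbThreads() << " threads\n";
  Eigen::MatrixXf a = Eigen::MatrixXf::Random(4096, 4096);
  Eigen::MatrixXf b = Eigen::MatrixXf::Random(4096, 4096);
  Eigen::MatrixXf c = a * b;     // gemm: should keep all cores busy
  std::cout << c.sum() << "\n";  // use the result so the product is not optimized away
}
EOF
g++ -O2 -fopenmp -I/usr/include/eigen3 /tmp/gemm_check.cpp -o /tmp/gemm_check
OMP_NUM_THREADS=8 /tmp/gemm_check &   # run in the background...
top -p $!                             # ...CPU% near 800 means 8 busy threads
```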


Yangqing (Member) commented Feb 7, 2014

I'd like to make my arguments clear:

(1) I am not comparing ATLAS with OpenBLAS - it is known that ATLAS is inherently single-threaded. Eigen compiled with an ATLAS backend is not what I mean here - I mean the native BLAS implementations inside Eigen.

(2) Small datasets like MNIST do not reflect actual use cases such as ImageNet. In ImageNet experiments, more than 80% of the computation time is spent on gemm, so it really boils down to whether gemm is parallelized or not - and I believe Eigen does have gemm parallelized. Please provide a more detailed analysis of where the speedup comes from and why, rather than an end-to-end run (honestly, that may not reveal much information).


Yangqing (Member) commented Feb 7, 2014

I looked at the code more closely and now have a clearer picture of what caused this: in caffe/util/math_functions.cpp the gemm calls are still made through cblas_gemm instead of the Eigen function, so the framework is effectively still using ATLAS rather than Eigen to carry out gemm. I will close this issue and open a separate issue for this necessary change in boost-eigen. If you would like to do a more detailed comparison, please feel free to. Thanks for finding this bug!
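(Two quick ways to confirm which gemm path is actually taken; the file and library paths below are assumptions based on the standard repository layout:)

```sh
# Are the math wrappers still calling CBLAS entry points?
grep -n 'cblas_[sd]gemm' src/caffe/util/math_functions.cpp
# Which BLAS implementation does the built library link against?
ldd build/lib/libcaffe.so | grep -i blas
```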

shelhamer (Member) commented

Thank you for all this benchmarking work!

shelhamer (Member) commented

INSTALL.md has been replaced with a pointer to the online installation documentation to avoid the overhead of duplication, so refer to #81.

kloudkl deleted the multi_threaded_blas branch February 11, 2014 05:47
jeffhammond commented
This statement is categorically false: "it is known that ATLAS is inherently single-threaded." ATLAS has been threaded for 5+ years:

http://math-atlas.sourceforge.net/faq.html#tnum
http://math-atlas.sourceforge.net/timing/newThr395/index.html

thatguymike added a commit to thatguymike/caffe that referenced this pull request Dec 2, 2015
Add cudnn v4 batch normalization integration
myfavouritekk pushed a commit to myfavouritekk/caffe that referenced this pull request Aug 11, 2016
* Fix boost shared_ptr issue in python interface

* Default output model name for bn convert style script

* Fix bugs in generation bn inference model

* Script to convert inner product to convolution

* Script to do polyak averaging
myfavouritekk added a commit to myfavouritekk/caffe that referenced this pull request Aug 11, 2016
standardize memory optimization configurations

* yjxiong/fix/mem_config:
  take care of share data with excluded blob
  improvise memory opt configs
  fix cudnn conv legacy bug (BVLC#96)
  add TOC
  Update README.md
  Update README.md (BVLC#95)
  Update README.md
  Improve the python interface (BVLC#80)
  Update README.md
myfavouritekk added a commit to myfavouritekk/caffe that referenced this pull request Aug 15, 2016
…caffe into imagenet_vid_2016

* 'imagenet_vid_2016' of https://github.com/myfavouritekk/caffe:
  take care of share data with excluded blob
  Revert "Fix a but when setting no_mem_opt: true for layers near in-place layers."
  improvise memory opt configs
  fix cudnn conv legacy bug (BVLC#96)
  add TOC
  Update README.md
  Update README.md (BVLC#95)
  Update README.md
  Improve the python interface (BVLC#80)
  Update README.md
Development

Successfully merging this pull request may close these issues.

Support multithreading in the CPU mode of Solver::Solve
4 participants