Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.
Work in progress! I am still working through each convolution module in each library, THIS IS NOT AN EXHAUSTIVE LIST!
- After getting an initial baseline with the single module below (and getting inital benchmark scripts), I will benchmark a full AlexNet/MattNet/Overfeat
Machine: 6-core Intel i7-3930K @ 3.20GHz
+ NVIDIA Titan Black
+ Ubuntu 14.04 x86_64
###Spatial Convolution layer (3D input 3D output, densely connected)
Original Library | Class/Function Benchmarked | Total Time (ms) | Total forward time (ms) | Total backward time (ms) | Peak Memory Formula | Limitations |
---|---|---|---|---|---|---|
Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 | ||
Caffe | ConvolutionLayer | 1787 | 537 | 1250 | ||
cuda-convnet2 * | ConvLayer | 1818 | 416 | 1402 | ||
NVidia CuDNN * | cudnn.SpatialConvolution | 1861 | 513 | 1348 | ||
Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 | ||
Torch-7 | SpatialConvolutionMM | 1936 | 581 | 1355 | ||
Theano (experimental) | CorrMM | 2063 | 630 | 1433 | ||
cuda-convnet** | pylearn2.cuda_convnet | 3287 | 727 | 2560 | ||
ccv | ccv_convnet_layer | 809+bw | 809 | |||
Theano (legacy) | conv2d | 70774 | 3833 | 66941 | ||
cherry-picking**** | best per layer | 985 | 191 | 794 |
- * indicates that the library was tested with Torch bindings of the specific kernels.
- ** indicates that the library was tested with Pylearn2 bindings.
- *** This is an experimental module which used FFT to calculate convolutions. It uses a lot of memory according to @benanne
- **** The last row shows results obtainable when choosing the best-performing library for each layer.
- L1 - Input:
128x128
Batch-size128
, Feature maps:3->96
, Kernel Size:11x11
, Stride:1x1
- L2 - Input:
64x64
Batch-size128
, Feature maps:64->128
, Kernel Size:9x9
, Stride:1x1
- L3 - Input:
32x32
Batch-size128
, Feature maps:128->128
, Kernel Size:9x9
, Stride:1x1
- L4 - Input:
16x16
Batch-size128
, Feature maps:128->128
, Kernel Size:7x7
, Stride:1x1
- L5 - Input:
13x13
Batch-size128
, Feature maps:384->384
, Kernel Size:3x3
, Stride:1x1
- The table is ranked according to the total time forward+backward calls for layers (L1 + L2 + L3 + L4 + L5)
#####Breakdown
Columns L1, L2, L3, L4, L5, Total are times in milliseconds
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
Theano (experimental)*** | conv2d_fft | 138 | 73 | 30 | 9 | 39 | 304 |
Caffe | ConvolutionLayer<Dtype> | 100 | 205 | 158 | 35 | 39 | 537 |
cuda-convnet2 * | ConvLayer | 63 | 241 | 86 | 9 | 17 | 416 |
NVidia CuDNN | cudnn.SpatialConvolution | 94 | 274 | 101 | 12 | 32 | 513 |
Torch-7 | nn.SpatialConvolutionBHWD | 182 | 279 | 94 | 11 | 15 | 581 |
Torch-7 | nn.SpatialConvolutionMM | 105 | 239 | 168 | 32 | 37 | 581 |
Theano (experimental) | CorrMM | 100 | 251 | 197 | 38 | 44 | 630 |
cuda-convnet** | pylearn2.cuda_convnet | 92 | 412 | 159 | 19 | 45 | 727 |
ccv | ccv_convnet_layer | 121 | 437 | 182 | 23 | 44 | 809 |
Theano (legacy) | conv2d | 408 | 2310 | 739 | 99 | 277 | 3833 |
cherry-picking**** | best per layer | 63 | 72 | 30 | 9 | 17 | 191 |
Columns L1, L2, L3, L4, L5, Total are times in milliseconds
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
Theano (experimental)*** | conv2d_fft | 449 | 218 | 89 | 28 | 90 | 874 |
Caffe | ConvolutionLayer<Dtype> | 307 | 599 | 242 | 42 | 60 | 1250 |
cuda-convnet2 * | ConvLayer | 586 | 570 | 190 | 19 | 37 | 1402 |
NVidia CuDNN | cudnn.SpatialConvolution | 226 | 736 | 297 | 32 | 57 | 1348 |
Torch-7 | nn.SpatialConvolutionBHWD | 513 | 562 | 187 | 21 | 28 | 1311 |
Torch-7 | nn.SpatialConvolutionMM | 301 | 673 | 270 | 47 | 64 | 1355 |
Theano (experimental) | CorrMM | 282 | 733 | 295 | 51 | 72 | 1433 |
cuda-convnet** | pylearn2.cuda_convnet | 618 | 1305 | 473 | 50 | 114 | 2560 |
ccv | ccv_convnet_layer | ||||||
Theano (legacy) | conv2d | 53997 | 9752 | 2202 | 299 | 691 | 66941 |
cherry-picking**** | best per layer | 285 | 337 | 118 | 17 | 37 | 794 |
Columns L1, L2, L3, L4, L5, Total are times in milliseconds
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
Theano (experimental)*** | conv2d_fft | 250 | 111 | 54 | 19 | 48 | 482 |
Caffe | ConvolutionLayer<Dtype> | 86 | 271 | 120 | 20 | 26 | 523 |
cuda-convnet2 * | ConvLayer | 131 | 230 | 82 | 8 | 16 | 467 |
Theano (experimental) | CorrMM | 87 | 328 | 142 | 25 | 31 | 613 |
NVidia CuDNN | cudnn.SpatialConvolution | 111 | 421 | 180 | 17 | 21 | 750 |
Torch-7 | nn.SpatialConvolutionBHWD | 276 | 277 | 102 | 11 | 14 | 680 |
Torch-7 | nn.SpatialConvolutionMM | 91 | 302 | 129 | 23 | 27 | 572 |
cuda-convnet** | pylearn2.cuda_convnet | 155 | 647 | 230 | 23 | 47 | 1102 |
ccv | ccv_convnet_layer | ||||||
Theano (legacy) | conv2d | 53340 | 2690 | 1044 | 171 | 406 | 57651 |
cherry-picking**** | best per layer | 86 | 230 | 82 | 8 | 16 | 422 |
Columns L1, L2, L3, L4, L5, Total are times in milliseconds
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
Theano (experimental)*** | conv2d_fft | 199 | 107 | 35 | 9 | 42 | 392 |
Caffe | ConvolutionLayer<Dtype> | 221 | 328 | 122 | 22 | 34 | 727 |
cuda-convnet2 * | ConvLayer | 455 | 340 | 108 | 11 | 21 | 935 |
NVidia CuDNN | cudnn.SpatialConvolution | 115 | 315 | 117 | 15 | 36 | 598 |
Torch-7 | nn.SpatialConvolutionBHWD | 237 | 285 | 85 | 10 | 14 | 631 |
Torch-7 | nn.SpatialConvolutionMM | 210 | 371 | 141 | 24 | 37 | 783 |
Theano (experimental) | CorrMM | 195 | 405 | 153 | 26 | 41 | 820 |
cuda-convnet** | pylearn2.cuda_convnet | 463 | 658 | 243 | 27 | 67 | 2069 |
ccv | ccv_convnet_layer | ||||||
Theano (legacy) | conv2d | 657 | 7062 | 1158 | 128 | 285 | 9290 |
cherry-picking**** | best per layer | 199 | 107 | 36 | 9 | 21 | 372 |