Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The speed of Deep_Residual_Learning_CIFAR-10.py #57

Open
kli-casia opened this issue Apr 17, 2016 · 16 comments
Open

The speed of Deep_Residual_Learning_CIFAR-10.py #57

kli-casia opened this issue Apr 17, 2016 · 16 comments

Comments

@kli-casia
Copy link

I am using Ubuntu 14.04 with a titan x GPU and cudnn v4.
When I run the Deep_Residual_Learning_CIFAR-10.py script
The results are as follows

Using gpu device 3: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 4007)
Loading data...
Building model and compiling functions...
number of parameters in model: 464154
Starting training...
Epoch 1 of 82 took 194.576s
  training loss:                1.885696
  validation loss:              1.185959
  validation accuracy:          58.67 %
Epoch 2 of 82 took 192.870s
  training loss:                1.241701
  validation loss:              0.860152
  validation accuracy:          70.31 %
Epoch 3 of 82 took 196.294s
  training loss:                0.969671
  validation loss:              0.767420
  validation accuracy:          75.22 %
Epoch 4 of 82 took 193.210s
  training loss:                0.831287
  validation loss:              0.929070
  validation accuracy:          72.58 %
Epoch 5 of 82 took 194.127s
  training loss:                0.757095
  validation loss:              0.628573
  validation accuracy:          79.36 %
Epoch 6 of 82 took 194.738s
  training loss:                0.710511
  validation loss:              0.578690
  validation accuracy:          80.73 %

Is this running speed normal?

@f0k
Copy link
Member

f0k commented Apr 18, 2016

Is this running speed normal?

I don't know for this particular case, but in general, if you're in doubt, you can run the script with CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=1 python the_script.py, stop it after one epoch and check if there are any CPU operations in the graph (i.e., operations that do not have Gpu in their name). If so, something is wrong with the script or your setup. Does the Lasagne MNIST example run at normal speed (on a Titan X, it should probably take under 1 sec per epoch for the CNN architecture, and under .3 sec for the MLP)?

@ebenolson
Copy link
Member

For me on Titan X with cuDNN v4, epoch time is about 126s.

@PatrickBue
Copy link

Hi,
I am also using a Titan X, cuDNN v5, unmodified code, but am getting epoch times of around 900 seconds:

Using gpu device 0: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 5005)
Epoch 1 of 82 took 862.625s
  training loss:        2.011102
  validation loss:      1.258493
  validation accuracy:      54.50 %
Epoch 2 of 82 took 914.957s
  training loss:        1.258001
  validation loss:      0.854177
  validation accuracy:      70.54 %
Epoch 3 of 82 took 887.374s
  training loss:        0.965086
  validation loss:      0.753001
  validation accuracy:      74.92 %

@f0k
Copy link
Member

f0k commented May 18, 2016

Interesting -- could you please check what I posted before (#57 (comment))?

@PatrickBue
Copy link

PatrickBue commented May 18, 2016

Hi, thanks for looking into this.

I think my system is correctly set up (python 2.7.11, 64bit, lasagne version 0.2.dev1). I verified that:

  • Theano's CNN MNIST example runs almost 2x faster (18min) on my Titan X compared to the GTX 480.
  • Lasagne's CNN MNIST example on my machine takes 9 seconds per iteration with cuDNN, and 17 without. I could not find a reference time, so can only guess that this is as expected.

I ran the profiling (1 epoch, using 1,000 instead of all 100,000 training examples), and looks like it only uses GPU operations:

Using gpu device 0: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 5005)
Function profiling
==================
  Message: Deep_Residual_Learning_CIFAR-10.py:247
  Time in 7 calls to Function.__call__: 7.454000e+00s
  Time in Function.fn.__call__: 7.454000e+00s (100.000%)
  Time in thunks: 5.195546e+00s (69.701%)
  Total compile time: 5.661100e+01s
    Number of Apply nodes: 2301
    Theano Optimizer time: 5.039100e+01s
       Theano validate time: 5.015000e+00s
    Theano Linker time (includes C, CUDA code generation/compiling): 3.797000e+00s
       Import time 2.960000e-01s

Time in all call to theano.grad() 1.765000e+00s
Time since theano import 99.413s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  77.5%    77.5%       4.029s       5.54e-04s     C     7266    1038   theano.sandbox.cuda.basic_ops.GpuElemwise
   9.2%    86.8%       0.479s       1.43e-03s     C      336      48   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
   5.7%    92.5%       0.297s       1.68e-04s     C     1764     252   theano.sandbox.cuda.basic_ops.GpuCAReduce
   2.4%    94.9%       0.125s       5.96e-04s     C      210      30   theano.sandbox.cuda.dnn.GpuDnnConvGradI
   2.1%    97.0%       0.109s       5.04e-04s     C      217      31   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   0.9%    97.9%       0.047s       9.57e-04s     C       49       7   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.6%    98.5%       0.031s       1.44e-04s     C      217      31   theano.sandbox.cuda.dnn.GpuDnnConv
   0.6%    99.1%       0.031s       1.49e-03s     C       21       3   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.3%    99.4%       0.016s       7.35e-06s     C     2128     304   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.3%    99.7%       0.016s       2.30e-05s     C      679      97   theano.tensor.opt.MakeVector
   0.3%   100.0%       0.016s       5.58e-04s     C       28       4   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   0.0%   100.0%       0.000s       0.00e+00s     C      959     137   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.0%   100.0%       0.000s       0.00e+00s     C      896     128   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       0.00e+00s     C      644      92   theano.sandbox.cuda.dnn.GpuDnnConvDesc
   0.0%   100.0%       0.000s       0.00e+00s     C      518      74   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.000s       0.00e+00s     C       91      13   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.000s       0.00e+00s     C       28       4   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.0%   100.0%       0.000s       0.00e+00s     C       14       2   theano.sandbox.cuda.basic_ops.GpuReshape
   0.0%   100.0%       0.000s       0.00e+00s     C       14       2   theano.sandbox.cuda.blas.GpuDot22
   0.0%   100.0%       0.000s       0.00e+00s     C        7       1   theano.sandbox.cuda.nnet.GpuCrossentropySoftmax1HotWithBiasDx
   ... (remaining 3 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  15.6%    15.6%       0.810s       8.57e-04s     C      945      135   GpuElemwise{sub,no_inplace}
   9.2%    24.8%       0.479s       1.43e-03s     C      336       48   GpuAllocEmpty
   8.8%    33.6%       0.457s       2.18e-03s     C      210       30   GpuElemwise{sgn,no_inplace}
   7.9%    41.5%       0.411s       1.95e-03s     C      210       30   GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}
   7.3%    48.8%       0.380s       1.75e-03s     C      217       31   GpuElemwise{Composite{((i0 * i1 * i2) + (i0 * i1 * i2 * i3))},no_inplace}
   6.4%    55.2%       0.333s       1.53e-03s     C      217       31   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}
   5.8%    61.0%       0.300s       1.38e-03s     C      217       31   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
   5.4%    66.4%       0.282s       1.30e-03s     C      217       31   GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}
   4.2%    70.7%       0.220s       1.01e-03s     C      217       31   GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}
   3.9%    74.6%       0.203s       1.81e-03s     C      112       16   GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}
   3.6%    78.2%       0.188s       1.73e-04s     C     1085      155   GpuCAReduce{add}{1,0,1,1}
   3.2%    81.4%       0.165s       1.57e-03s     C      105       15   GpuElemwise{Composite{(((i0 * i1) + i2) + i3)},no_inplace}
   2.4%    83.8%       0.125s       5.96e-04s     C      210       30   GpuDnnConvGradI{algo='none', inplace=True}
   2.4%    86.2%       0.125s       1.16e-04s     C     1078      154   GpuElemwise{mul,no_inplace}
   2.1%    88.3%       0.109s       5.04e-04s     C      217       31   GpuDnnConvGradW{algo='none', inplace=True}
   1.8%    90.1%       0.094s       4.32e-04s     C      217       31   GpuCAReduce{pre=sqr,red=add}{1,0,1,1}
   1.2%    91.3%       0.062s       1.10e-04s     C      567       81   GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}
   0.9%    92.2%       0.047s       3.72e-04s     C      126       18   GpuElemwise{Sub}[(0, 1)]
   0.9%    93.1%       0.047s       7.20e-05s     C      651       93   GpuElemwise{Composite{(i0 + (i1 - i2))},no_inplace}
   0.9%    94.0%       0.047s       2.31e-04s     C      203       29   GpuElemwise{Composite{(i0 * ((i1 * i2) + (i3 * i4)))},no_inplace}
   ... (remaining 63 Ops account for   6.02%(0.31s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   1.2%     1.2%       0.064s       9.10e-03s      7   651   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.2%     2.4%       0.063s       8.93e-03s      7   2225   GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0, GpuElemwise{sgn,no_inplace}.0, GpuDimShuffle{x,0,x,x}.0)
   1.2%     3.6%       0.063s       8.93e-03s      7   956   GpuElemwise{sub,no_inplace}(GpuDnnConv{algo='small', inplace=True}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
   1.2%     4.8%       0.063s       8.93e-03s      7   885   GpuElemwise{sgn,no_inplace}(GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}.0)
   0.9%     5.8%       0.049s       6.96e-03s      7   1159   GpuElemwise{sub,no_inplace}(GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
   0.9%     6.7%       0.047s       6.73e-03s      7   2144   GpuElemwise{Composite{((i0 * i1 * i2) + (i0 * i1 * i2 * i3))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{sgn,no_inplace}.0)
   0.9%     7.6%       0.047s       6.70e-03s      7   2034   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sgn,no_inplace}.0)
   0.9%     8.5%       0.047s       6.70e-03s      7   1979   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sgn,no_inplace}.0)
   0.9%     9.4%       0.047s       6.70e-03s      7   2117   GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0, GpuElemwise{sgn,no_inplace}.0, GpuDimShuffle{x,0,x,x}.0)
   0.9%    10.3%       0.047s       6.70e-03s      7   618   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   0.9%    11.2%       0.047s       6.70e-03s      7   594   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   0.9%    12.1%       0.047s       6.70e-03s      7   2244   GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}(GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}.0, GpuElemwise{Composite{((((-i0) / i1) / i2) / i3)},no_inplace}.0, CudaNdarrayConstant{[[[[-1.]]]]}, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
   0.9%    13.0%       0.047s       6.70e-03s      7   2145   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sgn,no_inplace}.0)
   0.9%    13.9%       0.047s       6.70e-03s      7   1779   GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}(GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}.0, GpuElemwise{Composite{((((-i0) / i1) / i2) / i3)},no_inplace}.0, CudaNdarrayConstant{[[[[-1.]]]]}, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
   0.9%    14.8%       0.047s       6.70e-03s      7   906   GpuElemwise{sgn,no_inplace}(GpuElemwise{Composite{(((i0 * i1) + i2) + i3)},no_inplace}.0)
   0.9%    15.7%       0.047s       6.70e-03s      7   2138   GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}(GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}.0, GpuElemwise{Composite{((((-i0) / i1) / i2) / i3)},no_inplace}.0, CudaNdarrayConstant{[[[[-1.]]]]}, GpuElemwise{mul,no_inplace}.0, GpuElemwise{sub,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)}}[(0, 0)].0)
   0.9%    16.6%       0.047s       6.70e-03s      7   1025   GpuElemwise{Composite{(((i0 * i1) + i2) + i3)},no_inplace}(GpuElemwise{sub,no_inplace}.0, GpuElemwise{mul,no_inplace}.0, GpuDimShuffle{x,0,x,x}.0, GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}.0)
   0.9%    17.5%       0.047s       6.70e-03s      7   936   GpuElemwise{sub,no_inplace}(GpuDnnConv{algo='small', inplace=True}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
   0.9%    18.4%       0.047s       6.70e-03s      7   976   GpuElemwise{sub,no_inplace}(GpuDnnConv{algo='small', inplace=True}.0, GpuElemwise{Composite{(((i0 / i1) / i2) / i3)},no_inplace}.0)
   0.9%    19.3%       0.047s       6.70e-03s      7   2062   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConvGradI{algo='none', inplace=True}.0, GpuElemwise{sgn,no_inplace}.0)
   ... (remaining 2281 Apply instances account for 80.69%(4.19s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
==================
  Message: Deep_Residual_Learning_CIFAR-10.py:258
  Time in 40 calls to Function.__call__: 1.340600e+01s
  Time in Function.fn.__call__: 1.340600e+01s (100.000%)
  Time in thunks: 9.471744e+00s (70.653%)
  Total compile time: 5.047000e+00s
    Number of Apply nodes: 597
    Theano Optimizer time: 4.375000e+00s
       Theano validate time: 2.189999e-01s
    Theano Linker time (includes C, CUDA code generation/compiling): 5.779998e-01s
       Import time 1.600003e-02s

Time in all call to theano.grad() 1.765000e+00s
Time since theano import 99.867s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  71.5%    71.5%       6.776s       5.46e-03s     C     1240      31   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
  12.5%    84.1%       1.188s       9.58e-04s     C     1240      31   theano.sandbox.cuda.dnn.GpuDnnConv
   9.6%    93.7%       0.907s       3.49e-04s     C     2600      65   theano.sandbox.cuda.basic_ops.GpuElemwise
   2.9%    96.5%       0.273s       3.41e-03s     C       80       2   theano.sandbox.cuda.basic_ops.GpuAlloc
   2.6%    99.2%       0.250s       1.56e-03s     C      160       4   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.3%    99.5%       0.031s       3.91e-04s     C       80       2   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   0.2%    99.7%       0.016s       1.09e-05s     C     1440      36   theano.tensor.opt.MakeVector
   0.2%    99.8%       0.016s       1.95e-04s     C       80       2   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.2%   100.0%       0.016s       1.95e-04s     C       80       2   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.0%   100.0%       0.000s       0.00e+00s     C     5120     128   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       0.00e+00s     C     5000     125   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.000s       0.00e+00s     C     2720      68   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.000s       0.00e+00s     C     2480      62   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.0%   100.0%       0.000s       0.00e+00s     C     1240      31   theano.sandbox.cuda.dnn.GpuDnnConvDesc
   0.0%   100.0%       0.000s       0.00e+00s     C       80       2   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.000s       0.00e+00s     C       40       1   theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
   0.0%   100.0%       0.000s       0.00e+00s     C       40       1   theano.sandbox.cuda.basic_ops.GpuReshape
   0.0%   100.0%       0.000s       0.00e+00s     C       40       1   theano.tensor.elemwise.DimShuffle
   0.0%   100.0%       0.000s       0.00e+00s     C       40       1   theano.tensor.basic.MaxAndArgmax
   0.0%   100.0%       0.000s       0.00e+00s     C       40       1   theano.sandbox.cuda.blas.GpuDot22
   ... (remaining 1 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  71.5%    71.5%       6.776s       5.46e-03s     C     1240       31   GpuAllocEmpty
  12.5%    84.1%       1.188s       9.58e-04s     C     1240       31   GpuDnnConv{algo='small', inplace=True}
   4.6%    88.7%       0.438s       6.84e-04s     C      640       16   GpuElemwise{Composite{(i0 * (Composite{(((i0 - i1) * i2) + i3)}(i1, i2, i3, i4) + Abs(Composite{(((i0 - i1) * i2) + i3)}(i1, i2, i3, i4))))}}[(0, 1)]
   4.5%    93.2%       0.422s       7.03e-04s     C      600       15   GpuElemwise{Composite{(i0 * (Composite{((((i0 - i1) * i2) + i3) + i4)}(i1, i2, i3, i4, i5) + Abs(Composite{((((i0 - i1) * i2) + i3) + i4)}(i1, i2, i3, i4, i5))))}}[(0, 1)]
   2.9%    96.0%       0.273s       3.41e-03s     C       80        2   GpuAlloc{memset_0=True}
   2.6%    98.7%       0.250s       1.56e-03s     C      160        4   GpuFromHost
   0.5%    99.2%       0.047s       3.78e-05s     C     1240       31   GpuElemwise{mul,no_inplace}
   0.3%    99.5%       0.031s       3.91e-04s     C       80        2   GpuIncSubtensor{InplaceSet;::, int64:int64:, int64:int64:, int64:int64:}
   0.2%    99.7%       0.016s       1.09e-05s     C     1440       36   MakeVector{dtype='int64'}
   0.2%    99.8%       0.016s       1.95e-04s     C       80        2   HostFromGpu
   0.2%   100.0%       0.016s       3.90e-04s     C       40        1   GpuCAReduce{add}{0,0,1}
   0.0%   100.0%       0.000s       0.00e+00s     C     4960      124   GpuDimShuffle{x,0,x,x}
   0.0%   100.0%       0.000s       0.00e+00s     C     2480       62   GpuContiguous
   0.0%   100.0%       0.000s       0.00e+00s     C     2320       58   Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}
   0.0%   100.0%       0.000s       0.00e+00s     C     1280       32   Shape_i{2}
   0.0%   100.0%       0.000s       0.00e+00s     C     1280       32   Shape_i{0}
   0.0%   100.0%       0.000s       0.00e+00s     C     1280       32   Shape_i{3}
   0.0%   100.0%       0.000s       0.00e+00s     C     1280       32   Shape_i{1}
   0.0%   100.0%       0.000s       0.00e+00s     C     1160       29   GpuDnnConvDesc{border_mode='half', subsample=(1, 1), conv_mode='cross', precision='float32'}
   0.0%   100.0%       0.000s       0.00e+00s     C      160        4   Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}
   ... (remaining 19 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   4.3%     4.3%       0.408s       1.02e-02s     40   426   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   4.2%     8.5%       0.400s       1.00e-02s     40   430   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   4.2%    12.7%       0.397s       9.93e-03s     40   422   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   4.0%    16.8%       0.382s       9.55e-03s     40   424   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   3.9%    20.6%       0.366s       9.14e-03s     40   442   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   3.8%    24.4%       0.363s       9.07e-03s     40   436   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   3.8%    28.3%       0.362s       9.04e-03s     40   438   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   3.8%    32.1%       0.360s       8.99e-03s     40   432   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   3.7%    35.8%       0.351s       8.77e-03s     40   428   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   3.5%    39.3%       0.332s       8.31e-03s     40   434   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   3.4%    42.6%       0.318s       7.96e-03s     40   440   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.5%    45.1%       0.236s       5.89e-03s     40   462   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.5%    47.6%       0.235s       5.86e-03s     40     4   GpuFromHost(inputs)
   2.2%    49.8%       0.206s       5.16e-03s     40   452   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.2%    51.9%       0.205s       5.12e-03s     40   458   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.0%    54.0%       0.192s       4.81e-03s     40   454   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.0%    55.9%       0.187s       4.69e-03s     40   450   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.9%    57.8%       0.179s       4.48e-03s     40   289   GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Shape_i{0}.0, TensorConstant{32}, TensorConstant{16}, TensorConstant{16})
   1.9%    59.7%       0.177s       4.43e-03s     40   444   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.8%    61.5%       0.174s       4.34e-03s     40   446   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}.0)
   ... (remaining 577 Apply instances account for 38.46%(3.64s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
==================
  Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
  Time in 47 calls to Function.__call__: 2.086000e+01s
  Time in Function.fn.__call__: 2.086000e+01s (100.000%)
  Time in thunks: 1.466729e+01s (70.313%)
  Total compile time: 6.165800e+01s
    Number of Apply nodes: 2301
    Theano Optimizer time: 5.476600e+01s
       Theano validate time: 5.234000e+00s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.375000e+00s
       Import time 3.120000e-01s

Time in all call to theano.grad() 1.765000e+00s
Time since theano import 99.945s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  49.5%    49.5%       7.255s       4.60e-03s     C     1576      79   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
  33.6%    83.1%       4.936s       5.00e-04s     C     9866    1103   theano.sandbox.cuda.basic_ops.GpuElemwise
   8.3%    91.4%       1.219s       8.37e-04s     C     1457      62   theano.sandbox.cuda.dnn.GpuDnnConv
   2.1%    93.6%       0.313s       1.69e-04s     C     1844     254   theano.sandbox.cuda.basic_ops.GpuCAReduce
   2.1%    95.6%       0.304s       3.01e-03s     C      101       5   theano.sandbox.cuda.basic_ops.GpuAlloc
   2.0%    97.7%       0.297s       1.42e-03s     C      209      11   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.9%    98.5%       0.125s       5.96e-04s     C      210      30   theano.sandbox.cuda.dnn.GpuDnnConvGradI
   0.7%    99.3%       0.109s       5.04e-04s     C      217      31   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   0.3%    99.6%       0.047s       4.34e-04s     C      108       6   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   0.2%    99.8%       0.031s       1.47e-05s     C     2119     133   theano.tensor.opt.MakeVector
   0.1%    99.9%       0.016s       2.19e-06s     C     7128     429   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.1%   100.0%       0.016s       1.45e-04s     C      108       6   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.0%   100.0%       0.000s       0.00e+00s     C     6016     256   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       0.00e+00s     C     3439     199   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.0%   100.0%       0.000s       0.00e+00s     C     3238     142   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.000s       0.00e+00s     C     1884     123   theano.sandbox.cuda.dnn.GpuDnnConvDesc
   0.0%   100.0%       0.000s       0.00e+00s     C      171      15   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%   100.0%       0.000s       0.00e+00s     C       54       3   theano.sandbox.cuda.basic_ops.GpuReshape
   0.0%   100.0%       0.000s       0.00e+00s     C       54       3   theano.sandbox.cuda.blas.GpuDot22
   0.0%   100.0%       0.000s       0.00e+00s     C       47       2   theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
   ... (remaining 6 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  49.5%    49.5%       7.255s       4.60e-03s     C     1576       79   GpuAllocEmpty
   8.3%    57.8%       1.219s       8.37e-04s     C     1457       62   GpuDnnConv{algo='small', inplace=True}
   5.5%    63.3%       0.810s       8.57e-04s     C      945      135   GpuElemwise{sub,no_inplace}
   3.1%    66.4%       0.457s       2.18e-03s     C      210       30   GpuElemwise{sgn,no_inplace}
   3.0%    69.4%       0.438s       6.84e-04s     C      640       16   GpuElemwise{Composite{(i0 * (Composite{(((i0 - i1) * i2) + i3)}(i1, i2, i3, i4) + Abs(Composite{(((i0 - i1) * i2) + i3)}(i1, i2, i3, i4))))}}[(0, 1)]
   2.9%    72.3%       0.422s       7.03e-04s     C      600       15   GpuElemwise{Composite{(i0 * (Composite{((((i0 - i1) * i2) + i3) + i4)}(i1, i2, i3, i4, i5) + Abs(Composite{((((i0 - i1) * i2) + i3) + i4)}(i1, i2, i3, i4, i5))))}}[(0, 1)]
   2.8%    75.1%       0.411s       1.95e-03s     C      210       30   GpuElemwise{Composite{(((i0 + i1) + ((i2 * i3 * i4) / i5)) + i6)},no_inplace}
   2.6%    77.7%       0.380s       1.75e-03s     C      217       31   GpuElemwise{Composite{((i0 * i1 * i2) + (i0 * i1 * i2 * i3))},no_inplace}
   2.3%    79.9%       0.333s       1.53e-03s     C      217       31   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * i2))},no_inplace}
   2.0%    82.0%       0.300s       1.38e-03s     C      217       31   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
   2.0%    84.0%       0.297s       1.42e-03s     C      209       11   GpuFromHost
   2.0%    86.0%       0.289s       3.07e-03s     C       94        4   GpuAlloc{memset_0=True}
   1.9%    87.9%       0.282s       1.30e-03s     C      217       31   GpuElemwise{Composite{(((i0 * i1 * i2) + (i0 * i1 * i2 * i3)) * i4)},no_inplace}
   1.5%    89.4%       0.220s       1.01e-03s     C      217       31   GpuElemwise{Composite{((i0 * i1) / i2)},no_inplace}
   1.4%    90.8%       0.203s       1.81e-03s     C      112       16   GpuElemwise{Composite{((i0 * i1) + i2)},no_inplace}
   1.3%    92.1%       0.188s       1.73e-04s     C     1085      155   GpuCAReduce{add}{1,0,1,1}
   1.2%    93.2%       0.172s       7.42e-05s     C     2318      185   GpuElemwise{mul,no_inplace}
   1.1%    94.4%       0.165s       1.57e-03s     C      105       15   GpuElemwise{Composite{(((i0 * i1) + i2) + i3)},no_inplace}
   0.9%    95.2%       0.125s       5.96e-04s     C      210       30   GpuDnnConvGradI{algo='none', inplace=True}
   0.7%    96.0%       0.109s       5.04e-04s     C      217       31   GpuDnnConvGradW{algo='none', inplace=True}
   ... (remaining 73 Ops account for   4.05%(0.59s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   2.8%     2.8%       0.408s       1.02e-02s     40   426   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.7%     5.5%       0.400s       1.00e-02s     40   430   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.7%     8.2%       0.397s       9.93e-03s     40   422   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.6%    10.8%       0.382s       9.55e-03s     40   424   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.5%    13.3%       0.366s       9.14e-03s     40   442   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.5%    15.8%       0.363s       9.07e-03s     40   436   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.5%    18.3%       0.362s       9.04e-03s     40   438   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.5%    20.7%       0.360s       8.99e-03s     40   432   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.4%    23.1%       0.351s       8.77e-03s     40   428   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.3%    25.4%       0.332s       8.31e-03s     40   434   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   2.2%    27.5%       0.318s       7.96e-03s     40   440   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.6%    29.1%       0.236s       5.89e-03s     40   462   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.6%    30.7%       0.235s       5.86e-03s     40     4   GpuFromHost(inputs)
   1.4%    32.1%       0.206s       5.16e-03s     40   452   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.4%    33.5%       0.205s       5.12e-03s     40   458   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.3%    34.9%       0.192s       4.81e-03s     40   454   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.3%    36.1%       0.187s       4.69e-03s     40   450   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.2%    37.4%       0.179s       4.48e-03s     40   289   GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Shape_i{0}.0, TensorConstant{32}, TensorConstant{16}, TensorConstant{16})
   1.2%    38.6%       0.177s       4.43e-03s     40   444   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i3) + i3)}}.0)
   1.2%    39.7%       0.174s       4.34e-03s     40   446   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}.0, Elemwise{Composite{((((i0 + (i1 * (i2 // i1))) - i2) // i1) + i3)}}.0)
   ... (remaining 2878 Apply instances account for 60.26%(8.84s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.

@f0k
Copy link
Member

f0k commented May 19, 2016

Lasagne's CNN MNIST example on my machine takes 9 seconds per iteration with cuDNN, and 17 without. I could not find a reference time, so can only guess that this is as expected.

As I said above (#57 (comment)), python examples/mnist.py cnn 10 should take under one second per epoch on a Titan X. On my desktop (GT 640), the Lasagne MNIST example for the CNN architecture takes 8.7s per epoch, and on a GTX 780 Ti it takes 1.1s per epoch. So 9 seconds is a lot too much for your hardware. Are you the only one running code on that GPU, or are there other concurrent experiments running?

I ran the profiling (1 epoch, using 1,000 instead of all 100,000 training examples), and looks like it only uses GPU operations:

There are no CPU Ops, right. But the runtime is dominated by GpuAllocEmpty -- this is not expected.

  1. Can you try enabling CNMeM? THEANO_FLAGS=lib.cnmem=0.3,profile=1 CUDA_LAUNCH_BLOCKING=1 python Deep_Residual_Learning_CIFAR-10.py?
  2. Can you profile the Lasagne MNIST example as well? THEANO_FLAGS=profile=1 CUDA_LAUNCH_BLOCKING=1 python examples/mnist.py cnn 10?
  3. Can you update Theano and Lasagne to the bleeding-edge versions and try again? (http://lasagne.readthedocs.io/en/latest/user/installation.html#bleeding-edge-version)

@PatrickBue
Copy link

Wow, that was it: after adding the cnmem=0.3 flag, the time per epoch went down to 95 seconds for CIFAR, and 0.4 seconds for MNIST. Thanks a lot!
Is there any profiling you want me to do?

@f0k
Copy link
Member

f0k commented May 19, 2016

Is there any profiling you want me to do?

I'd still like to see the profile for the MNIST example without cnmem, if possible: CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=lib.cnmem=0,allow_gc=1,profile=1 python examples/mnist.py cnn 10

For me, this increases per-epoch time from 1.1s to 1.6s, but not to 9s as for your machine. There's still something fishy!

@kli-casia
Copy link
Author

OMG, I even didn't notice the cnmem flag before...
I am also using Ubuntu14.04 + The latest version of theano and lasagne + TITIAN X + cudnn v5.

when cnmem is disabled, the time per epoch for Deep_Residual_Learning_CIFAR-10.py is 170 seconds.
when cnmem=0.3, the time per epoch went down to 104 seconds, huge speed up!

@PatrickBue , you get 95 seconds one epoch, faster than me...

It's better to enable cnmem when train CNN, am I right? @f0k

@PatrickBue
Copy link

@kli-nlpr: Try these flags here. Brings the 95 seconds per epoch down to 82 on my system:

[dnn.conv]
algo_bwd_filter = time_once
algo_bwd_data = time_once
algo_fwd = time_once

@f0k
Copy link
Member

f0k commented May 19, 2016

OMG, I even didn't notice the cnmem flag before...

It should be set by default in future, but they didn't do it yet.

It's better to enable cnmem when train CNN, am I right?

Yes, as we recommend here: https://github.com/Lasagne/Lasagne/wiki/From-Zero-to-Lasagne-on-Ubuntu-14.04#cudnn

If you can spare the memory, you can also set allow_gc=0 for some minor extra speedup. It will leave temporaries allocated between Ops and function calls instead of deallocating and allocating them each time. With cnMeM, allocation and deallocation is a lot faster, so it doesn't make a lot of difference, but it can still improve performance.

@PatrickBue
Copy link

@f0k, this is the profile output for mnist.py using the flags you suggested:

Loading data...
Building model and compiling functions...
Starting training...
Epoch 1 of 10 took 9.963s
  training loss:        1.544144
  validation loss:      0.337871
  validation accuracy:      91.11 %
Epoch 2 of 10 took 9.747s
  training loss:        0.496699
  validation loss:      0.178585
  validation accuracy:      95.02 %
Epoch 3 of 10 took 9.661s
  training loss:        0.338563
  validation loss:      0.131325
  validation accuracy:      96.22 %
Epoch 4 of 10 took 9.877s
  training loss:        0.277647
  validation loss:      0.113841
  validation accuracy:      96.78 %
Epoch 5 of 10 took 9.854s
  training loss:        0.240854
  validation loss:      0.098375
  validation accuracy:      97.19 %
Epoch 6 of 10 took 9.697s
  training loss:        0.215437
  validation loss:      0.087286
  validation accuracy:      97.51 %
Epoch 7 of 10 took 9.794s
  training loss:        0.191024
  validation loss:      0.079255
  validation accuracy:      97.65 %
Epoch 8 of 10 took 9.742s
  training loss:        0.179124
  validation loss:      0.072846
  validation accuracy:      97.88 %
Epoch 9 of 10 took 9.578s
  training loss:        0.168340
  validation loss:      0.068834
  validation accuracy:      97.94 %
Epoch 10 of 10 took 9.868s
  training loss:        0.156605
  validation loss:      0.067654
  validation accuracy:      98.06 %
Final results:
  test loss:            0.061727
  test accuracy:        97.99 %
DEBUG: nvcc STDOUT mod.cu

   Creating library C:/Users/pabuehle/AppData/Local/Theano/compiledir_Windows-10-10.0.10586-Intel64_Family_6_Model_63_Stepping_2_GenuineIntel-2.7.11-64/tmpom4ivu/265abc51f7c376c224983485238ff1a5.lib and object C:/Users/pabuehle/AppData/Local/Theano/compiledir_Windows-10-10.0.10586-Intel64_Family_6_Model_63_Stepping_2_GenuineIntel-2.7.11-64/tmpom4ivu/265abc51f7c376c224983485238ff1a5.exp


Using gpu device 0: GeForce GTX TITAN X (CNMeM is disabled, cuDNN 5005)
Function profiling
==================
  Message: mnist.py:283
  Time in 1000 calls to Function.__call__: 9.057300e+01s
  Time in Function.fn.__call__: 9.040100e+01s (99.810%)
  Time in thunks: 6.183903e+01s (68.275%)
  Total compile time: 1.594000e+00s
    Number of Apply nodes: 124
    Theano Optimizer time: 1.047000e+00s
       Theano validate time: 7.800007e-02s
    Theano Linker time (includes C, CUDA code generation/compiling): 5.000000e-01s
       Import time 3.580000e-01s

Time in all call to theano.grad() 3.100014e-02s
Time since theano import 113.055s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  29.0%    29.0%      17.947s       5.13e-04s     C    35000      35   theano.sandbox.cuda.basic_ops.GpuElemwise
  23.8%    52.8%      14.699s       2.94e-03s     C     5000       5   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
  23.2%    75.9%      14.317s       7.16e-03s     C     2000       2   theano.sandbox.cuda.dnn.GpuDnnPoolGrad
   6.7%    82.6%       4.112s       2.06e-03s     C     2000       2   theano.sandbox.cuda.dnn.GpuDnnPool
   3.5%    86.1%       2.142s       2.14e-03s     C     1000       1   theano.sandbox.cuda.dnn.GpuDnnConvGradI
   2.8%    88.9%       1.752s       8.76e-04s     C     2000       2   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   2.5%    91.4%       1.535s       3.84e-04s     C     4000       4   theano.sandbox.cuda.blas.GpuDot22
   2.5%    93.8%       1.531s       7.66e-04s     C     2000       2   theano.sandbox.cuda.dnn.GpuDnnConv
   2.1%    95.9%       1.298s       2.60e-04s     C     5000       5   theano.sandbox.cuda.basic_ops.GpuCAReduce
   1.3%    97.3%       0.815s       4.07e-04s     C     2000       2   theano.sandbox.cuda.blas.GpuDot22Scalar
   1.2%    98.5%       0.737s       2.46e-04s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.9%    99.4%       0.564s       2.82e-04s     C     2000       2   theano.sandbox.rng_mrg.GPU_mrg_uniform
   0.2%    99.5%       0.109s       1.09e-04s     C     1000       1   theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
   0.2%    99.7%       0.094s       9.38e-05s     C     1000       1   theano.sandbox.cuda.nnet.GpuCrossentropySoftmax1HotWithBiasDx
   0.1%    99.8%       0.078s       7.81e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.047s       9.38e-06s     C     5000       5   theano.sandbox.cuda.dnn.GpuDnnConvDesc
   0.1%   100.0%       0.047s       4.69e-06s     C    10000      10   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.016s       2.23e-06s     C     7000       7   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.000s       0.00e+00s     C    13000      13   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       0.00e+00s     C    11000      11   theano.sandbox.cuda.basic_ops.GpuContiguous
   ... (remaining 2 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  23.8%    23.8%      14.699s       2.94e-03s     C     5000        5   GpuAllocEmpty
  23.2%    46.9%      14.317s       7.16e-03s     C     2000        2   GpuDnnPoolGrad{mode='max'}
  19.7%    66.6%      12.188s       6.09e-03s     C     2000        2   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
   6.7%    73.3%       4.112s       2.06e-03s     C     2000        2   GpuDnnPool{mode='max'}
   3.5%    76.7%       2.142s       2.14e-03s     C     1000        1   GpuDnnConvGradI{algo='none', inplace=True}
   2.8%    79.6%       1.762s       5.87e-04s     C     3000        3   GpuElemwise{Add}[(0, 0)]
   2.8%    82.4%       1.752s       8.76e-04s     C     2000        2   GpuDnnConvGradW{algo='none', inplace=True}
   2.5%    84.9%       1.535s       3.84e-04s     C     4000        4   GpuDot22
   2.5%    87.4%       1.531s       7.66e-04s     C     2000        2   GpuDnnConv{algo='small', inplace=True}
   2.2%    89.6%       1.391s       6.96e-04s     C     2000        2   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)]
   1.6%    91.3%       1.016s       5.08e-04s     C     2000        2   GpuCAReduce{add}{1,0,1,1}
   1.3%    92.6%       0.815s       4.07e-04s     C     2000        2   GpuDot22Scalar
   1.2%    93.8%       0.737s       2.46e-04s     C     3000        3   GpuFromHost
   1.0%    94.7%       0.595s       5.95e-04s     C     1000        1   GpuElemwise{mul,no_inplace}
   0.9%    95.6%       0.531s       6.64e-05s     C     8000        8   GpuElemwise{Composite{((i0 * i1) + (i2 - i3))}}[(0, 2)]
   0.9%    96.5%       0.531s       6.64e-05s     C     8000        8   GpuElemwise{Composite{((i0 * i1) - i2)}}[(0, 1)]
   0.7%    97.2%       0.423s       4.23e-04s     C     1000        1   GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}
   0.5%    97.7%       0.330s       3.30e-04s     C     1000        1   GpuElemwise{Composite{((i0 + Abs(i0)) * i1)},no_inplace}
   0.5%    98.2%       0.306s       6.11e-05s     C     5000        5   GpuElemwise{Mul}[(0, 1)]
   0.4%    98.6%       0.235s       1.17e-04s     C     2000        2   GpuCAReduce{add}{1,0}
   ... (remaining 27 Ops account for   1.44%(0.89s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  21.1%    21.1%      13.061s       1.31e-02s   1000   114   GpuDnnPoolGrad{mode='max'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
  17.2%    38.3%      10.613s       1.06e-02s   1000    46   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0)
  16.9%    55.2%      10.458s       1.05e-02s   1000    56   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
   5.8%    61.0%       3.596s       3.60e-03s   1000    60   GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
   4.3%    65.3%       2.659s       2.66e-03s   1000    33   GpuAllocEmpty(Shape_i{0}.0, Shape_i{1}.0, TensorConstant{12}, TensorConstant{12})
   3.5%    68.8%       2.142s       2.14e-03s   1000   108   GpuDnnConvGradI{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
   2.8%    71.6%       1.730s       1.73e-03s   1000    64   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
   2.3%    73.9%       1.438s       1.44e-03s   1000    53   GpuElemwise{Add}[(0, 0)](GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
   2.2%    76.1%       1.365s       1.36e-03s   1000    47   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
   2.0%    78.1%       1.256s       1.26e-03s   1000   102   GpuDnnPoolGrad{mode='max'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
   1.9%    80.1%       1.188s       1.19e-03s   1000   115   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnPoolGrad{mode='max'}.0, GpuElemwise{Add}[(0, 0)].0)
   1.5%    81.5%       0.907s       9.07e-04s   1000   119   GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.01}, Constant{0.0})
   1.5%    83.0%       0.906s       9.06e-04s   1000    62   GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
   1.4%    84.4%       0.891s       8.91e-04s   1000   116   GpuCAReduce{add}{1,0,1,1}(GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)].0)
   1.4%    85.8%       0.845s       8.45e-04s   1000   107   GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.01}, Constant{0.0})
   1.2%    87.0%       0.722s       7.22e-04s   1000     8   GpuFromHost(inputs)
   1.2%    88.1%       0.720s       7.20e-04s   1000    70   GpuDot22(GpuReshape{2}.0, W)
   1.0%    89.1%       0.625s       6.25e-04s   1000    50   GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
   1.0%    90.1%       0.595s       5.95e-04s   1000    67   GpuElemwise{mul,no_inplace}(CudaNdarrayConstant{[[[[ 2.]]]]}, GpuDnnPool{mode='max'}.0, GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)].0)
   0.9%    91.0%       0.565s       5.65e-04s   1000    91   GpuDot22Scalar(GpuDimShuffle{1,0}.0, GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 0)].0, TensorConstant{0.00999999977648})
   ... (remaining 104 Apply instances account for 8.99%(5.56s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
==================
  Message: mnist.py:286
  Time in 220 calls to Function.__call__: 6.842000e+00s
  Time in Function.fn.__call__: 6.810000e+00s (99.532%)
  Time in thunks: 4.580976e+00s (66.954%)
  Total compile time: 4.219999e-01s
    Number of Apply nodes: 60
    Theano Optimizer time: 3.130000e-01s
       Theano validate time: 1.600003e-02s
    Theano Linker time (includes C, CUDA code generation/compiling): 7.800007e-02s
       Import time 0.000000e+00s

Time in all call to theano.grad() 3.100014e-02s
Time since theano import 113.055s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  57.5%    57.5%       2.635s       5.99e-03s     C      440       2   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
  19.8%    77.4%       0.909s       2.07e-03s     C      440       2   theano.sandbox.cuda.dnn.GpuDnnPool
   8.2%    85.5%       0.375s       3.41e-04s     C     1100       5   theano.sandbox.cuda.basic_ops.GpuElemwise
   4.6%    90.1%       0.209s       3.17e-04s     C      660       3   theano.sandbox.cuda.basic_ops.GpuFromHost
   4.1%    94.2%       0.188s       4.26e-04s     C      440       2   theano.sandbox.cuda.dnn.GpuDnnConv
   3.8%    98.0%       0.172s       3.91e-04s     C      440       2   theano.sandbox.cuda.blas.GpuDot22
   0.7%    98.6%       0.031s       1.42e-04s     C      220       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
   0.3%    99.0%       0.016s       7.10e-05s     C      220       1   theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
   0.3%    99.3%       0.016s       1.78e-05s     C      880       4   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.3%    99.7%       0.016s       3.55e-05s     C      440       2   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.3%   100.0%       0.016s       7.10e-05s     C      220       1   theano.tensor.basic.MaxAndArgmax
   0.0%   100.0%       0.000s       0.00e+00s     C     2640      12   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       0.00e+00s     C     1760       8   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.000s       0.00e+00s     C     1320       6   theano.sandbox.cuda.basic_ops.GpuContiguous
   0.0%   100.0%       0.000s       0.00e+00s     C     1100       5   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.000s       0.00e+00s     C      440       2   theano.sandbox.cuda.dnn.GpuDnnConvDesc
   0.0%   100.0%       0.000s       0.00e+00s     C      220       1   theano.sandbox.cuda.basic_ops.GpuReshape
   0.0%   100.0%       0.000s       0.00e+00s     C      220       1   theano.tensor.elemwise.Sum
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  57.5%    57.5%       2.635s       5.99e-03s     C      440        2   GpuAllocEmpty
  19.8%    77.4%       0.909s       2.07e-03s     C      440        2   GpuDnnPool{mode='max'}
   7.5%    84.9%       0.344s       5.21e-04s     C      660        3   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)]
   4.6%    89.4%       0.209s       3.17e-04s     C      660        3   GpuFromHost
   4.1%    93.5%       0.188s       4.26e-04s     C      440        2   GpuDnnConv{algo='small', inplace=True}
   3.8%    97.3%       0.172s       3.91e-04s     C      440        2   GpuDot22
   0.7%    98.0%       0.031s       1.42e-04s     C      220        1   GpuCAReduce{add}{1}
   0.7%    98.6%       0.031s       1.42e-04s     C      220        1   GpuElemwise{Add}[(0, 0)]
   0.3%    99.0%       0.016s       7.10e-05s     C      220        1   GpuCrossentropySoftmaxArgmax1HotWithBias
   0.3%    99.3%       0.016s       3.55e-05s     C      440        2   HostFromGpu
   0.3%    99.7%       0.016s       3.55e-05s     C      440        2   GpuDimShuffle{x,0}
   0.3%   100.0%       0.016s       7.10e-05s     C      220        1   MaxAndArgmax
   0.0%   100.0%       0.000s       0.00e+00s     C     1320        6   GpuContiguous
   0.0%   100.0%       0.000s       0.00e+00s     C     1100        5   MakeVector{dtype='int64'}
   0.0%   100.0%       0.000s       0.00e+00s     C      660        3   Shape_i{0}
   0.0%   100.0%       0.000s       0.00e+00s     C      660        3   Shape_i{3}
   0.0%   100.0%       0.000s       0.00e+00s     C      660        3   Shape_i{1}
   0.0%   100.0%       0.000s       0.00e+00s     C      660        3   Shape_i{2}
   0.0%   100.0%       0.000s       0.00e+00s     C      440        2   Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)]
   0.0%   100.0%       0.000s       0.00e+00s     C      440        2   GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}
   ... (remaining 9 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  51.7%    51.7%       2.367s       1.08e-02s    220    35   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0)
  18.4%    70.1%       0.845s       3.84e-03s    220    40   GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
   6.8%    76.9%       0.313s       1.42e-03s    220    38   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
   5.8%    82.8%       0.267s       1.22e-03s    220    36   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
   4.6%    87.3%       0.209s       9.50e-04s    220     4   GpuFromHost(inputs)
   3.4%    90.8%       0.156s       7.10e-04s    220    47   GpuDot22(GpuReshape{2}.0, W)
   2.4%    93.1%       0.109s       4.97e-04s    220    37   GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
   1.7%    94.9%       0.078s       3.55e-04s    220    42   GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
   1.4%    96.2%       0.064s       2.91e-04s    220    45   GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
   0.7%    96.9%       0.031s       1.42e-04s    220    52   GpuCAReduce{add}{1}(GpuCrossentropySoftmaxArgmax1HotWithBias.0)
   0.7%    97.6%       0.031s       1.42e-04s    220    51   GpuElemwise{Add}[(0, 0)](GpuDot22.0, GpuDimShuffle{x,0}.0)
   0.3%    98.0%       0.016s       7.10e-05s    220    50   GpuCrossentropySoftmaxArgmax1HotWithBias(GpuDot22.0, b, GpuFromHost.0)
   0.3%    98.3%       0.016s       7.10e-05s    220    56   HostFromGpu(GpuElemwise{TrueDiv}[(0, 0)].0)
   0.3%    98.6%       0.016s       7.10e-05s    220    49   GpuDot22(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0, W)
   0.3%    99.0%       0.016s       7.10e-05s    220    48   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[ 0.5]]}, GpuDot22.0, GpuDimShuffle{x,0}.0)
   0.3%    99.3%       0.016s       7.10e-05s    220     7   GpuDimShuffle{x,0}(b)
   0.3%    99.7%       0.016s       7.10e-05s    220    43   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
   0.3%   100.0%       0.016s       7.10e-05s    220    55   MaxAndArgmax(HostFromGpu.0, TensorConstant{(1L,) of 1})
   0.0%   100.0%       0.000s       0.00e+00s    220    59   Elemwise{TrueDiv}[(0, 0)](Sum{acc_dtype=int64}.0, Elemwise{Composite{Cast{float32}(Switch(EQ(i0, i1), ((i2 * i0) // i0), i0))}}.0)
   0.0%   100.0%       0.000s       0.00e+00s    220    58   Sum{acc_dtype=int64}(Elemwise{eq,no_inplace}.0)
   ... (remaining 40 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.
Function profiling
==================
  Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
  Time in 1220 calls to Function.__call__: 9.741500e+01s
  Time in Function.fn.__call__: 9.721100e+01s (99.791%)
  Time in thunks: 6.642000e+01s (68.183%)
  Total compile time: 2.016000e+00s
    Number of Apply nodes: 124
    Theano Optimizer time: 1.360000e+00s
       Theano validate time: 9.400010e-02s
    Theano Linker time (includes C, CUDA code generation/compiling): 5.780001e-01s
       Import time 3.580000e-01s

Time in all call to theano.grad() 3.100014e-02s
Time since theano import 113.070s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  27.6%    27.6%      18.322s       5.08e-04s     C    36100      40   theano.sandbox.cuda.basic_ops.GpuElemwise
  26.1%    53.7%      17.334s       3.19e-03s     C     5440       7   theano.sandbox.cuda.basic_ops.GpuAllocEmpty
  21.6%    75.2%      14.317s       7.16e-03s     C     2000       2   theano.sandbox.cuda.dnn.GpuDnnPoolGrad
   7.6%    82.8%       5.022s       2.06e-03s     C     2440       4   theano.sandbox.cuda.dnn.GpuDnnPool
   3.2%    86.0%       2.142s       2.14e-03s     C     1000       1   theano.sandbox.cuda.dnn.GpuDnnConvGradI
   2.6%    88.7%       1.752s       8.76e-04s     C     2000       2   theano.sandbox.cuda.dnn.GpuDnnConvGradW
   2.6%    91.2%       1.719s       7.04e-04s     C     2440       4   theano.sandbox.cuda.dnn.GpuDnnConv
   2.6%    93.8%       1.706s       3.84e-04s     C     4440       6   theano.sandbox.cuda.blas.GpuDot22
   2.0%    95.8%       1.329s       2.55e-04s     C     5220       6   theano.sandbox.cuda.basic_ops.GpuCAReduce
   1.4%    97.2%       0.946s       2.59e-04s     C     3660       6   theano.sandbox.cuda.basic_ops.GpuFromHost
   1.2%    98.5%       0.815s       4.07e-04s     C     2000       2   theano.sandbox.cuda.blas.GpuDot22Scalar
   0.8%    99.3%       0.564s       2.82e-04s     C     2000       2   theano.sandbox.rng_mrg.GPU_mrg_uniform
   0.2%    99.5%       0.125s       1.02e-04s     C     1220       2   theano.sandbox.cuda.nnet.GpuCrossentropySoftmaxArgmax1HotWithBias
   0.1%    99.6%       0.094s       9.38e-05s     C     1000       1   theano.sandbox.cuda.nnet.GpuCrossentropySoftmax1HotWithBiasDx
   0.1%    99.8%       0.094s       6.51e-05s     C     1440       3   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.9%       0.047s       8.62e-06s     C     5440       7   theano.sandbox.cuda.dnn.GpuDnnConvDesc
   0.1%    99.9%       0.047s       4.22e-06s     C    11100      15   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.016s       1.76e-06s     C     8880      12   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.016s       1.78e-06s     C     8760      15   theano.tensor.elemwise.Elemwise
   0.0%   100.0%       0.016s       7.10e-05s     C      220       1   theano.tensor.basic.MaxAndArgmax
   ... (remaining 4 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  26.1%    26.1%      17.334s       3.19e-03s     C     5440        7   GpuAllocEmpty
  21.6%    47.7%      14.317s       7.16e-03s     C     2000        2   GpuDnnPoolGrad{mode='max'}
  18.4%    66.0%      12.188s       6.09e-03s     C     2000        2   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}
   7.6%    73.6%       5.022s       2.06e-03s     C     2440        4   GpuDnnPool{mode='max'}
   3.2%    76.8%       2.142s       2.14e-03s     C     1000        1   GpuDnnConvGradI{algo='none', inplace=True}
   2.7%    79.5%       1.793s       5.57e-04s     C     3220        4   GpuElemwise{Add}[(0, 0)]
   2.6%    82.1%       1.752s       8.76e-04s     C     2000        2   GpuDnnConvGradW{algo='none', inplace=True}
   2.6%    84.7%       1.719s       7.04e-04s     C     2440        4   GpuDnnConv{algo='small', inplace=True}
   2.6%    87.3%       1.706s       3.84e-04s     C     4440        6   GpuDot22
   2.1%    89.4%       1.391s       6.96e-04s     C     2000        2   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)]
   1.5%    90.9%       1.016s       5.08e-04s     C     2000        2   GpuCAReduce{add}{1,0,1,1}
   1.4%    92.3%       0.946s       2.59e-04s     C     3660        6   GpuFromHost
   1.2%    93.6%       0.815s       4.07e-04s     C     2000        2   GpuDot22Scalar
   0.9%    94.5%       0.595s       5.95e-04s     C     1000        1   GpuElemwise{mul,no_inplace}
   0.8%    95.3%       0.531s       6.64e-05s     C     8000        8   GpuElemwise{Composite{((i0 * i1) + (i2 - i3))}}[(0, 2)]
   0.8%    96.1%       0.531s       6.64e-05s     C     8000        8   GpuElemwise{Composite{((i0 * i1) - i2)}}[(0, 1)]
   0.6%    96.7%       0.423s       4.23e-04s     C     1000        1   GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}
   0.5%    97.2%       0.344s       5.21e-04s     C      660        3   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)]
   0.5%    97.7%       0.330s       3.30e-04s     C     1000        1   GpuElemwise{Composite{((i0 + Abs(i0)) * i1)},no_inplace}
   0.5%    98.2%       0.306s       6.11e-05s     C     5000        5   GpuElemwise{Mul}[(0, 1)]
   ... (remaining 33 Ops account for   1.84%(1.22s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  19.7%    19.7%      13.061s       1.31e-02s   1000   114   GpuDnnPoolGrad{mode='max'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
  16.0%    35.6%      10.613s       1.06e-02s   1000    46   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0)
  15.7%    51.4%      10.458s       1.05e-02s   1000    56   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
   5.4%    56.8%       3.596s       3.60e-03s   1000    60   GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
   4.0%    60.8%       2.659s       2.66e-03s   1000    33   GpuAllocEmpty(Shape_i{0}.0, Shape_i{1}.0, TensorConstant{12}, TensorConstant{12})
   3.6%    64.4%       2.367s       1.08e-02s    220    35   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0, Elemwise{Composite{((((i0 + i1) - i2) // i3) + i3)}}[(0, 0)].0)
   3.2%    67.6%       2.142s       2.14e-03s   1000   108   GpuDnnConvGradI{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
   2.6%    70.2%       1.730s       1.73e-03s   1000    64   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))},no_inplace}(CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuElemwise{Add}[(0, 0)].0)
   2.2%    72.4%       1.438s       1.44e-03s   1000    53   GpuElemwise{Add}[(0, 0)](GpuDnnConv{algo='small', inplace=True}.0, GpuDimShuffle{x,0,x,x}.0)
   2.1%    74.4%       1.365s       1.36e-03s   1000    47   GpuAllocEmpty(Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0)
   1.9%    76.3%       1.256s       1.26e-03s   1000   102   GpuDnnPoolGrad{mode='max'}(GpuContiguous.0, GpuContiguous.0, GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
   1.8%    78.1%       1.188s       1.19e-03s   1000   115   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuDnnPoolGrad{mode='max'}.0, GpuElemwise{Add}[(0, 0)].0)
   1.4%    79.5%       0.907s       9.07e-04s   1000   119   GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.01}, Constant{0.0})
   1.4%    80.8%       0.906s       9.06e-04s   1000    62   GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
   1.3%    82.2%       0.891s       8.91e-04s   1000   116   GpuCAReduce{add}{1,0,1,1}(GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)].0)
   1.3%    83.4%       0.845s       8.45e-04s   1000   107   GpuDnnConvGradW{algo='none', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{0.01}, Constant{0.0})
   1.3%    84.7%       0.845s       3.84e-03s    220    40   GpuDnnPool{mode='max'}(GpuContiguous.0, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 2}, TensorConstant{(2L,) of 0})
   1.1%    85.8%       0.722s       7.22e-04s   1000     8   GpuFromHost(inputs)
   1.1%    86.9%       0.720s       7.20e-04s   1000    70   GpuDot22(GpuReshape{2}.0, W)
   0.9%    87.8%       0.625s       6.25e-04s   1000    50   GpuDnnConv{algo='small', inplace=True}(GpuContiguous.0, GpuContiguous.0, GpuAllocEmpty.0, GpuDnnConvDesc{border_mode='valid', subsample=(1, 1), conv_mode='conv', precision='float32'}.0, Constant{1.0}, Constant{0.0})
   ... (remaining 164 Apply instances account for 12.17%(8.09s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  Sorry, no tip for today.

@kli-casia
Copy link
Author

Hey guys, I aggressively change cnmem=0.8,

[global]
floatX = float32
device = gpu2

[dnn.conv]
algo_fwd = time_once
algo_bwd_data = time_once
algo_bwd_filter = time_once

[lib]
cnmem = .8

The time one epoch went down to 76 seconds!
This is crazy!

@f0k
Copy link
Member

f0k commented May 19, 2016

this is the profile output for mnist.py using the flags you suggested:

Interesting. The first function profile is the one for the training function. It seems that not only GpuAllocEmpty is the culprit: it only accounts for 24% of the time, while you see a speedup of 95%. It rather looks like all Ops that do some allocation are affected. I can't reproduce this on my machine, so maybe this is specific to Windows.
@nouiz: Are there still plans to make lib.cnmem=.45 the default? It seems this is really critical on Windows.

Hey guys, I aggressively change cnmem=0.8,
The time one epoch went down to 76 seconds!

This should only make a difference for the very first epoch. If you set it to a smaller fraction of memory than ultimately needed, it will have to do extra allocations when it hits the boundary of reserved memory. After the first epoch, the memory pool should be large enough for all subsequent epochs so there won't be any more allocations (from the perspective of the driver).
If you set it to a larger value than ultimately needed, your process blocks some GPU memory that could have been used by other processes (to run multiple experiments at the same time, on the same device).

You can try to add allow_gc = 0 to the [global] section to see if it makes things a little bit faster still.

@nouiz
Copy link

nouiz commented May 19, 2016

The new back-end have another allocator that is enabled and preallocate 0
by default, but don't release the memory to the driver by default.

So the transition should fix that. There is a flag to make it reserve more
memory at the start.

On Thu, May 19, 2016 at 12:44 PM, Jan Schlüter [email protected]
wrote:

this is the profile output for mnist.py using the flags you suggested:

Interesting. The first function profile is the one for the training
function. It seems that not only GpuAllocEmpty is the culprit: it only
accounts for 24% of the time, while you see a speedup of 95%. It rather
looks like all Ops that do some allocation are affected. I can't
reproduce this on my machine, so maybe this is specific to Windows.
@nouiz https://github.com/nouiz: Are there still plans to make
lib.cnmem=.45 the default? It seems this is really critical on Windows.

Hey guys, I aggressively change cnmem=0.8,
The time one epoch went down to 76 seconds!

This should only make a difference for the very first epoch. If you set it
to a smaller fraction of memory than ultimately needed, it will have to do
extra allocations when it hits the boundary of reserved memory. After the
first epoch, the memory pool should be large enough for all subsequent
epochs so there won't be any more allocations (from the perspective of the
driver).
If you set it to a larger value than ultimately needed, your process
blocks some GPU memory that could have been used by other processes (to run
multiple experiments at the same time, on the same device).

You can try to add allow_gc=0 to the [global] section to see if it makes
things a little bit faster still.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#57 (comment)

@kli-casia
Copy link
Author

add allow_gc = 0 to the [global] section can still speed up a bit.

[global]
allow_gc = 0
floatX = float32
device = gpu2

[dnn.conv]
algo_fwd = time_once
algo_bwd_data = time_once
algo_bwd_filter = time_once

[lib]
cnmem = .45

Now the time for one epoch went down to 74 seconds!
Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants