[MXNET-105] Fix CuDNN performance after code refactor #10116

Merged
merged 25 commits into apache:master from zheng-da:fix_cudnn_perf
Mar 22, 2018

Conversation

zheng-da
Contributor

@zheng-da zheng-da commented Mar 14, 2018

Description

This PR tries to fix the performance degradation reported in #9874.

We observed about a 4% performance decrease. Multiple factors contribute to it:

  • The first one is that the refactored code passes more arrays to the backward of BatchNorm.
  • The second one is that the refactored code needs to reinitialize the CuDNN states in every forward and backward.
  • The third one is that the refactored code leads to more memory allocation (e.g., creating std::vector) in forward and backward. (However, the test shows that this doesn't cause much of the performance decrease.)

This PR tries to reduce these overheads.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@zheng-da zheng-da requested a review from cjolivier01 as a code owner March 14, 2018 20:33
for (uint32_t i = 0; i < out_data.size(); ++i) {
out_data[i] = nnvm::NodeEntry{n, i, 0};
}
std::vector<nnvm::NodeEntry> heads;
Member

Please use reserve()

Contributor Author

This code runs to build the computation graph. It only runs once. Do we still need to call reserve()?

Member

yes, please
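
For reference, a minimal sketch of what the reviewer is asking for, assuming out_data holds one entry per output of node n and that heads is filled by push_back afterwards (both assumptions, not the exact PR code):

// Sketch only: reserve capacity up front so the vectors never reallocate
// while being filled.
std::vector<nnvm::NodeEntry> out_data;
out_data.reserve(n->num_outputs());
for (uint32_t i = 0; i < n->num_outputs(); ++i) {
  out_data.emplace_back(nnvm::NodeEntry{n, i, 0});
}

std::vector<nnvm::NodeEntry> heads;
heads.reserve(out_data.size());  // adjust to the number of entries pushed later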

// add all the auxiliary data
//for (uint32_t i = 0; i < prop.aux_states.size(); ++i) {
// inputs.emplace_back(ptr->inputs[i + prop.arguments.size()]);
//}
Member

?

std::vector<TBlob> in_data(3);
in_data[batchnorm::kData] = inputs[3];
in_data[batchnorm::kGamma] = inputs[4];
std::vector<TBlob> aux_states(2);
Member

What happens to aux states (running mean and variance)?

Contributor Author

The CuDNN version doesn't need aux_states, but the CUDA version does, so aux_states is set properly to run the CUDA code.
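
To illustrate the reply (a rough sketch; the index positions follow the quoted snippets, but the exact layout in the PR may differ): the fused backward receives one flat inputs array, and only the non-CuDNN kernels read the running statistics through aux_states, so those slots are filled even though the CuDNN path ignores them.

// Illustrative sketch: split the flat input list of BatchNorm backward into
// the pieces the legacy CUDA/CPU kernel expects.
std::vector<TBlob> in_data(3);
in_data[batchnorm::kData]  = inputs[3];   // x
in_data[batchnorm::kGamma] = inputs[4];   // gamma

std::vector<TBlob> aux_states(2);
aux_states[batchnorm::kMovingMean] = inputs[6];  // running mean
aux_states[batchnorm::kMovingVar]  = inputs[7];  // running variance
// The CuDNN kernel keeps its own saved mean/inv-variance and never reads
// aux_states; the plain CUDA/CPU kernel does, hence the assignments above.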

Member

So, this fix only improves the CuDNN operator? Wouldn't we expect the other non-CuDNN operators to also be slower by the same amount?

@cjolivier01
Member

This needs a JIRA ticket

inputs.begin() + out_data_start);
std::vector<NDArray> out_data(inputs.begin() + out_data_start, inputs.end());
std::vector<NDArray> in_grad(outputs.begin(), outputs.begin() + 3);
static thread_local std::vector<NDArray> out_grad(1);
Member

How does thread_local help here?

Member

@cjolivier01 cjolivier01 Mar 14, 2018

Won't these hold a reference to the NDArray's Chunk data indefinitely?

Contributor Author

Here I'm trying to avoid memory allocation for std::vector.

But you are right, it could potentially cause a memory leak.
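
A small, hypothetical sketch of the concern: a function-local static thread_local vector outlives each call, so any NDArray left in it keeps its underlying Chunk storage alive until the slot is overwritten (or forever, if the thread stops calling the operator).

#include <mxnet/ndarray.h>
#include <vector>

void BackwardSketch(const std::vector<mxnet::NDArray>& inputs) {
  // Reuses the vector across calls to avoid re-allocation, but the NDArray
  // stored here holds a reference to its data between calls.
  static thread_local std::vector<mxnet::NDArray> out_grad(1);
  out_grad[0] = inputs[0];

  // ... run the backward kernel with out_grad ...

  // One way to avoid pinning memory: drop the reference before returning.
  out_grad[0] = mxnet::NDArray();
}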

@zheng-da zheng-da changed the title Fix CuDNN performance after code refactor [MXNET-105] Fix CuDNN performance after code refactor Mar 14, 2018
@zheng-da zheng-da changed the title [MXNET-105] Fix CuDNN performance after code refactor [MXNET-105][WIP] Fix CuDNN performance after code refactor Mar 14, 2018
inputs.begin() + out_data_start);
std::vector<TBlob> out_data(inputs.begin() + out_data_start, inputs.end());
std::vector<TBlob> in_grad(outputs.begin(), outputs.begin() + 3);
static thread_local std::vector<TBlob> out_grad(1);
Contributor

@piiswrong piiswrong Mar 16, 2018

This is probably too many thread_locals.
Why are we copying the vectors in the first place?
Why not change the interface of the operator's Forward/Backward?

Contributor Author

Good question. I'll do that instead.
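
As a rough sketch of the direction suggested here (the signature below is hypothetical, not the interface that was ultimately adopted), the backward entry point could take the executor's vectors by const reference together with an offset, instead of copying slices into fresh std::vectors on every call:

// Hypothetical interface change: pass the full lists plus an offset, and index
// into them directly instead of materializing per-call copies.
void BackwardEx(const OpContext& ctx,
                const std::vector<TBlob>& inputs,   // full list from the executor
                size_t out_data_start,              // where out_data begins in `inputs`
                const std::vector<OpReqType>& req,
                const std::vector<TBlob>& outputs) {
  // e.g. use inputs[out_data_start + k] rather than building
  // out_grad/in_data/out_data vectors for every invocation.
}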

@zheng-da
Contributor Author

zheng-da commented Mar 17, 2018

I measured the performance with (opts) and without (no opts) this PR and compared it with the commit before #8302 (original) in the master branch. I ran each version 20 times and calculated the average and standard deviation. The performance is measured in images/second. This fix gets close to the original version; it's unclear where the remaining perf loss comes from.

        opts      no opts   original
avg     5200.78   5033.98   5251.90
std     43.41     64.05     62.53

The command to run the test:

for i in {1..20}; do
python example/image-classification/train_imagenet.py --benchmark 1 --gpu 0,1,2,3,4,5,6,7 --batch-size 1024 --num-epochs 1 --disp-batches 100 --network resnet-v1 --num-layers 50 --data-nthreads 40 --min-random-scale 0.533 --max-random-shear-ratio 0 --max-random-rotate-angle 0 --max-random-h 0 --max-random-l 0 --max-random-s --dtype float16 --kv-store device
done

@zheng-da zheng-da changed the title [MXNET-105][WIP] Fix CuDNN performance after code refactor [MXNET-105] Fix CuDNN performance after code refactor Mar 21, 2018
@piiswrong
Contributor

@cjolivier01

@piiswrong piiswrong merged commit 46e47cb into apache:master Mar 22, 2018
@zheng-da zheng-da deleted the fix_cudnn_perf branch March 24, 2018 05:51
})
}
#else
aux_states[batchnorm::kMovingMean] = inputs[6];
aux_states[batchnorm::kMovingVar] = inputs[7];
Member

@zheng-da aux_states is not defined if USE_CUDNN is not enabled. @marcoabreu it seems there is no pure CUDA CI environment that is not built with cuDNN.

Contributor Author

I see. I'll update it.
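
A minimal sketch of the kind of fix implied here (the guard condition and structure are assumptions, not the exact PR code): declare aux_states outside the cuDNN-specific branch so the non-cuDNN build still compiles.

// Sketch only: keep the declaration visible to both branches.
std::vector<TBlob> aux_states(2);
#if MXNET_USE_CUDNN == 1 && CUDNN_MAJOR >= 5
// CuDNN path: the entries stay default-constructed; the cuDNN handle tracks
// the statistics itself.
#else
// Plain CUDA/CPU path: the kernel reads the running statistics from aux_states.
aux_states[batchnorm::kMovingMean] = inputs[6];
aux_states[batchnorm::kMovingVar]  = inputs[7];
#endif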

Contributor Author

Agree. @marcoabreu could you add a CI build with CUDA only?

Contributor

@marcoabreu marcoabreu Mar 27, 2018

Sure, no problem at all! Compilation only or do we need tests as well?

Contributor Author

I think it's better to run the code at least once. We probably don't need to try both Python 2 and Python 3, something like that.

Contributor

Done: #10281

ashokei pushed a commit to ashokei/incubator-mxnet that referenced this pull request Mar 27, 2018
* Reduce #inputs/outputs of batchnorm backward.

* Pass more arrays to BN.

* Make std::vector thread local.

* Set inputs of BN backward for other cases.

* Fix for other cases.

* remove commented code.

* fix a potential mem leak.

* Fix a compile error in mkldnn.

* Fix an error.

* reserve space for std::vector.

* Fix alignment.

* Fix cpp unit test.

* Fix BN CPP unit tests.

* Fix a compile error.

* Fix compilation error.

* Move Op signature.

* Cache CuDNN conv op.

* Fix compile error.

* Fix compile error.

* Remove thread_local.

* Reduce mem alloc when caching cudnn conv.

* Fix a lint error.

* Cache CuDNN deconv.

* Fix lint error.
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da added a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018