
Fix cudnn Dropout reproducibility #17547

Merged: 12 commits into apache:master on Apr 10, 2020

Conversation

@roywei (Member) commented Feb 7, 2020:

Fix #15662
This will replace #16532 with an alternative solution.
Please refer to the discussion in #16532 (comment)

  1. Added GetSeed() to the GPU random resource, similar to the existing CPU version.
  2. During cudnn dropout forward, check whether the random seed has changed, and reset the cudnn dropout descriptor's seed if it has.

Benefit:

  1. This avoids generating a random int (a tensor on GPU) on the GPU, copying it to the CPU, and passing it to cudnnDropoutGetStatesSize.
  2. Getting the seed from the GPU random resource is fast, so we can afford to do it on every forward (see the sketch below).
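
To make the approach concrete, here is a minimal standalone C++ sketch of the check-and-reset pattern; only GetSeed() and the seed_/reset handshake with get_cudnn_dropout_desc() mirror what this PR adds, while the surrounding class and member names are hypothetical stand-ins:

// Minimal standalone sketch (not MXNet code) of the check-and-reset pattern
// described above. Only GetSeed() and the seed_/reset handshake mirror the PR;
// GpuRandomResource and DropoutOpState are hypothetical stand-ins.
#include <cstdint>
#include <iostream>

struct GpuRandomResource {
  uint64_t seed = 17;  // hypothetical default seed value
  uint64_t GetSeed() const { return seed; }
};

struct DropoutOpState {
  uint64_t seed_ = 0;      // seed the cudnn dropout descriptor was built with
  int reinit_count = 0;    // counts (expensive) descriptor re-initializations

  void Forward(const GpuRandomResource& rnd) {
    const uint64_t rng_seed = rnd.GetSeed();
    const bool reset = (seed_ != rng_seed);  // reset only if the seed changed
    seed_ = rng_seed;
    if (reset) {
      // In the PR this corresponds to get_cudnn_dropout_desc(..., seed_, reset)
      // regenerating the cudnn dropout state buffer.
      ++reinit_count;
    }
  }
};

int main() {
  GpuRandomResource rnd;
  DropoutOpState op;
  op.Forward(rnd);   // first forward: descriptor initialized once
  op.Forward(rnd);   // same seed: no re-initialization
  rnd.seed = 42;     // e.g. the user calls mx.random.seed(42)
  op.Forward(rnd);   // seed changed: descriptor re-initialized
  std::cout << "re-inits: " << op.reinit_count << "\n";  // prints 2
}

The point of passing reset explicitly is that re-initializing with an unchanged seed is cheap, while an actual seed change triggers the expensive cudnn dropout state regeneration.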

Drawback:
This is affected by #7410: by default, MXNet's random seed is fixed.

@roywei (Member, Author) commented Feb 17, 2020:

@DickJC123 @ptrendx could you help take a look again? Thanks

Resolved review threads (now outdated): 3rdparty/mshadow/mshadow/random.h (two threads), src/operator/nn/dropout-inl.h, src/operator/rnn.cc
@karan6181 (Contributor) commented:

@DickJC123 @ptrendx Can you please review this PR once again? Thanks!

@apeforest (Contributor) commented:

@roywei Could you please rebase and update this PR? Thanks!

@ChaiBapchya (Contributor) commented:

@roywei with the CI issue fixed, let's rebase to include the fix. Thanks!

@access2rohit (Contributor) commented:

Restarted the CI jobs. @roywei, can you get your PR merged once they pass CI?

@ChaiBapchya (Contributor) commented:

@access2rohit without rebasing, the fix from #17962 won't be included (for windows-gpu).

@roywei (Member, Author) commented Apr 7, 2020:

@ChaiBapchya @access2rohit Rebased! Thanks!

@@ -1495,7 +1500,7 @@ class RNNOp {
cudnnRNNInputMode_t input_mode_;
cudnnDropoutDescriptor_t dropout_desc_;
Storage::Handle reserve_space_;
uint64_t seed_ = 17 + rand() % 4096; // NOLINT(runtime/threadsafe_fn)
A Contributor commented on this line:

This seed_ is used for rand_r in the forward implementations. That is very bad practice and not portable, and it should be fixed.
As this PR already passes CI, it's fine to fix it in a follow-up PR. In fact, #17984 is blocked by the usage of rand_r, so I may fix it there.

More info: https://channel9.msdn.com/Events/GoingNative/2013/rand-Considered-Harmful
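
For reference, a portable alternative would be to draw the seed from <random> once per object instead of calling rand()/rand_r. The sketch below is only illustrative (RNNOpSeedSketch is a hypothetical stand-in, and this is not necessarily the fix #17984 will adopt):

// Illustrative sketch only (not the actual fix in #17984): seed each object
// once from <random> instead of using rand()/rand_r. RNNOpSeedSketch is a
// hypothetical stand-in for the class holding seed_.
#include <cstdint>
#include <iostream>
#include <random>

struct RNNOpSeedSketch {
  uint64_t seed_;
  RNNOpSeedSketch() {
    std::random_device rd;                              // non-deterministic source
    std::mt19937_64 gen(rd());                          // local engine, no shared global state
    std::uniform_int_distribution<uint64_t> dist(17, 17 + 4095);
    seed_ = dist(gen);                                  // same range as 17 + rand() % 4096
  }
};

int main() {
  RNNOpSeedSketch op;
  std::cout << "seed_ = " << op.seed_ << "\n";
}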

@leezu (Contributor) left a review comment:

Thanks!

@leezu leezu merged commit 249b9a1 into apache:master Apr 10, 2020
Comment on lines +258 to +264
Random<xpu, unsigned> *prnd = ctx.requested[1].get_random<xpu, unsigned>(s);
uint64_t rng_seed = prnd->GetSeed();
// reset dropout descriptor if rng seed changed.
bool reset = seed_ != rng_seed;
seed_ = rng_seed;
ctx.requested[0].get_cudnn_dropout_desc(&dropout_desc_, s, 1.0f - this->pkeep_,
seed_, reset);
A Member commented:

@roywei there is more than one random number generator in the resources (4 by default), and they can have different seeds. The result of rotating random number generator seeds here is that the cudnn dropout state is reinitialized very often. @sxjscience observed a significant performance impact because of this.
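
A toy illustration of the effect (the 4 parallel generators come from the comment above; the seed values and round-robin assignment below are hypothetical): when consecutive forwards see generators with different seeds, the seed_ != rng_seed check fires nearly every call.

// Toy illustration (not MXNet code) of the performance problem described above:
// if successive forwards are handed a different one of the 4 parallel RNGs and
// those RNGs carry different (hypothetical) seeds, the cached-seed check sees a
// "changed" seed almost every time, so the cudnn dropout state is regenerated
// almost every forward.
#include <array>
#include <cstdint>
#include <iostream>

int main() {
  const std::array<uint64_t, 4> rng_seeds = {101, 202, 303, 404};  // hypothetical per-RNG seeds
  uint64_t seed_ = 0;       // seed cached by the dropout operator
  int reinit_count = 0;

  for (int forward = 0; forward < 8; ++forward) {
    const uint64_t rng_seed = rng_seeds[forward % 4];  // round-robin over the 4 RNG resources
    if (seed_ != rng_seed) {
      ++reinit_count;       // expensive cudnn dropout state regeneration
    }
    seed_ = rng_seed;
  }
  std::cout << "re-inits in 8 forwards: " << reinit_count << "\n";  // prints 8
}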

A Contributor commented:

I believe @roywei is already working on this issue, since this regression was caught by our benchmarks last week.

@roywei (Member, Author) commented Apr 23, 2020:

I spent some time looking into it, and the actual problem is that Dropout's forward enters the cudnn dropout descriptor re-init logic on every forward even without my change. This is true for nd/np and Gluon, but not for symbolic Dropout. So it was already re-initializing on every forward; the reason this caused no performance regression before is that it always used the seed_ defined by uint64_t seed_ = 17 + rand() % 4096;, which never changes and does not listen to MXNet's PRNG (the very problem this PR tries to fix). If the seed does not change, re-initializing the cudnn dropout descriptor takes no time. My PR made the seed change, so the descriptor was actually rebuilt on every forward, hence the regression.

However, it works fine in the symbolic case. My guess is that for symbolic execution every forward uses the same Dropout node, so the state check takes effect and the code does not enter the re-init logic, whereas in the imperative case the state is always empty and the code falls into the re-init path.

Compare the following two snippets, one using Gluon and one using Symbol. If I add a log line to the re-initialization path in the original code (without my change), it prints on every forward for ND/NP/Gluon, but not for Symbol.

import mxnet as mx
data = mx.nd.ones((10, 200, 300, 500), ctx=mx.gpu(0))
dropout = mx.gluon.nn.Dropout(0.5)
# with or without hybridize is the same result
dropout.hybridize()
with mx.autograd.record():
    result1 = dropout(data)
    result2 = dropout(result1)

This prints twice:

re-init dropout desc
re-init dropout desc

Symbol:

import mxnet as mx
data = mx.nd.ones((10, 200, 300, 500), ctx=mx.gpu(0))
net = mx.sym.Variable("data")
net = mx.sym.Dropout(data=net, p=0.5, cudnn_off=False)
exe = net.simple_bind(mx.gpu(0), data=data.shape)
result1 = exe.forward(is_train=True, data=data)
result2 = exe.forward(is_train=True, data=result1[0])

This prints once:

re-init dropout desc

Given this situation, I don't have a good solution: checking the state handle size does not work in imperative mode, and checking the PRNG seed also does not work. I will revert this PR for now, as it causes regressions for models that use dropout.

cc @szha @sxjscience @apeforest

A Member commented:

@roywei the links in your comment don't seem to point to what they refer to, so I'm a bit confused. Where did you insert the 're-init dropout desc' log line?

zheyuye added a commit to zheyuye/incubator-mxnet that referenced this pull request Apr 23, 2020
roywei added a commit to roywei/incubator-mxnet that referenced this pull request Apr 23, 2020
@roywei roywei mentioned this pull request Apr 23, 2020
roywei added a commit to roywei/incubator-mxnet that referenced this pull request Apr 23, 2020
leezu pushed a commit that referenced this pull request Apr 23, 2020
* Revert "Fix cudnn Dropout reproducibility (#17547)"

This reverts commit 249b9a1.

* fix conflict
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
* Revert "Fix cudnn Dropout reproducibility (apache#17547)"

This reverts commit 249b9a1.

* fix conflict
chinakook pushed a commit to chinakook/mxnet that referenced this pull request Jul 24, 2020
* Revert "Fix cudnn Dropout reproducibility (apache#17547)"

This reverts commit 249b9a1.

* fix conflict

Successfully merging this pull request may close this issue: cudnn Dropout reproducibility (#15662)
8 participants