[BUGFIX] Reenable fwd conv engine 5 on test_group_conv2d_16c #21104
Conversation
Hey @DickJC123, thanks for submitting the PR.
CI supported jobs: [clang, unix-cpu, miscellaneous, sanity, centos-cpu, edge, centos-gpu, website, windows-gpu, windows-cpu, unix-gpu]
Step 1 is complete: with engine 5 reenabled, test_group_conv2d_16c fails in https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-21104/1/pipeline/142.
Hey @mk-61, I think you're the best one to appreciate the intent of this PR and approve it, having supplied the original cuDNN v8 backend support. Could you do a quick review? Thanks! I have diagnosed the existing windows-gpu job failure as unrelated to this PR. I'm exploring fixes for that problem in separate PR #21107, which I intend to merge to master and then into this PR before final acceptance.
LGTM
LGTM, thanks for the fix
Description
PR #20635, which began using the cuDNN v8 backend API for Convolution ops, includes the following line to avoid test_gluon_gpu.py::test_group_conv2d_16c failures that began occurring coincident with that PR:

This PR will remove that line by providing a different implementation of the "convolution plan cache" introduced with PR #20635 that is compatible with convolution engine 5. The steps of this PR will be:

1. Reenable forward convolution engine 5 and demonstrate the resulting test_group_conv2d_16c failures, then
2. Add the reworked plan cache and show that test_group_conv2d_16c passes even with engine 5 in use.

Further detail:
The cuDNN v8 backend allows one to bypass a lot of CPU processing that might precede kernel launch by first building up and finalizing a convolution execution plan. The plan, which includes a choice of convolution engine and 'knob settings', is then executed efficiently by the call cudnnBackendExecute(cudnn_handle, plan, ...). PR #20635 introduced a cache of plans so that autotuning does not need to be repeated for identically parameterized convolutions, which are then handled by the same plan even if they appear multiple times in a model or are handled by different GPU workers.

The issue discovered for convolution engine 5 is that it caches the cuDNN handle provided during the plan's construction and does not consider the handle passed as an argument to cudnnBackendExecute(). As a result, the engine's kernels are launched into the stream of the cached handle, which is the wrong stream whenever the GPU worker executing the plan differs from the one that created it. Without proper stream synchronization, incorrect results may follow.
The contribution of this PR is to effectively include a GPU worker's cuDNN handle as part of the key used in the plan-cache lookup. A further aspect of the fix is that, on a plan-cache miss, a plan made by a different worker for the same convolution can be 'cloned' under the proper handle without repeating the autotuning.
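A rough sketch of that idea, using hypothetical names (ConvKey, PlanCache, BuildPlanWithHandle, and Autotune are illustrative stand-ins, not MXNet's actual classes), assuming a map keyed on the (convolution parameters, handle) pair plus a side table of already-tuned engine configs:

```cpp
// Hypothetical plan cache keyed by (convolution key, cuDNN handle).
#include <cudnn.h>
#include <map>
#include <mutex>
#include <utility>

struct ConvKey {  // dtype, shapes, strides, pads, group count, ...
  bool operator<(const ConvKey&) const;  // ordering for std::map
};

struct CachedPlan {
  cudnnBackendDescriptor_t plan;        // finalized execution plan
  cudnnBackendDescriptor_t engine_cfg;  // engine + knob settings chosen by autotuning
};

// Illustrative stand-ins, declared here only to make the sketch self-contained.
cudnnBackendDescriptor_t BuildPlanWithHandle(cudnnHandle_t handle,
                                             cudnnBackendDescriptor_t engine_cfg);
CachedPlan Autotune(const ConvKey& key, cudnnHandle_t handle);

class PlanCache {
 public:
  CachedPlan& Get(const ConvKey& key, cudnnHandle_t handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto full_key = std::make_pair(key, handle);
    auto it = plans_.find(full_key);
    if (it != plans_.end()) return it->second;  // hit: same conv, same worker

    auto cfg_it = engine_cfgs_.find(key);
    if (cfg_it != engine_cfgs_.end()) {
      // Another worker already tuned this convolution: "clone" its plan by
      // re-finalizing the same engine config under the current handle.
      CachedPlan clone{BuildPlanWithHandle(handle, cfg_it->second), cfg_it->second};
      return plans_.emplace(full_key, clone).first->second;
    }

    // First time this convolution is seen anywhere: autotune, then cache.
    CachedPlan fresh = Autotune(key, handle);
    engine_cfgs_.emplace(key, fresh.engine_cfg);
    return plans_.emplace(full_key, fresh).first->second;
  }

 private:
  std::map<std::pair<ConvKey, cudnnHandle_t>, CachedPlan> plans_;
  std::map<ConvKey, cudnnBackendDescriptor_t> engine_cfgs_;
  std::mutex mutex_;
};
```

The point of the side table is that the expensive autotuning happens at most once per convolution signature; a second worker hitting the same convolution only pays the cost of re-finalizing a plan against its own handle.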
Checklist
Essentials