[BUGFIX] Reenable fwd conv engine 5 on test_group_conv2d_16c #21104
Conversation
Hey @DickJC123, thanks for submitting the PR.
CI supported jobs: [clang, unix-cpu, miscellaneous, sanity, centos-cpu, edge, centos-gpu, website, windows-gpu, windows-cpu, unix-gpu]
Step 1 is complete: with engine 5 reenabled, test_group_conv2d_16c fails in https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-21104/1/pipeline/142.
Hey @mk-61, I think you're the best one to appreciate the intent of this PR and approve it, having supplied the original cuDNN v8 backend support. Could you do a quick review? Thanks! I have diagnosed the existing windows-gpu job failure as unrelated to this PR. I'm exploring fixes for that problem in separate PR #21107, which I intend to merge to master and then into this PR before final acceptance.
LGTM
LGTM, thanks for the fix
Description
PR #20635, which began using the cuDNN v8 backend API for Convolution ops, includes the following line to avoid test_gluon_gpu.py::test_group_conv2d_16c failures that began occurring coincident with that PR:

This PR will remove that line by providing a different implementation of the "convolution plan cache" introduced with PR #20635 that is compatible with convolution engine 5. The steps of this PR will be:

1. Reenable forward convolution engine 5 and demonstrate the resulting test_group_conv2d_16c failures, then
2. Add the reworked plan cache and show that test_group_conv2d_16c passes even with engine 5 in use.

Further detail:
The cuDNN v8 backend allows one to bypass a lot of CPU processing that might precede kernel launch by first building up and finalizing a convolution execution plan. The plan, which includes a choice of convolution engine and 'knob settings', is then executed efficiently by the call cudnnBackendExecute(cudnn_handle, plan, ...). PR #20635 introduced a cache of plans so that autotuning does not need to be repeated for identically parameterized convolutions, which are then handled by the same plan even if they appear multiple times in a model or are handled by different GPU workers.

The issue discovered for convolution engine 5 is that it caches the cuDNN handle provided during the plan's construction and does not consider the handle passed as an argument to cudnnBackendExecute(). As a result, the engine's kernels are launched into the stream of the cached handle, which is the wrong stream whenever the GPU worker executing the plan differs from the one that created it. Without proper stream synchronization, incorrect results may follow.
The contribution of this PR is to effectively include a GPU worker's cuDNN handle as part of the key used in the plan-cache lookup. A further aspect of the fix is that, on a plan-cache miss, a plan made by a different worker for the same convolution can be 'cloned' under the proper handle without repeating the autotuning.
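A rough sketch of that idea, using hypothetical names (ConvKey, PlanCache, BuildPlanWithHandle, and Autotune are illustrative stand-ins, not MXNet's actual classes), assuming a map keyed on the (convolution parameters, handle) pair plus a side table of already-tuned engine configs:

```cpp
// Hypothetical plan cache keyed by (convolution key, cuDNN handle).
#include <cudnn.h>
#include <map>
#include <mutex>
#include <utility>

struct ConvKey {  // dtype, shapes, strides, pads, group count, ...
  bool operator<(const ConvKey&) const;  // ordering for std::map
};

struct CachedPlan {
  cudnnBackendDescriptor_t plan;        // finalized execution plan
  cudnnBackendDescriptor_t engine_cfg;  // engine + knob settings chosen by autotuning
};

// Illustrative stand-ins, declared here only to make the sketch self-contained.
cudnnBackendDescriptor_t BuildPlanWithHandle(cudnnHandle_t handle,
                                             cudnnBackendDescriptor_t engine_cfg);
CachedPlan Autotune(const ConvKey& key, cudnnHandle_t handle);

class PlanCache {
 public:
  CachedPlan& Get(const ConvKey& key, cudnnHandle_t handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto full_key = std::make_pair(key, handle);
    auto it = plans_.find(full_key);
    if (it != plans_.end()) return it->second;  // hit: same conv, same worker

    auto cfg_it = engine_cfgs_.find(key);
    if (cfg_it != engine_cfgs_.end()) {
      // Another worker already tuned this convolution: "clone" its plan by
      // re-finalizing the same engine config under the current handle.
      CachedPlan clone{BuildPlanWithHandle(handle, cfg_it->second), cfg_it->second};
      return plans_.emplace(full_key, clone).first->second;
    }

    // First time this convolution is seen anywhere: autotune, then cache.
    CachedPlan fresh = Autotune(key, handle);
    engine_cfgs_.emplace(key, fresh.engine_cfg);
    return plans_.emplace(full_key, fresh).first->second;
  }

 private:
  std::map<std::pair<ConvKey, cudnnHandle_t>, CachedPlan> plans_;
  std::map<ConvKey, cudnnBackendDescriptor_t> engine_cfgs_;
  std::mutex mutex_;
};
```

The point of the side table is that the expensive autotuning happens at most once per convolution signature; a second worker hitting the same convolution only pays the cost of re-finalizing a plan against its own handle.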
Checklist
Essentials