
[BUGFIX] Reenable fwd conv engine 5 on test_group_conv2d_16c #21104

Merged
merged 6 commits into from
Aug 4, 2022

Conversation

DickJC123
Contributor

@DickJC123 DickJC123 commented Jul 22, 2022

Description

PR #20635, which began using the cuDNN v8 backend API for Convolution ops, includes the following line to avoid test_gluon_gpu.py::test_group_conv2d_16c failures that began occurring coincident with the PR:

@with_environment('MXNET_CUDNN_DISABLED_CONV_FWD_ENGINES', '5')  # eng:5 causes test failure on M60
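For context, this environment variable holds a list of forward-convolution engine indices to exclude. A minimal sketch (not MXNet's actual implementation; the helper name and call site are hypothetical) of how such a variable might be parsed and consulted when enumerating engines:

```cpp
// Illustrative sketch only: parse a comma-separated list of engine indices from
// an environment variable and skip those engines when enumerating candidates.
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>
#include <unordered_set>

std::unordered_set<int64_t> DisabledEngines(const char* env_var) {
  std::unordered_set<int64_t> disabled;
  if (const char* val = std::getenv(env_var)) {
    std::stringstream ss(val);
    std::string idx;
    while (std::getline(ss, idx, ','))
      if (!idx.empty()) disabled.insert(std::stoll(idx));
  }
  return disabled;
}

// Hypothetical call site during engine enumeration:
//   auto disabled = DisabledEngines("MXNET_CUDNN_DISABLED_CONV_FWD_ENGINES");
//   if (disabled.count(engine_global_index)) continue;  // skip this engine
```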

This PR will remove that line by providing a different implementation of the "convolution plan cache" introduced with PR #20635 that is compatible with convolution engine 5. The steps of this PR will be:

  1. Reenable convolution engine 5, and demonstrate a return of test_group_conv2d_16c failures, then
  2. Add the upgrade to the convolution plan cache so that test_group_conv2d_16c passes even with engine 5 use

Further detail:

The cuDNN v8 backend allows one to bypass a lot of CPU processing that might precede kernel launch by first building up and finalizing a convolution execution plan. The plan, which includes a choice of convolution engine and 'knob settings', is then executed efficiently by the call cudnnBackendExecute(cudnn_handle, plan, ...). PR #20635 introduced a cache of plans so that autotuning does not need to be repeated for identically-parameterized convolutions, which are then handled by the same plan even if they exist multiple times in a model or are handled by different GPU workers.
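A minimal sketch of that idea (hypothetical names such as ConvParamKey, AutotunePlan, and GetOrCreatePlan; not MXNet's actual implementation): the cache is keyed only on the convolution parameters, and the cached, finalized plan is then run through cudnnBackendExecute():

```cpp
// Sketch of a convolution plan cache: identically-parameterized convolutions map
// to the same finalized execution plan, so autotuning runs only on the first lookup.
#include <cudnn.h>
#include <map>
#include <mutex>

struct ConvParamKey { /* shapes, dtypes, strides, pads, dilations, group count, ... */ };
bool operator<(const ConvParamKey&, const ConvParamKey&);

// Runs autotuning and returns a finalized execution plan (declaration only).
cudnnBackendDescriptor_t AutotunePlan(cudnnHandle_t handle, const ConvParamKey& key);

cudnnBackendDescriptor_t GetOrCreatePlan(cudnnHandle_t handle, const ConvParamKey& key) {
  static std::map<ConvParamKey, cudnnBackendDescriptor_t> cache;
  static std::mutex mtx;
  std::lock_guard<std::mutex> lock(mtx);
  auto it = cache.find(key);
  if (it == cache.end())
    it = cache.emplace(key, AutotunePlan(handle, key)).first;
  return it->second;
}

// Execution reuses the cached plan; the handle (and thus the stream) of the
// calling GPU worker is only supplied at execute time:
//   cudnnBackendExecute(worker_handle, GetOrCreatePlan(worker_handle, key), variant_pack);
```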

The issue that was discovered for convolution engine 5 is that it caches a cuDNN handle provided during the plan's construction, and does not consider the handle passed as an argument of cudnnBackendExecute(). The result is that the engine's kernels are launched into the stream of the cached handle, and this would be the incorrect stream if the GPU worker executing the plan is different from the one that created the plan. Without the proper stream synchronization, incorrect results may follow.

The contribution of this PR is to effectively include a GPU worker's cudnn handle as part of the key used in the cache lookup. A further aspect of the fix is that, on a plan cache miss, a plan made by a different worker for the same convolution can be 'cloned' with the proper handle without repeating the autotuning.
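A minimal sketch of the upgraded cache (again with hypothetical names such as CachedPlan and ClonePlanForHandle; not the PR's exact code): the worker's cudnnHandle_t becomes part of the lookup key, and on a miss the engine configuration found by another worker's autotuning is reused to build ('clone') a plan for the new handle:

```cpp
// Sketch: the handle is part of the cache key, so a plan is only reused by the
// worker that built it; autotuning results are still shared across workers.
#include <cudnn.h>
#include <map>
#include <mutex>
#include <utility>

struct ConvParamKey { /* convolution parameters as before */ };
bool operator<(const ConvParamKey&, const ConvParamKey&);

struct CachedPlan {
  cudnnBackendDescriptor_t plan;        // finalized execution plan tied to one handle
  cudnnBackendDescriptor_t engine_cfg;  // engine + knob choices found by autotuning
};

cudnnBackendDescriptor_t AutotunePlan(cudnnHandle_t, const ConvParamKey&,
                                      cudnnBackendDescriptor_t* cfg_out);
cudnnBackendDescriptor_t ClonePlanForHandle(cudnnHandle_t, cudnnBackendDescriptor_t engine_cfg);

cudnnBackendDescriptor_t GetOrCreatePlan(cudnnHandle_t handle, const ConvParamKey& key) {
  static std::map<std::pair<ConvParamKey, cudnnHandle_t>, CachedPlan> cache;
  static std::map<ConvParamKey, cudnnBackendDescriptor_t> tuned_cfgs;  // shared results
  static std::mutex mtx;
  std::lock_guard<std::mutex> lock(mtx);

  auto it = cache.find({key, handle});
  if (it != cache.end()) return it->second.plan;

  CachedPlan entry{};
  auto cfg_it = tuned_cfgs.find(key);
  if (cfg_it != tuned_cfgs.end()) {
    // Another worker already autotuned this convolution: clone its plan with our handle.
    entry.engine_cfg = cfg_it->second;
    entry.plan = ClonePlanForHandle(handle, entry.engine_cfg);
  } else {
    entry.plan = AutotunePlan(handle, key, &entry.engine_cfg);
    tuned_cfgs.emplace(key, entry.engine_cfg);
  }
  return cache.emplace(std::make_pair(key, handle), entry).first->second.plan;
}
```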

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

@mxnet-bot

Hey @DickJC123 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, unix-cpu, miscellaneous, sanity, centos-cpu, edge, centos-gpu, website, windows-gpu, windows-cpu, unix-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added the pr-work-in-progress PR is still work in progress label Jul 22, 2022
@DickJC123
Contributor Author

Step 1 is complete: with engine 5 reenabled, test_group_conv2d_16c fails in https://jenkins.mxnet-ci.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-21104/1/pipeline/142.

[2022-07-22T05:08:33.227Z] >       raise AssertionError(msg)
[2022-07-22T05:08:33.227Z] E       AssertionError: 
[2022-07-22T05:08:33.227Z] E       Items are not equal:
[2022-07-22T05:08:33.227Z] E       Error 92150.218750 exceeds tolerance rtol=1.000000e-05, atol=1.000000e-06 (mismatch at least 0.012555%).
[2022-07-22T05:08:33.227Z] E       Location of maximum error: (3, 13, 36, 16), a=-0.01365239, b=1.00000000

@DickJC123
Contributor Author

Hey @mk-61, since you supplied the original cuDNN v8 backend support, I think you're the best person to appreciate the intent of this PR and approve it. Could you do a quick review? Thanks!

I have diagnosed the existing failure of the windows-gpu job as being unrelated to this PR. I'm exploring fixes for that problem in the separate PR #21107, which I intend to merge to master and then into this PR before final acceptance.

@mk-61
Contributor

mk-61 commented Jul 29, 2022

LGTM

@DickJC123 DickJC123 requested a review from ptrendx August 1, 2022 06:46
@DickJC123 DickJC123 requested a review from szha August 4, 2022 02:59
@DickJC123 DickJC123 changed the title [BUGFIX] [WIP] Reenable fwd conv engine 5 on test_group_conv2d_16c [BUGFIX] Reenable fwd conv engine 5 on test_group_conv2d_16c Aug 4, 2022
Member

@ptrendx ptrendx left a comment


LGTM, thanks for the fix

@ptrendx ptrendx merged commit 9975ab4 into apache:master Aug 4, 2022