
Speed fused_op compilation by caching ptx and jit-compiled device functions #16783

Merged 1 commit into apache:master on Nov 12, 2019

Conversation

@DickJC123 (Contributor) commented on Nov 12, 2019

Description

This PR speeds up the dynamic nvrtc compilation of fused ops, in response to @rondogency's comment in #15167 (comment). As reported there, the combined runtime of the three mentioned unit tests had grown drastically to 17.5 minutes with fusion enabled. With this PR, that runtime drops to 1 minute; the original runtime with fusion turned off was 30 seconds.

The process of runtime compilation of NVIDIA GPU kernels involves two steps:
- compiling the CUDA code to PTX assembly (performed once per GPU architecture)
- translating the PTX assembly to binary and loading it into a GPU's set of runnable kernels (performed once per GPU device). This latter step produces the CUfunction needed to execute the kernel on the device.
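
To make these two steps concrete, here is a minimal sketch using the nvrtc and CUDA driver APIs. It is illustrative only and not the code in this PR: the file name, kernel name, architecture option, and omitted error handling are simplifications, and a CUDA context is assumed to already be current on the target device.

```cpp
// Sketch of runtime kernel compilation: CUDA source -> PTX -> CUfunction.
// Error handling omitted for brevity; assumes a current CUDA context.
#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

CUfunction CompileAndLoad(const std::string& cuda_src,
                          const std::string& kernel_name,
                          int sm_arch) {            // e.g. 70 for Volta
  // Step 1: compile the CUDA source to PTX (once per GPU architecture).
  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, cuda_src.c_str(), "fused_op.cu", 0, nullptr, nullptr);
  const std::string arch_opt = "--gpu-architecture=compute_" + std::to_string(sm_arch);
  const char* opts[] = {arch_opt.c_str()};
  nvrtcCompileProgram(prog, 1, opts);
  size_t ptx_size = 0;
  nvrtcGetPTXSize(prog, &ptx_size);
  std::vector<char> ptx(ptx_size);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  // Step 2: JIT the PTX to a device binary and load it (once per GPU device),
  // which yields the CUfunction used to launch the kernel.
  CUmodule module;
  CUfunction func;
  cuModuleLoadData(&module, ptx.data());
  cuModuleGetFunction(&func, module, kernel_name.c_str());
  return func;
}
```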

After realizing that the slowed-down unittests were creating many identical fused ops, I added a cache of the PTX and CUfunctions. The cache comprises a mapping (for each GPU arch) from the CUDA source code to the PTX and to any CUfunctions created from it.
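
Roughly, such a cache can be pictured as the structure below. The type and helper names here (CachedKernel, CompileCache, CompileToPtx, LoadPtx, GetOrCompile) are hypothetical and only illustrate the shape of the mapping, not MXNet's actual implementation.

```cpp
// Hypothetical shape of the compile cache (illustrative, not MXNet's code).
#include <cuda.h>
#include <map>
#include <string>
#include <utility>

struct CachedKernel {
  std::string ptx;                      // compiled once per GPU architecture
  std::map<int, CUfunction> functions;  // device id -> JIT-compiled kernel
};

// Keyed by (SM architecture, CUDA source code).
using CompileCache = std::map<std::pair<int, std::string>, CachedKernel>;

// Hypothetical wrappers around the two steps sketched earlier.
std::string CompileToPtx(const std::string& cuda_src, int sm_arch);
CUfunction LoadPtx(const std::string& ptx, const std::string& kernel_name, int dev_id);

CUfunction GetOrCompile(CompileCache& cache, int sm_arch, int dev_id,
                        const std::string& cuda_src,
                        const std::string& kernel_name) {
  CachedKernel& entry = cache[{sm_arch, cuda_src}];
  if (entry.ptx.empty()) {
    entry.ptx = CompileToPtx(cuda_src, sm_arch);             // nvrtc, once per arch
  }
  auto it = entry.functions.find(dev_id);
  if (it == entry.functions.end()) {
    CUfunction f = LoadPtx(entry.ptx, kernel_name, dev_id);  // driver API, once per device
    it = entry.functions.emplace(dev_id, f).first;
  }
  return it->second;
}
```

On a cache hit, both the nvrtc compilation and the PTX-to-binary load are skipped, which is what recovers the unit-test runtime when many identical fused ops are created.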

It's worth remembering that the fusion framework targets the typical scenario of creating a model's graph once and executing it many times. The CI was adversely impacted because it often executes a model's graph just once after creation.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • [x] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • [x] Changes are complete (i.e. I finished coding on this PR)
  • [x] All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • [x] Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@DickJC123 requested a review from @ptrendx on Nov 12, 2019 03:43
@DickJC123 (Contributor, Author) commented:

As reported originally with pointwise fusion enabled:
  • test_operator_gpu.test_sparse_mathematical_core goes from ~13s to ~350s
  • test_operator_gpu.test_lstm_bidirectional goes from ~15s to ~450s
  • test_operator_gpu.test_rnnrelu_bidirectional goes from ~4s to ~250s

As timed on this latest passing CI run for centos-gpu:
  • test_operator_gpu.test_lstm_bidirectional: 39s
  • test_operator_gpu.test_sparse_mathematical_core: 18s
  • test_operator_gpu.test_rnnrelu_bidirectional: 6s

@DickJC123 (Contributor, Author) commented:

Using the centos-gpu unittest runtime now as a metric:

  • Before op fusion: 40 minutes
  • With op fusion (but before this PR): 1 hour 40 minutes
  • With op fusion and this PR to cache compiles: 44 minutes
@larroy @samskalicky

@ptrendx (Member) left a comment:


LGTM

@ptrendx ptrendx merged commit 2c02bff into apache:master Nov 12, 2019
ptrendx added a commit that referenced this pull request Nov 16, 2019
…, #16792) (#16832)

* Fix nightly build (#16773)

* Remove dependency on tvmop.conf

* Fix binaries dependencies for ni nightly

* Add comments

* Update tvmop.py

* Fix rebase

* Fix (#16781)

* Speed fused_op compilation by caching ptx and jit-compiled functions (#16783)

* [Numpy] Fix collect_params().zero_grad() in gluon numpy interface (#16716)

* fix zero_grad

* Update parameter.py

* add test

* fix

* Mixed data type binary ops (#16699)

* support mixed-precision binary operations

* improvement for documentations and error messages

* Support boolean elemwise/broadcast binary add, multiply and true_divide (#16728)

* support pure boolean elemwise/broadcast binary op

* switch to unique_tpr

* fix the test error

* Fix rtrue_divide grad (#16769)

* Fix rtrue_divide_scalar

* More tests

* Fix numpy-compatible mean output type for integer inputs (#16792)

* fix mean output type for integer inputs

* enable for windows
@ptrendx mentioned this pull request Dec 18, 2019