
Back port optimization to broadcast_axis to MXNet1.x #18773

Merged
merged 2 commits into apache:v1.x on Jul 28, 2020

Conversation

access2rohit
Contributor

@access2rohit access2rohit commented Jul 23, 2020

Description

Back port of the broadcast_axis optimizations for CPU (#17882) and GPU (#18168) to MXNet 1.7.x.
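
For context, `broadcast_axis` replicates an array along axes whose size is 1. A minimal usage sketch with the MXNet 1.x NDArray API (the shapes here are illustrative, not from this PR):

```python
import mxnet as mx

# broadcast_axis replicates an array along an axis whose size is 1.
x = mx.nd.array([[1], [2], [3]])              # shape (3, 1)
y = mx.nd.broadcast_axis(x, axis=1, size=4)   # shape (3, 4)
print(y.asnumpy())
# [[1. 1. 1. 1.]
#  [2. 2. 2. 2.]
#  [3. 3. 3. 3.]]
```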

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@mxnet-bot

Hey @access2rohit, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [sanity, unix-cpu, windows-cpu, centos-gpu, clang, website, miscellaneous, centos-cpu, windows-gpu, unix-gpu, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@access2rohit
Contributor Author

@leezu @sandeep-krishnamurthy can you help merge? These are cherry-picked from master (already merged there).

@access2rohit access2rohit changed the base branch from v1.7.x to v1.x July 23, 2020 02:13
@access2rohit access2rohit changed the title Back port optimization to broadcast_axis to MXNet1.7.x Back port optimization to broadcast_axis to MXNet1.x Jul 23, 2020
@ChaiBapchya
Contributor

Looks good since it's a cherry-pick.
For the functionality check, the broadcast_axis-specific tests in unix-cpu should verify that on CI.
For the performance check, do you mind pasting comparisons with and without this change on the 1.x branch, just to be sure these perf gains hold for 1.x as well?

@access2rohit
Contributor Author

> Looks good since it's a cherry-pick.
> For the functionality check, the broadcast_axis-specific tests in unix-cpu should verify that on CI.
> For the performance check, do you mind pasting comparisons with and without this change on the 1.x branch, just to be sure these perf gains hold for 1.x as well?

@ChaiBapchya
the results are identical to the ones I posted on my original PR. These results haven't changed over time. Do you still want additional runs?

@ChaiBapchya
Contributor

ChaiBapchya commented Jul 24, 2020

That PR was against master [2.0]. This PR is for the 1.x branch [which forked off from master a while ago]. Running perf before and after this PR and getting the numbers would ensure we do our due diligence before merging.

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Jul 26, 2020
@access2rohit
Contributor Author

| Code | SSD 1 epoch time (sec) | % speedup/slowdown w.r.t. Master (large tensor disabled) |
| --- | --- | --- |
| Master (large tensor disabled) | 231 | 0% (baseline) |
| Master (large tensor enabled) | 378 | 63.23% slowdown |
| Master + CPU-optimized broadcast_axis (large tensor disabled) | 237 | 2.6% slowdown |
| Master + CPU-optimized broadcast_axis (large tensor enabled) | 234 | 1.3% slowdown |
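
The numbers above come from full SSD training runs. As a lighter-weight sanity check, a micro-benchmark of the operator itself might look like the following sketch (the shapes and iteration counts are illustrative, not the setup used for the table; each candidate MXNet build would be benchmarked separately):

```python
import time
import mxnet as mx

def bench_broadcast_axis(shape=(1, 512, 512), axis=0, size=32, iters=50):
    """Average wall-clock time per broadcast_axis call, in seconds."""
    x = mx.nd.random.uniform(shape=shape)
    mx.nd.waitall()                 # finish setup before timing
    start = time.time()
    for _ in range(iters):
        y = mx.nd.broadcast_axis(x, axis=axis, size=size)
    mx.nd.waitall()                 # NDArray ops are async; wait for all
    return (time.time() - start) / iters

# Run once per MXNet build (before/after the backport) and compare.
print(f"avg time per call: {bench_broadcast_axis():.6f} s")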

access2rohit and others added 2 commits July 28, 2020 17:17
* adding separate int32_t kernel for GPU in broadcast_axis/to/like operators

* using structure instead of temp workspace to pass stride and shape

* replacing hardcoded int32_t with generic index_t

* combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
* adding comments explaining code optimizations

* fixing broadcast_axis kernel to int32

* fixing slice_axis kernel to int32

* combining CPU and GPU implementation method signatures and cleaned up code

* adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
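
To illustrate the idea referenced in the commit messages above (caching the stride calculation so each output element maps directly to its input element, with no temporary workspace), here is a NumPy sketch of the technique. This is not the actual MXNet C++ kernel; `make_broadcast_strides` and `broadcast_axis_ref` are illustrative names:

```python
import numpy as np

def make_broadcast_strides(in_shape, out_shape):
    """Precompute input strides aligned to the output shape.
    Broadcast axes (input size 1) get stride 0, so every output
    element along them maps to the same input element."""
    in_strides = [0] * len(out_shape)
    stride = 1
    for i in reversed(range(len(in_shape))):
        if in_shape[i] != 1:
            in_strides[i] = stride
        stride *= in_shape[i]
    return in_strides

def broadcast_axis_ref(x, out_shape):
    """Reference elementwise kernel: map each flat output index to a
    flat input index using the cached strides."""
    in_strides = make_broadcast_strides(x.shape, out_shape)
    out = np.empty(out_shape, dtype=x.dtype)
    flat_in = x.ravel()
    for out_idx in range(out.size):
        rem, in_idx = out_idx, 0
        for dim, stride in zip(reversed(out_shape), reversed(in_strides)):
            rem, coord = divmod(rem, dim)
            in_idx += coord * stride
        out.flat[out_idx] = flat_in[in_idx]
    return out

x = np.array([[1], [2], [3]])
print(broadcast_axis_ref(x, (3, 4)))
```

In the actual kernels, computing these strides once and passing them in a small structure is what replaces the temporary-workspace allocation the commit messages mention.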
@ChaiBapchya
Contributor

Why is it master? Shouldn't it be v1.x?

@access2rohit
Contributor Author

access2rohit commented Jul 28, 2020

> Why is it master? Shouldn't it be v1.x?

1.x is the master branch for all future 1.x releases.

Contributor

@ChaiBapchya ChaiBapchya left a comment


Thanks.

@szha szha merged commit 7bef9cb into apache:v1.x Jul 28, 2020