This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Improve performance of broadcast_axis on CPU #17882

Merged: 5 commits merged into apache:master from the broadcast_axis_improv branch on Jun 29, 2020

Conversation

access2rohit
Contributor

@access2rohit access2rohit commented Mar 20, 2020

Description

Improves the performance of broadcast_axis by reducing ALU operations and caching stride values. This operator is crucial to the performance of SSD, whose total training time slows down by over 50% when MXNet is built with Large Tensor Support enabled.
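To make the "reducing ALU operations and caching stride values" idea concrete, here is a small hypothetical Python sketch (not the actual MXNet kernel): the naive version recomputes every output element's source index with per-axis div/mod, while the stride-cached version hoists that arithmetic out of the inner loop, leaving a contiguous copy of the trailing axis that a compiler could turn into SIMD loads/stores.

```python
def broadcast_naive(data, ishape, oshape):
    """Recompute the source index of every output element with div/mod."""
    ndim = len(oshape)
    ostrides = [1] * ndim
    istrides = [1] * ndim
    for d in range(ndim - 2, -1, -1):
        ostrides[d] = ostrides[d + 1] * oshape[d + 1]
        istrides[d] = istrides[d + 1] * ishape[d + 1]
    out = []
    for i in range(ostrides[0] * oshape[0]):
        rem, j = i, 0
        for d in range(ndim):              # ALU-heavy: div/mod on every axis
            coord, rem = divmod(rem, ostrides[d])
            j += (0 if ishape[d] == 1 else coord) * istrides[d]
        out.append(data[j])
    return out

def broadcast_cached(data, ishape, oshape):
    """Stride-cached variant (3-D only, trailing axis assumed not broadcast):
    index arithmetic is hoisted out of the inner loop, which becomes a
    contiguous slice copy -- the vectorizable part."""
    istrides = [1] * len(ishape)
    for d in range(len(ishape) - 2, -1, -1):
        istrides[d] = istrides[d + 1] * ishape[d + 1]
    out = []
    for a in range(oshape[0]):
        for b in range(oshape[1]):
            base = ((0 if ishape[0] == 1 else a) * istrides[0] +
                    (0 if ishape[1] == 1 else b) * istrides[1])
            out.extend(data[base:base + oshape[2]])  # adjacent elements
    return out

# Broadcasting shape (2, 1, 3) -> (2, 4, 3): both variants must agree.
data = list(range(6))
assert broadcast_naive(data, (2, 1, 3), (2, 4, 3)) == \
       broadcast_cached(data, (2, 1, 3), (2, 4, 3))
```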

This PR leverages the vectorization (SIMD) capability of the CPU, described as follows:

Vectorization

Vectorization is the process of converting a scalar program into a vector program. A vectorized program can run multiple operations with a single instruction (SIMD), whereas a scalar program operates on only one pair of operands at a time.

Consider the following very simple loop that adds the elements of two arrays and stores the results in a third array.

for (int i=0; i<16; ++i)
    C[i] = A[i] + B[i];

Vectorizing it produces something like this:

for (int i=0; i<16; i+=4)
    C[i:i+3] = A[i:i+3] + B[i:i+3];

Note: The key here is that the elements need to be adjacent so they can be loaded into vector registers. (The AVX-512 ISA does provide gather instructions for gathering sparse data from scattered locations into an AVX-512 register, but that operation is still slow compared to loading adjacent data.) There must also be no loop-carried (read-after-write) dependency between subsequent iterations of the loop. For example, the following loop cannot be vectorized because each C[i] depends on the value of C[i-1], which needs to be calculated first.

for (int i=0; i<16; ++i)
    C[i] = A[i] + C[i-1];

because if we unroll the loop and substitute values, the dependency becomes explicit:

for (int i=0; i<16; i+=4) {
    C[i]   = A[i]   + C[i-1];
    C[i+1] = A[i+1] + C[i];    // needs C[i] from the statement above
    C[i+2] = A[i+2] + C[i+1];  // needs C[i+1]
    C[i+3] = A[i+3] + C[i+2];  // needs C[i+2]
}

These read-after-write dependencies mean the statements above cannot be executed in parallel by loading A[i:i+3] and C[i:i+3] into vector registers. (A span of four elements is used here purely for illustration; depending on the element size and register width, a vector register can hold more elements.)

How is it different from loop unrolling? Unrolling transforms the loop into something that looks like this:

for (int i=0; i<16; i+=4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}
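The difference between the two kinds of loop can be seen at run time with a plain-Python illustration (not SIMD itself, just the ordering constraint): the independent loop can be computed element-wise in any order, or all at once, while the dependent loop is forced into a serial recurrence; here it is exactly a running (prefix) sum.

```python
A = list(range(16))
B = [1] * 16

# Independent iterations: each element can be computed in any order,
# which is what allows a compiler to emit SIMD adds.
C_scalar = [0] * 16
for i in range(16):
    C_scalar[i] = A[i] + B[i]
C_parallel = [a + b for a, b in zip(A, B)]  # order-independent "vector" form
assert C_scalar == C_parallel

# Loop-carried (read-after-write) dependency: iteration i needs the
# result of iteration i-1, so iterations cannot run in parallel.
D = [0] * 16
for i in range(1, 16):
    D[i] = A[i] + D[i - 1]

# The recurrence is exactly a running (prefix) sum of A[1:]:
prefix, total = [], 0
for a in A[1:]:
    total += a
    prefix.append(total)
assert D[1:] == prefix
```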

Side Effects

For cases where one element (or very few elements) needs to be broadcast to a very large number of locations, performance is not optimized; there is a slowdown compared to master (int32, i.e. without Large Tensor Support).

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Performance

Opperf code used for benchmarking:

import mxnet as mx
from mxnet import nd
from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test(nd.broadcast_axis, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{'data': (1000, 1, 100, 1), 'axis': (1, 3), 'size': (10, 5)}],
                               warmup=100, runs=1000, profiler='python')
print(add_res)

add_res = run_performance_test(nd.broadcast_axis, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{'data': (1000, 1, 1, 100), 'axis': (1, 2), 'size': (10, 5)}],
                               warmup=100, runs=1000, profiler='python')
print(add_res)

add_res = run_performance_test(nd.broadcast_axis, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{'data': (1, 1, 1, 1), 'axis': (0, 1, 2, 3), 'size': (1000, 10, 100, 5)}],
                               warmup=100, runs=1000, profiler='python')
print(add_res)

add_res = run_performance_test(nd.broadcast_axis, run_backward=True, dtype='float32', ctx=mx.cpu(),
                               inputs=[{'data': (1, 1000, 1, 100, 1), 'axis': (0, 2, 4), 'size': (2, 10, 5)}],
                               warmup=100, runs=1000, profiler='python')
print(add_res)

Results

| Code version | Case | Avg | p50 | p90 |
| --- | --- | --- | --- | --- |
| master LT | (1000, 1, 100, 1) -> (1000, 10, 100, 5) | 27.5 | 26.05 | 21.47 |
| master LT | (1000, 1, 1, 100) -> (1000, 10, 5, 100) | 29.68 | 28.59 | 35.93 |
| master LT | (1, 1, 1, 1) -> (1000, 10, 100, 5) | 129.80 | 123.30 | 169.04 |
| master LT | (1, 1000, 1, 100, 1) -> (2, 1000, 10, 100, 5) | 63.60 | 58.88 | 74.74 |
| master no-LT | (1000, 1, 100, 1) -> (1000, 10, 100, 5) | 14.30 | 12.76 | 18.00 |
| master no-LT | (1000, 1, 1, 100) -> (1000, 10, 5, 100) | 13.21 | 12.33 | 15.67 |
| master no-LT | (1, 1, 1, 1) -> (1000, 10, 100, 5) | 66.80 | 60.77 | 83.80 |
| master no-LT | (1, 1000, 1, 100, 1) -> (2, 1000, 10, 100, 5) | 31.81 | 29.71 | 40.23 |
| new LT | (1000, 1, 100, 1) -> (1000, 10, 100, 5) | 17.53 | 17.36 | 22.42 |
| new LT | (1000, 1, 1, 100) -> (1000, 10, 5, 100) | 15.49 | 14.79 | 17.64 |
| new LT | (1, 1, 1, 1) -> (1000, 10, 100, 5) | 127.74 | 126.06 | 131.19 |
| new LT | (1, 1000, 1, 100, 1) -> (2, 1000, 10, 100, 5) | 39.23 | 38.95 | 39.48 |
| new no-LT | (1000, 1, 100, 1) -> (1000, 10, 100, 5) | 9.41 | 8.38 | 11.84 |
| new no-LT | (1000, 1, 1, 100) -> (1000, 10, 5, 100) | 8.29 | 7.65 | 10.61 |
| new no-LT | (1, 1, 1, 1) -> (1000, 10, 100, 5) | 67.54 | 63.63 | 86.17 |
| new no-LT | (1, 1000, 1, 100, 1) -> (2, 1000, 10, 100, 5) | 23.57 | 23.39 | 23.92 |

SSD training performance

| Code | SSD 1-epoch time (sec) | % speedup/slowdown w.r.t. master (large tensor disabled) |
| --- | --- | --- |
| Master (large tensor disabled) | 226 | 0 |
| Master (large tensor enabled) | 335 | 48.23% slowdown |
| Master + CPU-optimized broadcast_axis (large tensor disabled) | 130 | 42.5% speedup |
| Master + CPU-optimized broadcast_axis (large tensor enabled) | 184 | 18.5% speedup |

@ptrendx
Member

ptrendx commented Mar 20, 2020

Is this going to concentrate on CPU performance? GPU performance of broadcast_axis/like/etc. is really poor too (a few days ago I was measuring it and got ~150 GB/s on a V100 GPU out of 900 GB/s peak bandwidth).

@ptrendx
Member

ptrendx commented Mar 20, 2020

The biggest speedup you can get here would be from vectorization, so that you don't need to do those index calculations all the time (just once per vector).

@access2rohit access2rohit force-pushed the broadcast_axis_improv branch 4 times, most recently from c565b7c to d088b9f Compare March 27, 2020 17:33
@access2rohit access2rohit changed the title [WIP]Improve performance of broadcast_axis Improve performance of broadcast_axis Apr 6, 2020
@access2rohit access2rohit force-pushed the broadcast_axis_improv branch 2 times, most recently from b3fb413 to aec9dcb Compare April 6, 2020 17:25
@access2rohit access2rohit force-pushed the broadcast_axis_improv branch 2 times, most recently from 71ff22e to 4fe234c Compare April 15, 2020 16:00
@leezu
Contributor

leezu commented Apr 15, 2020

@access2rohit can you rebase on latest master to see if that fixes CI?

@apeforest apeforest requested a review from haojin2 April 15, 2020 19:39
Contributor

@apeforest apeforest left a comment


Could we also take Da's suggestion by using int32_t for GPU kernels?

@access2rohit access2rohit force-pushed the broadcast_axis_improv branch 2 times, most recently from 65f61aa to be8d5b6 Compare April 17, 2020 16:59
@access2rohit
Contributor Author

Could we also take Da's suggestion by using int32_t for GPU kernels?

@apeforest
For this op only, or for all? For this op only, I can take a look.
If for all ops, I would like to do that in a separate PR.

@access2rohit
Contributor Author

@access2rohit can you rebase on latest master to see if that fixes CI?

It was still failing after the rebase. I have fixed the failing case and pushed new changes now.

@apeforest
Contributor

Please add more details in the PR description to explain the rationale of this change:
(1) Why make the change
(2) How this change improved performance
(3) Any side effect

@access2rohit
Contributor Author

access2rohit commented Apr 17, 2020

Please add more details in the PR description to explain the rationale of this change:

@apeforest

  1. It's in the PR heading. Additionally, it's being done to fix a BERT performance regression when using the Large Tensor build.
  2. It leverages vectorization. I can't explain line by line how; that would be too much text to write. I can share a link about vectorization for reviewers to read. Let me know if you meant something else. I can write one line for each of the 3 cases if necessary.
  3. I didn't understand this question. Could you clarify what you meant by side effect?

@access2rohit access2rohit force-pushed the broadcast_axis_improv branch 3 times, most recently from 63f7d39 to 9f66804 Compare April 19, 2020 19:04
@access2rohit
Contributor Author

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Jun 27, 2020
@access2rohit
Contributor Author

@apeforest can you review? I have addressed your comments.

@access2rohit
Contributor Author

@zheng-da Can you please review?

@access2rohit
Contributor Author

@mxnet-label-bot update [pr-awaiting-merge]

@lanking520 lanking520 added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed pr-awaiting-review PR is waiting for code review labels Jun 29, 2020
@szha szha merged commit 638622f into apache:master Jun 29, 2020
ys2843 pushed a commit to ys2843/incubator-mxnet that referenced this pull request Jun 29, 2020
* adding comments explaining code optimizations

* fixing broadcast_axis kernel to int32

* fixing slice_axis kernel to int32

* combining CPU and GPU implementation method signatures and cleaned up
code

* adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
AntiZpvoh pushed a commit to AntiZpvoh/incubator-mxnet that referenced this pull request Jul 6, 2020
* adding comments explaining code optimizations

* fixing broadcast_axis kernel to int32

* fixing slice_axis kernel to int32

* combining CPU and GPU implementation method signatures and cleaned up
code

* adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
zheyuye added a commit to zheyuye/incubator-mxnet that referenced this pull request Jul 17, 2020
commit a77f774ed179786fc8429d913a2da1d942528de9
Author: Leonard Lausen <[email protected]>
Date:   Fri Jul 17 05:01:17 2020 +0000

    Remove NNPACK integration (#18722)

commit 3ef00b8840c05c49118705f6fd9663ebb951f3a1
Author: Andrei Ivanov <[email protected]>
Date:   Thu Jul 16 16:57:58 2020 -0700

    Refactoring of Pooled Storage Manager classes (#18582)

    * Refactoring of Pooled Storage Manager classes

    * Adding test for new functionality

    * Fixing compilation problems which appear for MXNET_USE_CUDA=0

    * Fixing compilation problems for WINDOWS and ANDROID

    * Fixing compilation problems which appear for WINDOWS and __APPLE__

    * Fixing lint problems

    * test_dataloader_context(): Bypassing custom_dev_id pinned mem test on system with GPUs < 2.

    * Fixing compilation for Android. Elimination of unused includes.

    * Fixing problems with CPUPinned Storage Manager which appears when MXNET_USE_CUDA = 0

    * Removing test_bucketing.py

    * Imroving CPU_Pinned Pooled Storage Manager case.

    * Fixing lint problem

    * The GPU profiling commands calls moved into mutex area

    * Fixing lint problem

    * Improved reporting regarding the Storage Manager used.

    * Fixing lint problem

    * Trigger CI

    * Removing some comments, as suggested by @szha

    * Trigger CI

    * Trigger CI

    Co-authored-by: andreii <[email protected]>

commit 2abf0b8c2b3361c73c9dfdeabdb8a88278b693d0
Author: Leonard Lausen <[email protected]>
Date:   Thu Jul 16 17:41:22 2020 +0000

    Initialize docker cache in build.py for docker-compose containers (#18724)

commit 37bdf0bf981d11a89bd248b02f473211d57bc9c6
Author: JackieWu <[email protected]>
Date:   Fri Jul 17 01:25:01 2020 +0800

    [MXNET-1453] Support the intput whose dimension is greater than 6 for Transpose and Rollaxis (#18707)

    * support 6+ dims for transpose

    * test over

    * reorder code

    * fix transposeex

commit 8198442f0c7bde0fc47f507c3f81a0b5cf0a5235
Author: AntiZpvoh <[email protected]>
Date:   Thu Jul 16 15:01:59 2020 +0800

    [numpy] symbolic advanced indexing (#18319)

    * add ndarray and boolean indexing for numpy symbol

    * fix sanity and unit test

    * ensure consistency between the imperative and symbolic interface

    * Update python/mxnet/numpy/multiarray.py and add new test
    Co-authored-by: Leonard Lausen <[email protected]>

    * Don't rely on indexing_key_expand_implicit_axes for deciding if
    _npi.advanced_indexing_multiple is applicable

    * fix sanity

    Co-authored-by: Leonard Lausen <[email protected]>

commit 690132516a0a99337625248772fd44930686a82b
Author: 蔡舒起 <[email protected]>
Date:   Thu Jul 16 10:12:20 2020 +0800

    Add the newest mxnet discuss  version. Add d2l.ai (#18663)

    * Add the newest mxnet discuss  version. Add d2l.ai

    * delete [] and insert old version

commit e2366e9102e6862416bf998af52baaa5e9c0a31b
Author: Leonard Lausen <[email protected]>
Date:   Wed Jul 15 22:01:36 2020 +0000

    Refactor scope functionality in Python API (#18619)

    * Refactor scope functionality in Python API

    - Remove deprecated metaclass functionality
    - Remove global state in naming
    - Switch from threading.local to asyncio compatible contextvars
    - Stop exposing UUIDs in parameter name

    * Fix dependencies

    * Fixes

    * Fixes

    * Fix

    * Fix after merge master

commit 12ec04611c78a603c03707488d66bdbbedf0d536
Author: Chaitanya Prakash Bapat <[email protected]>
Date:   Wed Jul 15 13:59:34 2020 -0700

    Migrate from private to public jetson toolchain files (#18677)

commit 0dc30a2c170fd0aa369d325a1feae6aad75a52c2
Author: Leonard Lausen <[email protected]>
Date:   Wed Jul 15 01:02:36 2020 +0000

    Enable GPU Memory profiler tests (#18701)

    * Enable GPU Memory profiler tests

    Previously tests are not run as test_profiler.py was not taken into account on
    GPU CI runs and some tests were marked for being skipped if run on a CPU-only
    machine.

    * Disable broken tests

commit d512814c2981f9bfb23937064634982ca97d0338
Author: Leonard Lausen <[email protected]>
Date:   Wed Jul 15 00:57:38 2020 +0000

    Disable test coverage in MKL builds (#18443)

    * Disable test coverage in MKL builds

    * Enable test parallelization

    * Set OMP_NUM_THREADS

    * Fix

    * Fix unpack_and_init

commit d8430b6b412e637d07b291dbee1350df7168234d
Author: Leonard Lausen <[email protected]>
Date:   Wed Jul 15 00:53:49 2020 +0000

    Set CMAKE_CUDA_COMPILER in aarch64-linux-gnu-toolchain.cmake (#18713)

    CMAKE_CUDA_HOST_COMPILER will be reset if CMAKE_CUDA_COMPILER is not set as of cmake 3.17.3

    See https://gitlab.kitware.com/cmake/cmake/-/issues/20826

commit f125f5fd9ff91e9a70e5add3735c32d4e3bf9cd0
Author: Yang Shi <[email protected]>
Date:   Tue Jul 14 14:29:14 2020 -0700

    Fix all anchor shifts on website (#18674)

commit 7c9c4fc3d3ef66310537c0bc6810a90af551a63e
Author: Yang Shi <[email protected]>
Date:   Tue Jul 14 14:28:17 2020 -0700

    Merge content from numpy.mxnet.io into mxnet official website (#18691)

commit 7f7e1c5a714262e8cd1015716258416e6ce1ff3e
Author: Serge Panev <[email protected]>
Date:   Tue Jul 14 14:12:00 2020 -0700

    Add better partial args/aux handling in symbol optimize_for (#18350)

    * Add missing args/aux support in optimize_for and deferred inference option

    Signed-off-by: Serge Panev <[email protected]>

    * Add input shape_dict, type_dict and stype_dict to optimize_for

    Signed-off-by: Serge Panev <[email protected]>

    * Remove warnings for Werror

    Signed-off-by: Serge Panev <[email protected]>

    * Address PR comments

    Signed-off-by: Serge Panev <[email protected]>

commit 9d623926d4857a2cfa32515b58cd1398371f97f3
Author: Yang Shi <[email protected]>
Date:   Mon Jul 13 15:54:51 2020 -0700

    Fix python micro-site table of content bugs (#18664)

    * update footer style

    * add compiled css of footer styles changes

    * add same style for footer2

    * more fix to the toc

commit 8ebb5372c3ad414cde096fb82de8be14cb748b11
Author: Sheng Zha <[email protected]>
Date:   Mon Jul 13 13:17:12 2020 -0700

    add 'needs triage' label to new bug reports (#18696)

commit 9c5b95a9c5d6f83a067504fb47fac4e3aed27e81
Author: Serge Panev <[email protected]>
Date:   Mon Jul 13 11:45:29 2020 -0700

    Partition API adding and deleting new params to Block and Symbol (#18405)

    * Add deleting of args aux aux to Partition API

    Signed-off-by: Serge Panev <[email protected]>

    * Delete args from Block.params

    Signed-off-by: Serge Panev <[email protected]>

    * Fix to use arg/auxdict when optimize_for is called in HybridBlock

    Signed-off-by: Serge Panev <[email protected]>

    * Address PR comments

    Signed-off-by: Serge Panev <[email protected]>

commit 19e373daac76b466cf11b5d31fa5d5e2eb518a21
Author: Leonard Lausen <[email protected]>
Date:   Sat Jul 11 09:09:51 2020 -0700

    Fix scipy dependency in probability module (#18689)

    * Fix scipy dependency in probability module

    * Fix copy-paste error

    * dtype='float32' for digamma and gammaln

commit a9b16f7024878611b236c9f3734ccd37a5a35d38
Author: JackieWu <[email protected]>
Date:   Sat Jul 11 02:59:21 2020 +0800

    change bn test (#18688)

commit beafba76395e75c093f99d20ac62e38f48e91012
Author: JackieWu <[email protected]>
Date:   Thu Jul 9 08:01:35 2020 +0800

    [Improvement] Invoke mkldnn and cudnn BatchNorm when axis != 1 (#18504)

    * fix batch norm when fix_gamma is True

    * support gradient accumulation for batch norm

    * mkldnn batchnorm support grad add

    * unittest for bn

    * fix bn arg

    * fix lint

    * fix mkldnn

    * fix mkldnn bn

    * fix grad when fixing gamma

    * fix naive gpu bn

    * fix lint

    * invoke mkldnn and cudnn batchnorm when axis != 1

    * backport 18500

    * change condition

    * fix

    * fix

    * add mkldnn_off for bn

    * remove mkldnn_off

    * recover save_000800.json

    * cast

commit 348ab4d8d77359bf60d97a0befbd9086fd52ee49
Author: Yang Shi <[email protected]>
Date:   Tue Jul 7 15:06:34 2020 -0700

    fix broken installation widget - remove empty entries (#18661)

commit b4b8b805fe94a6df905c6eae7f6c1f83cfea9b73
Author: Xi Wang <[email protected]>
Date:   Wed Jul 8 01:22:05 2020 +0800

    Gluon.probability (#18403)

    * package created

    * mvn WIP

    * normal wip, to be tested

    * update

    * docstring added, normal mostly done

    * add test file

    * Bernoulli WIP

    * bernoulli wip

    * bernoulli doc done

    * dense variational WIP

    * add kl infra

    * implement normal kl method

    * refactor kl

    * add not implemented handling, rename kl_storage

    * add  abstract method and Categorical class

    * rewrite logit2prob prob2logit for multiclass support

    * normal broadcast_to implemented

    * categorical mostly done

    * update distributions/utils.py

    * add dot ahead of import

    * fix normal F

    * bernoulli, normal brief tests implemented

    * add hybridize tests

    * transformation infras done

    * affine transformation, implemented tested

    * add tests cases

    * add sum_right_most

    * fix get F bug

    * compose transform implemented, tested

    * fix

    * add event_dim

    * fetch mvn from upstremm

    * clean code, implement normal cdf and tests

    * constraint in bernoulli done

    * fix constraint

    * finish half normal

    * add cached_property

    * add test on cached_property

    * add more features to distribution and constratins

    * change constraint

    * fix bernoulli

    * add independent

    * add independent tests

    * update naming of cached_property

    * revert

    * add constraints

    * add Cat

    * add Stack for imperative mode

    * add Stack for imperative mode

    * add bernoulli entropy

    * categorical WIP

    * categorical sampling implemented

    * finish categorical log_prob, sampling

    * enumerate_support finished

    * polish StochasticBlock, add test

    * add test for stochastic sequential

    * clean loss list in __call__

    * fix affine, implement sigmoid, softmax

    * add gumbel, relaxed bernoulli

    * relaxed one-hot sampling implemented

    * gamma done

    * gamma, dirichlet implemented

    * beta done

    * gumbel softmax log-likelihood implemented

    * refactor tests, implement exponential, fix compose transform

    * weibull implemented, transformed distribution cdf icdf added

    * pareto implemented

    * uniform wip

    * uniform done

    * rewrite lgamma, implement chi2

    * fix chi2 scale

    * F distributiion done

    * t implemented

    * fix tiny problem

    * cauchy done

    * add half cauchy

    * multinomial done, tests to be added

    * add multinomial test

    * MVN done, tests todo

    * mvn polished

    * fix a few precison issues

    * add erf, erfinv unified api and learnable transform

    * fix mvn attribute check

    * MVN done

    * poisson done

    * hack poisson for size support

    * geometric finished

    * negative binomial done

    * binomial done

    * implement some kl

    * add more kl

    * refactor kl test

    * add more kl

    * binomial kl todo

    * change constraint logical op implement

    * implement gamma entropy

    * finish beta dirchlet entropy

    * finishi all entropy

    * kl finished

    * add constraint test

    * domain map done

    * remove bayesian dense

    * fix tiny problems

    * add kl uniform normal

    * add kl tests

    * acquire patch from upstream

    * add some doc

    * finish doc

    * refactor kl test(WIP)

    * add more kl, fix float32 underflow issue

    * make sampling more stable

    * handle inconsistent mode

    * replace boolean idx with np.where

    * fix file name

    * add more doc

    * add constraint check

    * add half_normal/cauchy pdf cdf support check

    * fix import problem

    * change nosetest to pytest

    * remove buggy lines

    * change alias register path

    * attempt to fix ci

    * fix lint, change a few tests

    * fix lint

    * modify hybrid sequential

    * fix lint

    * change import order

    * add test gluon probability v2

    * fix hybridize flag

    * change implementation of stochastic block

    * fix lint

    * fix comments

    * fix block

    * modify domain map

    * add raises for improper add_loss

    * add raises for improper add_loss

    * add extra cases

    * change collectLoss decorator to mandatory

    * skip stochastic block tests

    * remove test cases

    * put gpu tests back

    * add test_gluon_stochastic_block back

    * remove export test

    * put a test back

    * tiny refactor

    * add memory leak flag

    * small changes

    Co-authored-by: Zheng <[email protected]>

commit 54c0155b7581f5e10b1469a17ddf127d3c75e156
Author: Yang Shi <[email protected]>
Date:   Mon Jul 6 17:01:42 2020 -0700

    User Feedback Widget (#18639)

    * user feedback widget implementation

    * add user feedback widget to python docs site

    * update margin

    * add apache license

    * one more license

    * turn off feedback widget on python site

    * update copy

    * format

    * add event value field

    * turn on widget on Python site

commit 646288716cbba482d4ede0fb4f6141b2ea505090
Author: Yiyan66 <[email protected]>
Date:   Sat Jul 4 09:13:41 2020 +0800

    [numpy] Fix less/greater bug with scalar input (#18642)

    * fix ffi

    * fix less/greater error

    * back

    * submodule

    * fixed

    Co-authored-by: Ubuntu <[email protected]>

commit d1b0a09669d1fa17b12a9acee887672d1e621523
Author: Yiyan66 <[email protected]>
Date:   Fri Jul 3 15:10:55 2020 +0800

    [numpy] FFI flip, rollaxis, stack (#18614)

    * flip

    * rollaxis

    * stack

    * fixed

    * retrigger ci

    Co-authored-by: Ubuntu <[email protected]>

commit c519e0e2db54fb8ad74e0e44d586720bf4023490
Author: Leonard Lausen <[email protected]>
Date:   Thu Jul 2 18:21:08 2020 -0700

    Mark test_get_symbol as garbage_expected (#18595)

commit d1b2cd9d8ada39ab4f16caff4ac43337476f2efc
Author: Leonard Lausen <[email protected]>
Date:   Thu Jul 2 18:20:48 2020 -0700

    build.py --no-pull (#18589)

    Add --no-pull option which disables overwriting the local docker cache based on CI docker cache. It is useful when locally changing Dockerfiles.

commit 0c8b6b2405e8313db3cf1a6f374a945d3c871b26
Author: Yang Shi <[email protected]>
Date:   Thu Jul 2 13:15:54 2020 -0700

    Clipboard refactor (#18605)

    * refactor clipboard

    * make lang getter more extensible

    * trigger ci

commit a8c8dea67593df7f1d2061893dddfdeee4750d9f
Author: Tao Lv <[email protected]>
Date:   Wed Jul 1 22:53:54 2020 +0800

    update to onednn v1.4 (#18273)

commit 9a122cac5e1317ccca2dea6884253ce32ac3671a
Author: bgawrych <[email protected]>
Date:   Wed Jul 1 16:43:06 2020 +0200

    Fix softmax, logsoftmax failed on empty ndarray (#18602)

    * Fix failing empty array (log_)softmax

    * Modify test for npx (log_)softmax

commit 37bed6e3af794624d651e888101eceb30c27c001
Author: Andrzej Kotłowski <[email protected]>
Date:   Wed Jul 1 16:39:22 2020 +0200

    Fix BatchNorm backward synchronization (#18644)

    * Add test for BatchNorm running variables synchronization

    * Fix BatchNorm backward synchronization

    It fixes issue #18610

commit 21581060d2f967cc2faeb5a76979cdffbf578657
Author: XIAO-XIA <[email protected]>
Date:   Tue Jun 30 14:16:20 2020 +0800

    [Numpy] FFI: tril_indices (#18546)

    * add numpy tril_indices ffi

    * Update src/api/operator/numpy/np_matrix_op.cc

    Co-authored-by: Haozheng Fan <[email protected]>

    Co-authored-by: Haozheng Fan <[email protected]>

commit 638622f37dcc4ef4b36dcabfd3d7a695fdb7d4c9
Author: Rohit Kumar Srivastava <[email protected]>
Date:   Mon Jun 29 14:36:42 2020 -0700

    Improve performance of broadcast_axis on CPU (#17882)

    * adding comments explaining code optimizations

    * fixing broadcast_axis kernel to int32

    * fixing slice_axis kernel to int32

    * combining CPU and GPU implementation method signatures and cleaned up
    code

    * adding new broadcast_axis to np_matmul

    Co-authored-by: Rohit Kumar Srivastava <[email protected]>

commit becb9ca694f51fdc0583d58429ccc943e6462810
Author: Sheng Zha <[email protected]>
Date:   Mon Jun 29 12:16:16 2020 -0700

    Remove mention of nightly in pypi (#18635)

commit b12abbfb356be93f8c24d427c72448f91d1980ec
Author: ciyong <[email protected]>
Date:   Mon Jun 29 11:14:34 2020 +0800

    Enhance license checker to cover multiple license header and md files (#18633)

commit d6c35785a870ac6e0b42903d7e27de2c9a6efdbe
Author: Shuai Zheng <[email protected]>
Date:   Sat Jun 27 13:25:03 2020 -0700

    Add LANS optimizer (#18620)

    * add lans optimizer

    * fix

    * fix

    Co-authored-by: Zheng <[email protected]>

commit 8ee460077b8e8f2d7a1dd96efca1751fc337cb63
Author: Yang Shi <[email protected]>
Date:   Fri Jun 26 11:22:15 2020 -0700

    fix contrib interleaved_matmul_selfatt_valatt not render correctly (#18621)

commit ecbda07c7bf8ce671744f0e9d361a1e8b5b744da
Author: Yang Shi <[email protected]>
Date:   Thu Jun 25 11:11:00 2020 -0700

    fix julia api redirect (#18613)

commit c9dcdd11853e8600879615c8d8be0aa5cdf851cf
Author: Yang Shi <[email protected]>
Date:   Thu Jun 25 11:02:09 2020 -0700

    add version check on installation guide (#18587)

commit e4c93e3e3a68559cb38e4ff92c9e0bf9c9cdd0bf
Author: Shuai Zheng <[email protected]>
Date:   Wed Jun 24 22:03:39 2020 -0700

    add epsilon to adamax (#18532)

    Co-authored-by: Ubuntu <[email protected]>

commit 3f555f850f4eef897bbafcb61df726491954ffbb
Author: Leonard Lausen <[email protected]>
Date:   Wed Jun 24 19:41:34 2020 -0700

    Update disclaimer wording (#18616)

commit 1fcc7ea8b8f5dfebd3f5440ffe9e0c7d4b13b90f
Author: RuRo <[email protected]>
Date:   Wed Jun 24 12:03:20 2020 +0300

    use new mxnet.gluon.block APIs (#18601)

commit acf2d27efe583ceb0f6b5253f0ac78ad6bf00e8e
Author: acphile <[email protected]>
Date:   Wed Jun 24 10:25:44 2020 +0800

    Update tutorials (#18609)

    Update docs according to new Block APIs (#18413)

commit 4b86c32832a994e76b97dfc58c8a672db87e721d
Author: mk-61 <[email protected]>
Date:   Tue Jun 23 13:49:06 2020 -0700

    Allow input reordering duing Gluon / CachedOp graph transformations (#17949)

    * Initial commit of input reordering in Gluon

    * Add test for Gluon input reorder

    * Fix backward in CachedOp for input reordering

    * Fix test_input_reorder for backward pass

    * Fix merge error in NaiveCachedOp

    * Include correct header for std::iota

    Co-authored-by: Vladimir Cherepanov <[email protected]>

commit 74fcb9938a14ec80f0c690b5a58a700537a621c5
Author: Yang Shi <[email protected]>
Date:   Mon Jun 22 18:54:05 2020 -0700

    redirect api reference on v-master to v1.6 (#18607)

    * redirect api reference on v-master to v1.6

    * update R docs

commit 56cfd9c272e81988682db6fde1b9205becc6a235
Author: Ram Rachum <[email protected]>
Date:   Mon Jun 22 21:23:04 2020 +0300

    Use chain.from_iterable in artifact_repository.py (#18578)

commit 2fbec60e0da8832d71f7e3f93d4407dbca745e51
Author: Haibin Lin <[email protected]>
Date:   Sun Jun 21 23:02:13 2020 -0700

    graph executor c api removal  (#18598)

    * add default ctx to cachedop fwd

    * add test

    * perl fix

    * initial commit

    * update sparse tests

    * add aux_states

    * fix aux-state type

    * fix some tests

    * fix check symbolic forwrad/backward

    * fix symbolic grad check

    * arg_dict fixes

    * support init ops

    * support forward only graph

    * fix check symbolic backward stype

    * add missing file

    * replace extension test bind

    * replace bind with _bind

    * simplify backward_mul implementation

    * small fix

    * drop contrib.sparseembedding

    * remove simple_bind in test sparse ops

    * use simple_bind

    * replave simple bind in quantization

    * fix aux index

    * update amp simple_bind calls

    * drop ifft

    * fix a bug found in subgraph op

    * add aux_array method

    * replace symbols

    * minor fix

    * fix executor default context

    * fix import

    * bug fix for nd.where

    * add subgraph test

    * fix forward grad req

    * fix batch dot dtype

    * remove unused code

    * fix slice dtype

    * fix attach grad

    * remove tests for non-existing sparse ops

    * MXCachedOpGetOptimizedSymbol

    * fix foreach test

    * enhance err msg

    * skip failed test

    * add docs

    * add docs

    * fix lint

    * fix lint, remove quantization

    * fix lint

    * fix lint

    * fix lint

    * fix build and import

    * fix import

    * remove scala, R, julia, perl bindings

    * remove cpp, matlab bindings

    * fix perl call

    * fix test

    * remove perl binding

    * remove reshape test

    * fix profiler, trt

    * remove tensorrt test

    * remove quantization tests

    * fix import

    * fix conflcit

    * fix lint

    * skip buggy test

    * remove clojure

    * remove executor c api

    * remove amalgamation

    * fix build

    * move executor folder

    * fix import

    * fix lint

    * fix cpp package

    * fix predict cpp

    * fix cpp make

    * remove jnilint

    * remove cpp package test

    * remove julia test pipeline

    * disable numpy tests

    * disable compat test for delete

    Co-authored-by: EC2 Default User <[email protected]>
    Co-authored-by: Lin <[email protected]>

commit c1098aa33d6795f84a19601d0319d5bb8e19f317
Author: Haibin Lin <[email protected]>
Date:   Sat Jun 20 14:49:58 2020 -0700

    Switch to cached op in the testing suite (#18579)

    * add default ctx to cachedop fwd

    * add test

    * perl fix

    * initial commit

    * update sparse tests

    * add aux_states

    * fix aux-state type

    * fix some tests

    * fix check symbolic forward/backward

    * fix symbolic grad check

    * arg_dict fixes

    * support init ops

    * support forward only graph

    * fix check symbolic backward stype

    * add missing file

    * replace extension test bind

    * replace bind with _bind

    * simplify backward_mul implementation

    * small fix

    * drop contrib.sparseembedding

    * remove simple_bind in test sparse ops

    * use simple_bind

    * replace simple bind in quantization

    * fix aux index

    * update amp simple_bind calls

    * drop ifft

    * fix a bug found in subgraph op

    * add aux_array method

    * replace symbols

    * minor fix

    * fix executor default context

    * fix import

    * bug fix for nd.where

    * add subgraph test

    * fix forward grad req

    * fix batch dot dtype

    * remove unused code

    * fix slice dtype

    * fix attach grad

    * remove tests for non-existing sparse ops

    * MXCachedOpGetOptimizedSymbol

    * fix foreach test

    * enhance err msg

    * skip failed test

    * add docs

    * add docs

    * fix lint

    * fix lint, remove quantization

    * fix lint

    * fix lint

    * fix lint

    * fix build and import

    * fix import

    * fix perl call

    * fix test

    * remove perl binding

    * remove reshape test

    * fix profiler, trt

    * remove tensorrt test

    * remove quantization tests

    * fix import

    * fix conflict

    * fix lint

    * skip buggy test

    Co-authored-by: EC2 Default User <[email protected]>
    Co-authored-by: Lin <[email protected]>

commit c1b96f562f55dfa024ac941d7b104f00e239ee0f
Author: Leonard Lausen <[email protected]>
Date:   Fri Jun 19 14:46:27 2020 -0700

    cmake: x86 options only on x86 and remove manual specification on CI (#18588)

    Use CMAKE_SYSTEM_PROCESSOR to detect target architecture and make x86 related
    options available only when compiling for x86. Remove the code turning these
    options manually off on CI.

    Remove ANDROID cmake option which was used to decide if -lpthread needs to be
    specified explicitly (on most Linux systems) or not (on Android). Instead
    auto-detect the behavior.

commit 041bd3016375c6bdadddc9e9f43655923ee739bf
Author: RuRo <[email protected]>
Date:   Fri Jun 19 21:56:05 2020 +0300

    [MXNET-889] Implement ONNX export for gluon LSTM. (#17734)

    * implement onnx translations for _full type nodes

    * implement onnx translations for _rnn_param_concat

    * implement onnx translations for RNN (LSTM mode)

    * implement node export unittest for gluon.LSTM

commit bf0753702b37cc932baf417be2af2e7abe034bab
Author: Manu Seth <[email protected]>
Date:   Fri Jun 19 10:20:55 2020 -0700

    Link GluonCV object detection tutorial for Jetson (#18530)

    * add object detection tutorial for Jetson

    * adding GluonCV in title

    * cross reference gluoncv tutorial

commit cb54a4a99463b23b8abaa2629661954c4ba3c60b
Author: acphile <[email protected]>
Date:   Fri Jun 19 14:31:08 2020 +0800

    Simplify mxnet.gluon Block APIs (#18413)

    ## Motivations
    Currently the implementation of mxnet.gluon.block is not very Pythonic, and there are many redundancies.

    ### 1. overlaps between Block._params and Block._reg_params
    when we want to self-define a model, we currently need to use the code as follows:
    ```
    class Net(nn.HybridBlock):
        def __init__(self, **kwargs):
            super(HybridNet, self).__init__(**kwargs)
            with self.name_scope():
                self.hidden1 = nn.Dense(256, activation='relu')
                self.a=self.params.get('a', shape=(1, ))
    ```
    There are several shortcomings when using this form of registration:
    a. Adding parameter ‘a’ records it twice, in both self._params and self._reg_params, which is a redundancy. There is also a discrepancy in Block:
        i. the method “collect_params” uses “_params” to get all parameters,
        ii. while the method “_collect_params_with_prefix” (and accordingly “load_parameters”) uses “_reg_params” to get all parameters.
    b. Currently, if we do not use “with self.name_scope():” for children blocks, we get wrong name scopes. In the following example, we cannot actually get the parameters of self.hidden1 from the result of collect_params:
    ```
    class HybridNet(nn.HybridBlock):
        def __init__(self, **kwargs):
            super(HybridNet, self).__init__(**kwargs)
            self.hidden1 = nn.Dense(256, activation='relu')
            with self.name_scope():
                self.hidden2 = nn.Dense(10, activation='relu')

        def hybrid_forward(self, F, x):
            x = self.hidden2(self.hidden1(x))
            return x

    >>> net = HybridNet()
    >>> net.initialize()
    >>> print(net.collect_params())
    hybridnet0_ (
      Parameter dense0_weight (shape=(256, -1), dtype=float32)
      Parameter dense0_bias (shape=(256,), dtype=float32)
      Parameter hybridnet0_dense0_weight (shape=(10, -1), dtype=float32)
      Parameter hybridnet0_dense0_bias (shape=(10,), dtype=float32)
    )
    ```
    From the above example we can also find that the parameter names are not related to the attributes’ names, which is not straightforward.

    In all, we find that using name_scope and ParameterDict is not user-friendly. Thus we plan to remove such redundancies and simplify the definitions of children blocks and parameters, like:
    ```
    class Net(nn.HybridBlock):
        def __init__(self, **kwargs):
            super(HybridNet, self).__init__(**kwargs)
            self.hidden1 = nn.Dense(256, activation='relu')
            self.a=gluon.parameter.Parameter(name="a", shape=(1, ))
    ```

    ### 2. parameter sharing
    Currently, we use the parameter “params” in the definition of Block for parameter sharing. This means that before the __init__ of Block runs, shared parameters are already recorded in self._params.shared. Currently Block also forbids overriding parameters.
    We think this is not convenient. The most common way to share a parameter is what PyTorch does, e.g.
    ```
    self.hidden1.weight=self.hidden2.weight
    ```
    But note that in the case where we have a HybridBlock that has been hybridized, we shouldn't allow overriding the parameter; instead we ask the user to un-hybridize the Block first.
    To further allow sharing parameters recursively, we plan to add an API:
    ```
        def share_parameters(self, params : Dict):
    ```
    We plan to use the structure-based form (like what is used in “_collect_params_with_prefix()”) to represent each parameter recursively. For example, we denote “self.hidden1.weight” as “hidden1_weight”

    In all, we plan to make the following improvements:

    1. remove the parameters “prefix” and “params” from the `__init__` function.
    2. remove the use of self._params (ParameterDict) in Block.
    3. allow parameter attribute overriding in the non-hybridization case.
    4. add the method “share_parameters” to recursively share parameters in children blocks.

    ## Parameter naming
    Once a parameter is created, `param.name` will not change in subsequent operations. It takes the form `param_{uuid4}_{name}`, where `name` comes from the `__init__` argument. Here `name` is optional and defaults to `weight`. It is mainly used to denote which default initialization should be used.
    We use `param.name` as the name of a parameter's symbol representation.
    ## collect_params()
    It returns a `dict`, where the keys are structural names of parameters, like
    `{'hidden1.weight': Parameter (shape=(3, -1), dtype=float32), 'hidden1.bias': Parameter (shape=(3,), dtype=float32)}`
    Note that we use `.` as the linking character because the structure-based naming scheme is no longer used in the symbol representation.

    ## Save and Load
    For `HybridBlock`, there are two ways to save and load parameters:
    ### save_parameters() and load_parameters()
    In `save_parameters()`, we use `structural name` to save parameters, and they should be loaded by `load_parameters()`, which loads parameters based on a model's structure.
    ### HybridBlock.export and SymbolBlock.imports
    In `export`, we only save parameters using `param.name` without `structural name`. The param file should be loaded in SymbolBlock.imports.
    ## SymbolBlock
    When using `SymbolBlock.imports`, keys in `self.param` would be the loaded parameters' names `param.name`.
    While in `SymbolBlock(outputs, inputs, params=None)`, if you provide something like `params=net.collect_params()`, keys in `self.param` would be the structural names of `net`'s parameters (the keys in net.collect_params()). This is often used in the situation where a `SymbolBlock` is a child block of another `HybridBlock`. Otherwise, keys in `self.param` would be the loaded parameters' names `param.name`.
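The structure-based naming scheme described above can be sketched in plain Python. This is a minimal illustration, not actual MXNet code: the `Parameter`, `Block`, `Dense`, and `Net` classes here are simplified stand-ins, assumed only for demonstrating how `.`-joined attribute names and fixed `param_{uuid4}_{name}` names would behave.

```python
# Minimal sketch (not MXNet code) of structure-based parameter naming:
# collect_params() keys are built by joining attribute names with '.',
# independent of any global name scope.
import uuid

class Parameter:
    def __init__(self, name="weight", shape=None):
        # param.name is fixed at creation: param_{uuid4}_{name}
        self.name = f"param_{uuid.uuid4().hex}_{name}"
        self.shape = shape

class Block:
    def collect_params(self, prefix=""):
        params = {}
        for attr, value in vars(self).items():
            key = f"{prefix}.{attr}" if prefix else attr
            if isinstance(value, Parameter):
                params[key] = value
            elif isinstance(value, Block):
                params.update(value.collect_params(prefix=key))
        return params

class Dense(Block):
    def __init__(self, units):
        self.weight = Parameter("weight", shape=(units, -1))
        self.bias = Parameter("bias", shape=(units,))

class Net(Block):
    def __init__(self):
        self.hidden1 = Dense(256)
        self.a = Parameter("a", shape=(1,))

net = Net()
keys = sorted(net.collect_params())
print(keys)  # ['a', 'hidden1.bias', 'hidden1.weight']
```

Note how the keys are purely structural, so loading by structure (`load_parameters`) and exporting by `param.name` (`export`) can coexist without a ParameterDict.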

commit 55856066b4b6242f233cc31da8970c91f06d4bc0
Author: ciyong <[email protected]>
Date:   Fri Jun 19 06:23:07 2020 +0800

    Add KEY for Ciyong Chen (#18577)

commit e96fbeb3adb78d4300f5f10cc22531583914e590
Author: Leonard Lausen <[email protected]>
Date:   Thu Jun 18 15:20:14 2020 -0700

    Update cmake/upstream/FindCUDAToolkit.cmake (#18528)

    Previously MXNet included a hotfix for a cross-compiling bug in upstream FindCUDAToolkit.cmake. Upstream has now fixed the bug in their master branch. Replace MXNet's fix with the upstream fix to avoid diverging from upstream.

    See https://gitlab.kitware.com/cmake/cmake/-/issues/20572

commit 14aeb384a51c9e420c349f42cea001f0a5ef5dfe
Author: RuRo <[email protected]>
Date:   Fri Jun 19 01:16:12 2020 +0300

    Add parameter name to AssertionError for deferred shape inference (#18537)

commit 9591436967347cc8e34a01e126b696b3447f8081
Author: Johannes Czech <[email protected]>
Date:   Thu Jun 18 07:33:08 2020 +0200

    [Numpy] Bugfix of slice operator export (MXNet to ONNX) v2 (#18535)

    * fixed get_inputs() for onnx slice operator export

    * added unit test for onnx slice operator export

    * implement get_inputs with_shapes helper

    * update slice ops to use with_shapes

    * added verbose parameter for get_outputs()

    Co-authored-by: Andrey Stotskiy <[email protected]>

commit 92971b822dd0151aadba965c0c6b8b22cb82bf76
Author: Neutron3529 <[email protected]>
Date:   Thu Jun 18 13:30:10 2020 +0800

    fix misbehave of KLDivLoss (#18423)

    * fix misbehave of KLDivLoss

    In the current version of KLDivLoss, the return value is not the same as the value calculated by SoftmaxCrossEntropyLoss, which is not documented. This may be due to an incorrect setting that uses mean rather than sum when computing the return value.
    I provide a fix for this setting, which keeps the return values of `KLDivLoss` and `SoftmaxCrossEntropyLoss` almost the same when `from_logits=False` and `sparse_label=False` are set for these functions respectively.
    Now the behavior of KLDivLoss is exactly what the documentation says.
    ```
    import mxnet as mx
    a=mx.nd.array([[-1,1],[1,-1]])
    b=mx.nd.array([1,0]).one_hot(2)
    TrueLoss=mx.gluon.loss.SoftmaxCrossEntropyLoss(sparse_label=False)
    FalseLoss=mx.gluon.loss.KLDivLoss(from_logits=False)
    c=TrueLoss(a,b)
    d=FalseLoss(a,b)*a.shape[-1]
    assert((c-d).abs().sum()==0 and a.shape[-1]==2)
    ```
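The identity this snippet relies on can be checked without MXNet: for one-hot labels, softmax cross-entropy equals the class-mean KL divergence multiplied by the number of classes. A pure-Python sketch, under the assumption (stated in the commit) that KLDivLoss with `from_logits=False` applies log-softmax to the predictions and averages over the class axis:

```python
# Check, in plain Python, that softmax cross-entropy == num_classes *
# mean-over-classes KL divergence when labels are one-hot (0*log 0 := 0).
import math

def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def softmax_ce(pred, onehot):
    # per-row: -sum_k p_k * log q_k; for one-hot p this is -log q_y
    return [-sum(p * lq for p, lq in zip(oh, log_softmax(row)))
            for row, oh in zip(pred, onehot)]

def kldiv_from_logits_false(pred, onehot):
    # per-row: mean over classes of p_k * (log p_k - log q_k)
    out = []
    for row, oh in zip(pred, onehot):
        lq = log_softmax(row)
        terms = [p * (math.log(p) - l) if p > 0 else 0.0
                 for p, l in zip(oh, lq)]
        out.append(sum(terms) / len(terms))
    return out

a = [[-1.0, 1.0], [1.0, -1.0]]
b = [[0.0, 1.0], [1.0, 0.0]]   # one_hot of [1, 0]
c = softmax_ce(a, b)
d = [v * len(a[0]) for v in kldiv_from_logits_false(a, b)]
assert all(abs(x - y) < 1e-12 for x, y in zip(c, d))
```

This mirrors the `d = FalseLoss(a, b) * a.shape[-1]` step in the commit's example: scaling the class-mean KL by the number of classes recovers the cross-entropy.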

    * update sdml loss

    the current version of SDMLLoss says to `multiply for the number of labels`, but it actually multiplies by `batch_size`. After this PR, there is no need to multiply by `batch_size` or by the number of labels any more.

    * remove outdated comment

commit b9118d9bfa0b34307c53456ea6af3927e57b8635
Author: Yang Shi <[email protected]>
Date:   Wed Jun 17 13:00:04 2020 -0700

    fix contribute page anchor position shifted (#18571)

    Co-authored-by: Yang Shi <[email protected]>

commit eddd27d375ee403a026e3262264485c83161787f
Author: Yang Shi <[email protected]>
Date:   Wed Jun 17 11:59:41 2020 -0700

    add FAQ redirect rules (#18552)

    Co-authored-by: Yang Shi <[email protected]>

commit 103d839aa8477419ddc82f09e2ddb246e24a8d3d
Author: Manu Seth <[email protected]>
Date:   Tue Jun 16 16:52:46 2020 -0700

    Test CD mxnet_lib/static and python/pypi stages on CI (#18559)

    * add cd mxnet_lib/static stages to ci

    * add cd pypi packaging stage to ci

    * removing existing cmake static compile stages in favor of other added stages

    * pass mxnet_variant correctly

commit 8039377e6630bcb00c5a95abdaf0851803686bc6
Author: JiangZhaoh <[email protected]>
Date:   Wed Jun 17 01:45:30 2020 +0800

    add op npx.index_update (#18545)

    * add op npx.index_update

    * remove debug comment

    * change eps

    * fix stupid error

    * add blank line in docs

    * gpu temporary space request alignment

    * fix test error

    Co-authored-by: Ubuntu <[email protected]>

commit 72a54e7a5f427dc73fbd1cb826ff944d9aa82573
Author: andevellicus <[email protected]>
Date:   Mon Jun 15 22:13:13 2020 -0400

    Julia: fix deprecation in visualize.jl (#18515)

    * Update visualize.jl

    matchall has been deprecated as of Julia 1.3. Changes made to fix this.

    * Cleaned

    * Update julia/src/visualize.jl

    * Update julia/src/visualize.jl

    Co-authored-by: Iblis Lin <[email protected]>

commit e8fce62b369dac627dec23d730661624ec79b957
Author: Manu Seth <[email protected]>
Date:   Mon Jun 15 18:42:51 2020 -0700

    Skip flaky test_gpu_memory_profiler_gluon on cd pipeline (#18565)

commit 1b02225fefd8ccc93bc73223f0d3cde103fad661
Author: Chaitanya Prakash Bapat <[email protected]>
Date:   Mon Jun 15 11:45:03 2020 -0700

    Add comments to init.py (#18327)

commit cc6c64909afd78c6b5b63ee1215922e8da589c20
Author: Chaitanya Prakash Bapat <[email protected]>
Date:   Mon Jun 15 08:55:14 2020 -0700

    [OpPerf] Add example of using opperf with internal op locally (#18324)

    * add example of using opperf with internal op locally

    * split diff to old and new code for readability

    * mx.nd.copyto doesnt exist & website title shows ndarray instead of symbol

    * Revert "mx.nd.copyto doesnt exist & website title shows ndarray instead of symbol"

    This reverts commit 118b0900a58586aca84ec5c853d00cf687615853.

commit af1b45ba3590b21014c55c58838c3e04b3f2cea3
Author: Chaitanya Prakash Bapat <[email protected]>
Date:   Sun Jun 14 22:45:57 2020 -0700

    Create config.yml (#18553)

    Add options for stackoverflow and discuss to issue_template & disable blank issue

commit da252734c70164a0983404de076464ba7a526a60
Author: Manu Seth <[email protected]>
Date:   Sat Jun 13 18:30:29 2020 -0700

    remove dependency on train_mnist.py script (#18550)

    * remove dependency on train_mnist.py script

    * remove image classification tests from nightly

commit 09cf48a24682e308b552a7fa70a816c024308438
Author: Leonard Lausen <[email protected]>
Date:   Sat Jun 13 16:31:59 2020 -0700

    Use correct array type for outputs in HybridBlock.forward (#18554)

commit f1f3f44166e2e47afad6c65025fb48dd47efeb65
Author: Haibin Lin <[email protected]>
Date:   Sat Jun 13 10:10:25 2020 -0700

    Remove the deprecated BatchNorm_v1 op (#18538)

    * remove batchnorm_v1

    * fix gpu build

    Co-authored-by: EC2 Default User <[email protected]>
    Co-authored-by: Lin <[email protected]>

commit 97d4ba5a133f93ff6075dcde3ef842b23d498a12
Author: Haibin Lin <[email protected]>
Date:   Fri Jun 12 16:52:47 2020 -0700

    Remove XXOutput loss operators  (#18531)

    * remove xxOutput operators used in Module

    * remove SVMOutput

    * remove RegressionOutput in language binding

    * remove more examples

    * fix scala, perl

    * remove spark examples

    * remove softmaxoutput op

    * remove more tests

    * remove more SoftmaxOutput related code

    * remove MAERegression

    * remove symbol.Softmax

    * fix perl test count

    * fix failing tests

    * remove mlp cpu test

    * fix scala test

    * remove tests/examples relying on imagenet-1k pretrained symbolic models

    * fix scala build

    * remove MultiTaskSuite for scala

    * fix cpp build

    * fix scala, clojure test

    * fix scala and python test

    * fix scala and clojure test

    * remove clojure test

    * remove clojure test

    * remove test_forward for python

    * remove clj viz test

    * remove viz tests

    * remove clj tutorial test

    * remove bert test

    * remove clj tests

    * remove clj multi-label test

    * remove module mlp test for clj

    * remove module test for clj

    * rm ./contrib/clojure-package/test/org/apache/clojure_mxnet/ndarray_api_test.clj

    * remove clj tests

    * rm test_mkldnn_model

    Co-authored-by: EC2 Default User <[email protected]>
    Co-authored-by: Lin <[email protected]>

commit 1bf881f381f91b157a26d9beddcaa8f4960cc038
Author: Yang Shi <[email protected]>
Date:   Thu Jun 11 14:01:17 2020 -0700

    Fix Slow Site Loading Speed part2 (#18512)

    * host JQuery locally

    * defer time consuming scripts

    * defer more render-blocking script

    * move general version dropdown css from head to scss

    * update quotation mark

    * add cache control

    * add licenses info to jquery

    * remove jquery from github

    # Conflicts:
    #	docs/static_site/src/assets/js/jquery-3.3.1.min.js

    * load jquery based on env

    * update wget jquery command

    Co-authored-by: Yang Shi <[email protected]>

commit a361f33497c8e87a4eab48a666fcb4a586a607b1
Author: Manu Seth <[email protected]>
Date:   Thu Jun 11 09:17:44 2020 -0700

    revert changes causing cd failures (#18533)

    Reverting the following changes to cd_unittest_ubuntu that caused CD pipeline failures:

        The first change was using Naive Engine for operator tests, which causes timeout failures in CD
        Added here: 10b6b48

        Second change was running integrationtest_ubuntu_gpu_byteps as part of cu* CD tests, added here: e28e9fe

commit 743bbcbc7c8c85661a146d94ebd3196306650677
Author: Yijun Chen <[email protected]>
Date:   Thu Jun 11 23:22:56 2020 +0800

    unify impl (#18523)

commit fb73de7582de4e622299a4ad045e25f771568193
Author: Haibin Lin <[email protected]>
Date:   Wed Jun 10 19:54:25 2020 -0700

    remove mx.module.* APIs for MXNet 2.0 (#18525)

    * remove Module tests

    * remove APIs relying on module

    * remove docs and tools using mx.module

    * remove executor manager

    * remove ssd and ncf examples

    * add back grad compression api doc

    * fix lint

    * add back cpredict example

    * fix resnet memory test

    * remove tests

    * remove tests/python/tensorrt/test_tensorrt_lenet5.py since it depends on a model trained by mx.Module

    * skip flaky test

    * fix quantization test

    * remove subgraph tests

    Co-authored-by: EC2 Default User <[email protected]>
    Co-authored-by: Lin <[email protected]>

commit 26f44b71d8de84bbc88af496ae0aeb7ce535312d
Author: Serge Panev <[email protected]>
Date:   Wed Jun 10 10:41:50 2020 -0700

    Add backward Type inference to main NN operators (#18378)

    * Add backward Type inference to main DNN operators

    Signed-off-by: Serge Panev <[email protected]>

    * Add comments

    Signed-off-by: Serge Panev <[email protected]>

commit b6b40878f0aba2ba5509f3f3a4cd517a654847ce
Author: Leonard Lausen <[email protected]>
Date:   Tue Jun 9 22:05:16 2020 -0700

    Consolidate installation instructions on website and add disclaimer for non-ASF resources (#18487)

    * Update website with disclaimer for non-ASF resources

    * Integrate Windows instructions to build_from_source.md

    * Remove master version from selector

    * Update Download links

    * Update get_started/download.md per Release Download Page policy

commit cf3984bf5c67cb7d1feeb5b3cb55a41ca995e5c8
Author: Yiyan66 <[email protected]>
Date:   Wed Jun 10 05:56:13 2020 +0800

    [numpy] fix op repeat with list input (#18371)

    * except .h

    * except storage

    * repeat

    * change fwd

    * delete

    * codecov

    Co-authored-by: Ubuntu <[email protected]>

commit 028d01d5fb4867988a5ca50634562c1f4e75ca6f
Author: Sam Skalicky <[email protected]>
Date:   Mon Jun 8 10:42:09 2020 -0700

    Drop list support in optimize_for (#18483)

    * initial commit

    * fixed typos

    * changed warning to exception

    * updated subgraph_op unittests

commit 2d58ff5512e27e7a12ae9c9335d2554ee0b2bc1f
Author: JackieWu <[email protected]>
Date:   Tue Jun 9 01:41:35 2020 +0800

    [Bug Fixed] Fix batch norm when grad_req is `add` (#18500)

    * fix batch norm when fix_gamma is True

    * support gradient accumulation for batch norm

    * mkldnn batchnorm support grad add

    * unittest for bn

    * fix bn arg

    * fix lint

    * fix mkldnn

    * fix mkldnn bn

    * fix grad when fixing gamma

    * fix naive gpu bn

    * fix lint

    * fix cudnn bn

    * fix flag

    * fix lint

    * fix testcase

    * fix

    * use @pytest.mark.parametrize

    * combination

    * remove redundant test in batchnorm

    * npx.batch_norm test

    * try to fix test

    * reduce the number of tests for batchnorm

    * fix

commit 992ed3c1ea449fdb1f4f7010dfd05d00ae88a020
Author: Haibin Lin <[email protected]>
Date:   Mon Jun 8 10:39:56 2020 -0700

    remove mx.rnn APIs (#18507)

    * remove mx.rnn APIs

    * fix test

    * update test

    Co-authored-by: Ubuntu <[email protected]>
    Co-authored-by: Lin <[email protected]>

commit e3493e7b47ddcaa6974280ee432c82eb89d0f756
Author: Haibin Lin <[email protected]>
Date:   Sun Jun 7 18:20:46 2020 -0700

    remove tools dependent on mx.module APIs (#18508)

    * remove tools depending on mx.module

    * remove caffe converter and coreml tools

    Co-authored-by: Lin <[email protected]>

commit 5df002567dd2e9ebcfeb620a9ba55adbded743da
Author: Przemyslaw Tredak <[email protected]>
Date:   Fri Jun 5 19:55:06 2020 -0700

    Fix race condition in FusedOp (#18498)

commit a1db5b29451938e84ade0e768c3b93b8fd71ad15
Author: Leonard Lausen <[email protected]>
Date:   Fri Jun 5 16:40:22 2020 -0700

    Update .codecov.yml (#18497)

commit 644b69d01e5b037c3d7b0bd61d282f406c01b759
Author: Mosalam Ebrahimi <[email protected]>
Date:   Fri Jun 5 13:52:01 2020 -0700

    Fix typo (#18496)

commit deae9b88c1724e056a4e7dc21f04b58c28304111
Author: RuRo <[email protected]>
Date:   Fri Jun 5 23:18:16 2020 +0300

    Fix tests for ONNX version 1.5.0 bump (#18054)

    * implement onnx translation helpers

    * bump onnx version to 1.5

    * add export only test cases for topk and slice_axis

commit 4be095500de74ff95ed18ebdf695eae171375818
Author: ciyong <[email protected]>
Date:   Sat Jun 6 03:44:04 2020 +0800

    Julia: remove downloading of the non-ASF binary build (#18489)

commit 24d88a2cdec3e0ab8f4fe0e436eb0015e9ccfd47
Author: Manu Seth <[email protected]>
Date:   Fri Jun 5 09:45:31 2020 -0700

    Update Jetson installation guide (#18485)

    * add config Makefile for jetson

    * modify jetson install guide

commit 7054e42c0786a2b8223b5183b852f68e72822a76
Author: Manu Seth <[email protected]>
Date:   Fri Jun 5 09:40:44 2020 -0700

    Add image classification tutorial for jetson (#18434)

    * add image classification tutorial for jetson

    * update code to use gluon model zoo; update doc

    * referencing MXNet official website for Jetson installation guide

commit a156ed8e37e17f79cf0383dd9b0e1427309ad127
Author: Yang Shi <[email protected]>
Date:   Fri Jun 5 09:38:02 2020 -0700

    Revert installation dropdown change (#18488)

    This broke the version selector.

    Co-authored-by: Yang Shi <[email protected]>

commit b07152244c311b9270b448b6629f8ae470f3fab1
Author: Leonard Lausen <[email protected]>
Date:   Thu Jun 4 17:44:52 2020 -0700

    Update website instructions for compiling for / on Raspberry Pi. (#18472)

    * Update ci/README.md

    * Update raspberry pi instructions

commit e28e9fec9bba07708ed0093c882b8070a96dfdd5
Author: Haibin Lin <[email protected]>
Date:   Thu Jun 4 14:20:52 2020 -0700

    BytePS trainer + tests (#18032)

    * [MXNET-#16795] Byteps-KVStore: Integrate Byteps into mxnet as new type of kvstore backend (#17555)

    * Add Byteps backend for kvstore

    * Add a temp launcher for byteps backend

    * make the test fit for byteps kvstore.

    * final workable test

    * Remove trashy print and logs

    * correct comment

    * add hostfile for ci test

    * add ci test for byteps kvstore

    * add visible devices for byteps-kvstore ci test

    * add licenses for tools/byteps_launcher.py

    * syntax error

    * pylint error (remove unused import like logging)

    * pylint error

    * pylint error

    * enable launching without hostfile (local byteps)

    * 1. rename byteps_kvstore.py to byteps.py; 2. shorten the launch option  to ; 3. add instruction for -H and -SH options for launch; 4. add documentation for byteps kvstore in kvstore/base.py: create(name='local')

    * edit documentation of KVStoreBase::is_capable(capability); return false for BytePS(KVStoreBase):is_capable(any).

    * pylint error

    * remove an error of arg.byteps

    * use --env option to set workers' environment

    * error in byteps-launcher.py

    * remove the unintended editing mistake in runtime_functions.sh

    * disable cpu support for byteps kvstore.

    * 1. format the document to avoid julia doc build error;
    2. little change to nightly test;
    3. add byteps copyright declaration in byteps_launcher.py
    4. if args.byteps == True ===> if args.byteps

    * remove the --scheduler_ip and --scheduler_port options in launch.py

    * 1. maintain the origin value of broadcast and pushpull
    2. optimize when out = value or [out]=value
    3. add some missing documentation to avoid doc building error.

    * Add bytePS to CI

    * add dependency

    * +integrationtest_ubuntu_gpu_byteps

    * add byteps pipeline

    * disable a few tests

    * remove more tests

    * fix permission

    * remove apt-get

    * fix python path

    * improve logging

    * fix printns

    * add back CI

    Co-authored-by: Ubuntu <[email protected]>
    Co-authored-by: Piyush Ghai <[email protected]>
    Co-authored-by: eric-haibin-lin <[email protected]>
    Co-authored-by: eric-haibin-lin <--global>
    Co-authored-by: Lin <[email protected]>

    * fix byteps logging and declare tensor

    * check exceptions and return -1

    * print logging in CI

    * Update byteps.py

    * Update runtime_functions.sh

    * add numa dependency

    * pin dependency

    * Update runtime_functions.sh

    * Update Dockerfile.build.ubuntu

    * Update runtime_functions.sh

    * Update runtime_functions.sh

    * Update runtime_functions.sh

    * Update runtime_functions.sh

    * Update Jenkins_steps.groovy

    * remove launcher. use bpslauncher instead.

    Co-authored-by: Chaokun Chang <[email protected]>
    Co-authored-by: Ubuntu <[email protected]>
    Co-authored-by: Piyush Ghai <[email protected]>
    Co-authored-by: Lin <[email protected]>
    Co-authored-by: Ubuntu <[email protected]>
    Co-authored-by: EC2 Default User <[email protected]>
    Co-authored-by: Ubuntu <[email protected]>

commit 7cc6700fdd5e9f6837389155b63c2911652d2c91
Author: Yang Shi <[email protected]>
Date:   Thu Jun 4 13:29:08 2020 -0700

    Add Developer Guide Docs to MXNet Website (#18474)

    * init dev guide

    * move dev guide above FAQ

    * update format and images

    * hoist git docs and fix styles

    * use relative urls

    * remove useless code block

    * use consistent url and file name

    * update heading

    * add apache license header

    * init dev guide

    * move dev guide above FAQ

    * update format and images

    * hoist git docs and fix styles

    * use relative urls

    * remove useless code block

    * use consistent url and file name

    * update heading

    * add apache license header

    * update doc - git clone recursive

    * reviewing the dev guide - proof reading and text edits

    Co-authored-by: Yang Shi <[email protected]>
    Co-authored-by: Talia Chopra <[email protected]>
access2rohit added a commit to access2rohit/incubator-mxnet that referenced this pull request Jul 23, 2020
* adding comments explaining code optimizations

* fixing broadcast_axis kernel to int32

* fixing slice_axis kernel to int32

* combining CPU and GPU implementation method signatures and cleaned up
code

* adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
szha pushed a commit that referenced this pull request Jul 28, 2020
* Improving performance of broadcast_axis on GPU (#18168)

* adding separate int32_t kernel for GPU in broadcast_axis/to/like operators

* using structure instead of temp workspace to pass stride and shape

* replacing hardcoded int32_t with generic index_t

* combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

* Improve performance of broadcast_axis on CPU (#17882)

* adding comments explaining code optimizations

* fixing broadcast_axis kernel to int32

* fixing slice_axis kernel to int32

* combining CPU and GPU implementation method signatures and cleaned up
code

* adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <[email protected]>

Co-authored-by: Rohit Kumar Srivastava <[email protected]>
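The stride-caching idea behind the merged CPU/GPU kernels can be illustrated in plain Python. This is a rough sketch, not the actual MXNet kernel (`broadcast_to` here is a hypothetical helper): broadcast axes get a cached input stride of 0, so mapping a flat output index back to a flat input index needs only the precomputed strides, with no per-element inspection of the input shape.

```python
# Rough sketch (not the MXNet kernel) of stride caching for broadcast_axis:
# precompute one input stride per output axis, 0 on broadcast axes, so the
# inner loop reduces to coordinate decode + a stride dot product.

def broadcast_to(data, in_shape, out_shape):
    ndim = len(out_shape)
    # Cache input strides once, outside the element loop.
    in_strides = [0] * ndim
    stride = 1
    for ax in range(ndim - 1, -1, -1):
        in_strides[ax] = 0 if in_shape[ax] == 1 else stride
        stride *= in_shape[ax]
    out_size = 1
    for d in out_shape:
        out_size *= d
    out = [0] * out_size
    for i in range(out_size):
        rem, src = i, 0
        for ax in range(ndim - 1, -1, -1):
            rem, coord = divmod(rem, out_shape[ax])
            src += coord * in_strides[ax]  # stride 0 skips broadcast axes
        out[i] = data[src]
    return out

# broadcast a (2, 1) input to (2, 3)
print(broadcast_to([10, 20], (2, 1), (2, 3)))
# [10, 10, 10, 20, 20, 20]
```

In the real kernel the same precomputation lets adjacent output elements reuse cached strides instead of recomputing shape products per element, which is what reduces ALU work and keeps the loop vectorizable.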
chinakook pushed a commit to chinakook/mxnet that referenced this pull request Nov 17, 2020