This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

gluon improvement #8152

Closed
wants to merge 1 commit into from


@szha
Member

szha commented Oct 5, 2017

Changes include:

  1. Use a generator expression for the arguments when calling the cached op.
  2. Merge the two separate for loops that assemble parameters into one, so that the result of the parameter membership check in cargs is reused.
  3. Fill in shapes after parameter data is set, so that they are reflected in the block's __repr__.
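
Item 2 above can be illustrated with a minimal sketch. The function names and the `needed` set are hypothetical and are not taken from the Gluon source; the sketch only shows the general idea of doing the membership check once per parameter instead of once per loop.

```python
def assemble_two_loops(params, needed):
    """Two passes: each pass repeats the membership test."""
    names = [name for name in params if name in needed]
    data = [params[name] for name in params if name in needed]
    return names, data


def assemble_one_loop(params, needed):
    """One pass: the membership test runs once per parameter."""
    names, data = [], []
    for name, value in params.items():
        if name in needed:
            names.append(name)
            data.append(value)
    return names, data


# Hypothetical parameter dictionary and set of names the call needs.
params = {"weight": 1.0, "bias": 0.5, "unused": 9.9}
needed = {"weight", "bias"}

assert assemble_two_loops(params, needed) == assemble_one_loop(params, needed)
```

Both versions return the same result; the single loop simply avoids evaluating `name in needed` twice for every parameter.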

@szha szha requested a review from piiswrong October 5, 2017 03:45
@piiswrong
Contributor

Construction is only called once

assert fmt == self._in_format, "Invalid input format"
for i, j in self._in_idx:
    cargs[i] = args[j]
cargs = tuple(args[self._in_idx[i]] if self._in_idx[i] != -1 else v for i, v
Contributor

This doesn't look right

Member Author

This is a simple combination of the list comprehension and the for loop. What seems to be the problem?

Member Author

All tests passed.

Could you double-check whether it still doesn't look right? If so, what looks like the error, and what other tests should be added to catch it?
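
For readers following the thread, the two styles under discussion can be compared on toy data. This is only a sketch under assumptions: the diff line above is truncated, so the `enumerate(defaults)` part, the `-1` sentinel layout, and all names below are hypothetical stand-ins rather than the code in the PR.

```python
def fill_with_loop(defaults, args, pairs):
    """Loop style: copy the defaults, then overwrite the input slots."""
    cargs = list(defaults)
    for i, j in pairs:
        cargs[i] = args[j]
    return tuple(cargs)


def fill_with_expression(defaults, args, slot_to_arg):
    """Single-expression style: -1 marks a slot that keeps its default."""
    return tuple(
        args[slot_to_arg[i]] if slot_to_arg[i] != -1 else v
        for i, v in enumerate(defaults)
    )


defaults = ["w0", None, "w1", None]  # None marks slots fed by inputs
args = ["x", "y"]
pairs = [(1, 0), (3, 1)]             # (slot, input index)
slot_to_arg = [-1, 0, -1, 1]         # same mapping, indexed by slot

assert fill_with_loop(defaults, args, pairs) == ("w0", "x", "w1", "y")
assert fill_with_expression(defaults, args, slot_to_arg) == ("w0", "x", "w1", "y")
```

Note that the two styles need differently shaped index data (a list of pairs versus a per-slot list), which is one place where a mismatch could hide.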

@piiswrong
Contributor

What's the benefit of this change? This part is complicated and hard to debug, so unless there is a visible performance impact I think there is no reason to change it.

@mli
Contributor

mli commented Oct 5, 2017

Agreed that using a tuple and an iterator can reduce memory use. But Eric's point is valid: this PR puts too many things into a single expression (i.e. cargs = ...), which makes the code harder to read.

Maybe we can write a generator for cargs, which can further reduce memory, and break the single line into multiple lines to improve readability.

Also, is there any benchmark showing that this kind of modification improves things in a noticeable way?
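
The generator suggestion might look roughly like the sketch below. All names are hypothetical (`iter_cargs`, `slot_to_arg`, and the `call` stand-in are not from the PR), and the `-1` sentinel is assumed from the diff fragment earlier in the thread.

```python
def iter_cargs(defaults, args, slot_to_arg):
    """Yield call arguments one at a time instead of building a list."""
    for slot, default in enumerate(defaults):
        arg_index = slot_to_arg[slot]
        if arg_index == -1:
            yield default          # slot keeps its stored value
        else:
            yield args[arg_index]  # slot is fed by a positional input


def call(*values):
    """Stand-in for the cached op; returns what it received."""
    return values


result = call(*iter_cargs(["w0", None, "w1"], ["x"], [-1, 0, -1]))
assert result == ("w0", "x", "w1")
```

One caveat: star-unpacking an iterable into a call still builds the argument tuple, so the saving is limited to the intermediate list. The readability gain from splitting the logic across lines is the more certain benefit.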

@szha
Member Author

szha commented Oct 5, 2017

Thanks for the comments. I worked on this a bit more and simplified the assembly of the calling arguments into a single loop. I also renamed the field for readability. Here are the timing results:

workload:

import mxnet as mx
from mxnet import gluon
class example(gluon.HybridBlock):
    def __init__(self, num_args, num_params):
        super(example, self).__init__()
        self._num_args = num_args
        self._num_params = num_params
        for i in range(num_params):
            setattr(self, 'param%d'%i, self.params.get('param%d'%i, shape=(1,), init='ones'))
    def hybrid_forward(self, F, *args, **kwargs):
        return F.concat(*(a for a in (args+tuple(kwargs.values()))), dim=0)

num_args, num_params = 2, 50
net = example(num_args, num_params)
net.initialize()
net.hybridize()

And run:

%timeit -n 100 net(*(mx.nd.ones((1,)) for _ in range(num_args))).wait_to_read()

Results:

num_args, num_params = 100, 200
# before
100 loops, best of 3: 5.52 ms per loop
# after
100 loops, best of 3: 5.37 ms per loop

num_args, num_params = 2, 50
# before
100 loops, best of 3: 333 µs per loop
# after
100 loops, best of 3: 312 µs per loop
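
To put the timings above in relative terms, the fraction of time saved can be computed directly from the reported numbers. The helper name `relative_gain` is only for illustration.

```python
def relative_gain(before, after):
    """Fraction of time saved, relative to the 'before' timing."""
    return (before - after) / before


large = relative_gain(5.52, 5.37)  # ms per loop, num_args=100, num_params=200
small = relative_gain(333, 312)    # µs per loop, num_args=2, num_params=50

print(f"large: {large:.1%}, small: {small:.1%}")
# prints "large: 2.7%, small: 6.3%"
```

Both gains are in the single-digit percent range for this micro-benchmark.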

Also, I'm caching the result of collect_params now so that checkpointing will be faster.

@szha szha changed the title from "memory save" to "gluon cached op call improvement" Oct 6, 2017
@szha szha force-pushed the gluon_tuple branch 4 times, most recently from aa2f6ff to e2f3c17 Compare October 7, 2017 21:10
@piiswrong
Contributor

Checkpointing is not a bottleneck, and collecting parameters is unlikely to be a big cost during checkpointing.

This is overcomplicated. Optimization should be driven by profiling. Premature optimization is the source of all evil.

When a layer is changed, simply clear the cached op.
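
The "clear on change" approach suggested above can be sketched with a toy container. This is an illustration under assumptions, not Gluon code: `Container`, `add`, and `summary` are hypothetical names, and the cached tuple merely stands in for the cached op.

```python
class Container:
    """Toy container that caches a derived value until a child changes."""

    def __init__(self):
        self._children = []
        self._cached = None  # stands in for the cached op
        self.builds = 0      # counts how often the cache is rebuilt

    def add(self, child):
        self._children.append(child)
        self._cached = None  # any structural change drops the cache

    def summary(self):
        if self._cached is None:
            self.builds += 1
            self._cached = tuple(self._children)
        return self._cached


net = Container()
net.add("dense0")
net.summary()
net.summary()      # served from the cache, no rebuild
net.add("dense1")  # invalidates the cache
net.summary()
assert net.builds == 2
```

The appeal of this design is that there is no incremental bookkeeping to get wrong: a change discards the cache and the next call rebuilds it.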

@szha szha force-pushed the gluon_tuple branch 2 times, most recently from f304210 to 085b7cb Compare October 9, 2017 17:57
@szha
Member Author

szha commented Oct 9, 2017

Reverted per request and ready for review.

@szha szha mentioned this pull request Oct 11, 2017
7 tasks
@szha szha force-pushed the gluon_tuple branch 2 times, most recently from b973920 to 4f7980e Compare October 15, 2017 05:21
leezu added a commit to leezu/mxnet that referenced this pull request Oct 16, 2017
leezu added a commit to leezu/mxnet that referenced this pull request Oct 16, 2017
@szha szha closed this Oct 16, 2017
@szha szha deleted the gluon_tuple branch October 16, 2017 04:41
@szha szha restored the gluon_tuple branch October 16, 2017 04:41
@szha szha reopened this Oct 16, 2017
cjolivier01 pushed a commit that referenced this pull request Oct 17, 2017
* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from #8152

* Add tests from #8152
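
The registration problem described in this commit message can be reproduced in miniature. The sketch below is a toy model and not the MXNet implementation: the `Block` class here is a hypothetical stand-in, and the branch layout is only one way such an if / elif bug could be avoided.

```python
class Block:
    """Toy stand-in for a container that tracks child blocks."""

    def __init__(self):
        object.__setattr__(self, "children", [])

    def __setattr__(self, name, value):
        existing = self.__dict__.get(name)
        if isinstance(existing, Block) and isinstance(value, Block):
            # Replacing one child with another: swap it in place.
            self.children[self.children.index(existing)] = value
        elif isinstance(value, Block):
            # Also reached when the attribute previously held a non-Block
            # value such as None, which is the case the message describes.
            self.children.append(value)
        object.__setattr__(self, name, value)


class Net(Block):
    def __init__(self):
        super().__init__()
        self.body = None     # placeholder assigned first
        self.body = Block()  # must still be registered as a child


net = Net()
assert net.children == [net.body]
```

A version that only registered children when the attribute did not exist yet would leave `net.children` empty in this example.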
cjolivier01 pushed a commit to cjolivier01/mxnet that referenced this pull request Oct 18, 2017
* Fix Block not registering children

@szha szha changed the title from "gluon cached op call improvement" to "gluon improvement" Oct 22, 2017
@szha
Member Author

szha commented Oct 22, 2017

Added shape completion after loading parameters or finishing deferred init to this PR, since the shape-completion logic benefits from the refactoring of _finish_deferred_init here.
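
As a rough illustration of what shape completion means here, the sketch below fills unknown dimensions from the shape of the data that was actually loaded. It assumes that an unknown dimension is written as 0; the function name `complete_shape` and the error handling are hypothetical and not taken from the PR.

```python
def complete_shape(declared, actual):
    """Fill unknown dimensions (written as 0) from the loaded array's shape."""
    if len(declared) != len(actual):
        raise ValueError("rank mismatch: %r vs %r" % (declared, actual))
    completed = []
    for want, got in zip(declared, actual):
        if want not in (0, got):
            raise ValueError("shape mismatch: %r vs %r" % (declared, actual))
        completed.append(got)
    return tuple(completed)


# A layer declared with an unknown input size, then loaded with real data.
assert complete_shape((10, 0), (10, 20)) == (10, 20)
```

Once the shape is completed, a representation of the block can show the concrete sizes instead of placeholders.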

@cjolivier01
Member

Premature optimization is the root of ALL evil? Wow. :)

@szha
Member Author

szha commented Oct 22, 2017

yeah, thou shalt yield when Knuth is quoted.

cjolivier01 added a commit that referenced this pull request Oct 23, 2017
* CPU optimization for ActivationOp

Significant improvement on CPU (several orders of magnitude in some cases, especially on backward pass).
Very slight improvement on GPU.

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (#8125)

* v0.12 regression: Fix registration of children for Block (#8277)

* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from #8152

* Add tests from #8152

* Revert "[CMAKE] Fix windows cmake build" (#8311)

* Revert "Added my code signing key (#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (#8227)"

This reverts commit 1c1c788.

* fixed broken links. https was pointing to http for mxnet.io (#8300)

* Update rnn.md (#8320)

* fluent methods for missed ops (#8329)

* update ps lite (#8327)

* Fix unused type warning (#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (#8369)

* Allow test to converge (#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (#7988)

* [Perl] emulate Python zip() for Perl (#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (#8376)

Fix a typo in the example readme.
cjolivier01 added a commit to cjolivier01/mxnet that referenced this pull request Oct 23, 2017
* CPU optimization for ActivationOp

crazy-cat pushed a commit to crazy-cat/incubator-mxnet that referenced this pull request Oct 26, 2017
* Fix Block not registering children

crazy-cat pushed a commit to crazy-cat/incubator-mxnet that referenced this pull request Oct 26, 2017
* CPU optimization for ActivationOp

Significant improvement on CPU (several magnitudes of order in some cases, especially on backward pass).
Very slight improvement on GPU.

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (apache#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (apache#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (apache#8125)

* v0.12 regression: Fix registration of children for Block (apache#8277)

* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from apache#8152

* Add tests from apache#8152

* Revert "[CMAKE] Fix windows cmake build" (apache#8311)

* Revert "Added my code signing key (apache#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (apache#8227)"

This reverts commit 1c1c788.

* fixed broken links. https was pointing to http for mxnet.io (apache#8300)

* Update rnn.md (apache#8320)

* fluent methods for missed ops (apache#8329)

* update ps lite (apache#8327)

* Fix unused type warning (apache#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.
cjolivier01 added a commit that referenced this pull request Oct 28, 2017
* Fill optimizations

* Optimize IdentityCompute for CPU

* lint

* Fix unused type warning (#8316)

* remove unused variable

* CR comments

* CR comments

* Added _full operator

* Trigger build

* Trigger build

* Add _full to symbolic

* Merge conflict resolution fix

* lint

* Timing output for test_factorization_module when Verbose enabled (#8363)

* Timing output for test_factorization_module when Verbose enabled

* Trigger build

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (#8369)

* Allow test to converge (#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (#7988)

* [Perl] emulate Python zip() for Perl (#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (#8376)

Fix a typo in the example readme.

* Use omp_get_max_threads() when OMP_NUM_THREADS environment variable is set (#8379)

* CPU optimization for ActivationOp (#8296)

* CPU optimization for ActivationOp

Significant improvement on CPU (roughly 6x to 10x on the larger shapes in the timings below).
Very slight improvement on GPU.

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes
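
The CPU totals above can be turned into per-shape speedup ratios with a few lines of Python. This is only a sketch: the numbers are copied from the log above, and the names `OLD_CPU`, `NEW_CPU` and `cpu_speedups` are hypothetical helpers for illustration, not part of MXNet.

```python
# CPU totals in ms for 500 passes, copied from the timings above.
# Each value is (forward_ms, backward_ms), keyed by input shape.
OLD_CPU = {
    (1, 1, 28, 28): (18.948, 1.658),
    (1, 3, 28, 28): (57.973, 4.748),
    (50, 1, 18, 32): (703.446, 56.255),
    (50, 3, 18, 32): (2107.77, 168.483),
    (20, 3, 128, 128): (24122.2, 1908.7),
}
NEW_CPU = {
    (1, 1, 28, 28): (80.748, 1.176),
    (1, 3, 28, 28): (7.881, 2.181),
    (50, 1, 18, 32): (111.48, 5.408),
    (50, 3, 18, 32): (333.439, 21.331),
    (20, 3, 128, 128): (3429.19, 286.324),
}


def cpu_speedups(old, new):
    """Return {shape: (forward_ratio, backward_ratio)}.

    A ratio above 1 means the new timing is faster than the old one.
    """
    return {
        shape: (old[shape][0] / new[shape][0], old[shape][1] / new[shape][1])
        for shape in old
    }


SPEEDUPS = cpu_speedups(OLD_CPU, NEW_CPU)

for shape, (fwd, bwd) in SPEEDUPS.items():
    print("shape={}: forward {:.2f}x, backward {:.2f}x".format(list(shape), fwd, bwd))
```

On these numbers the three batched shapes come out at roughly 6x to 10x, while the forward pass for shape [1,1,28,28] is slower in the new run. The log does not say whether that outlier is warm-up noise or a real regression.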

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (#8125)

* v0.12 regression: Fix registration of children for Block (#8277)

* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from #8152

* Add tests from #8152

* Revert "[CMAKE] Fix windows cmake build" (#8311)

* Revert "Added my code signing key (#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (#8227)"

This reverts commit 1c1c788.

* fixed broken links. https was pointing to http for mxnet.io (#8300)

* Update rnn.md (#8320)

* fluent methods for missed ops (#8329)

* update ps lite (#8327)

* Fix unused type warning (#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (#8369)

* Allow test to converge (#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (#7988)

* [Perl] emulate Python zip() for Perl (#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (#8376)

Fix a typo in the example readme.

* Fix GPU copy

* Remove duplicate

* Trigger build
cjolivier01 added a commit that referenced this pull request Oct 28, 2017
* Memory set/copy speed assertions

* Memory set/copy speed assertions

* ..

* ..

* ..

* ..

* bounce some cache

* lint

* Timing output for test_factorization_module when Verbose enabled (#8363)

* Timing output for test_factorization_module when Verbose enabled

* Trigger build

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (#8369)

* Allow test to converge (#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (#7988)

* [Perl] emulate Python zip() for Perl (#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (#8376)

Fix a typo in the example readme.

* Use omp_get_max_threads() when OMP_NUM_THREADS environment variable is set (#8379)

* CPU optimization for ActivationOp (#8296)

* CPU optimization for ActivationOp

Significant improvement on CPU (roughly 6x to 10x on the larger shapes in the timings below).
Very slight improvement on GPU.

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (#8125)

* v0.12 regression: Fix registration of children for Block (#8277)

* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from #8152

* Add tests from #8152

* Revert "[CMAKE] Fix windows cmake build" (#8311)

* Revert "Added my code signing key (#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (#8227)"

This reverts commit 1c1c788.

* fixed broken links. https was pointing to http for mxnet.io (#8300)

* Update rnn.md (#8320)

* fluent methods for missed ops (#8329)

* update ps lite (#8327)

* Fix unused type warning (#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (#8369)

* Allow test to converge (#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (#7988)

* [Perl] emulate Python zip() for Perl (#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (#8376)

Fix a typo in the example readme.

* do gtest test

* add assert and do higher runs as performance test only (when performance test flag set)

* Trigger build

* lint

* Trigger build

* Sparse operator performance improvement (#8412)

* sparse rsprsp perf improvements

* Clean up

* dtype default to source_array.dtype for sparse ndarrays (#8403)

* derive default dtype/ctx from input for sparse ndarrays

* add gpu tests

* fix lint. add doc

* remove default_ctx code

* bug fix when passing dtype to array()

* update doc

* remove extra line

* also check ctx

* fix using default mean pixels (#8352)

* fix gluon.data.RecordFileDataset (#8353)

* upgrade MKL (#8378)

* Lint fix (#8402)

* Trigger build
@szha szha force-pushed the gluon_tuple branch 5 times, most recently from 665555d to 8f4b085 Compare November 2, 2017 02:48
mapping = ('{_input_size} -> {_hidden_size}'.format(**self.__dict__) if self._input_size
else self._hidden_size)
shape = self.i2h_weight[0].shape
mapping = ('{0} -> {1}'.format(shape[1], shape[0]) if shape[1]

It should show None -> xx when shape is not defined yet
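
One way to get the behaviour requested in the comment above, as a minimal sketch. It assumes the weight shape is `(hidden_size, input_size)` as in the snippet, and that an input size that has not been inferred yet is stored as 0 (or None). `format_mapping` is a hypothetical helper for illustration, not a Gluon function.

```python
def format_mapping(shape):
    """Build the 'input -> hidden' string used in a block's repr.

    `shape` is (hidden_size, input_size). An input size of 0 or None is
    treated as "not defined yet" and rendered as None.
    """
    hidden_size, input_size = shape
    shown_input = input_size if input_size else None
    return '{0} -> {1}'.format(shown_input, hidden_size)


# Shape known: both sizes are printed.
print(format_mapping((100, 50)))
# Shape not yet inferred: the input side is shown as None.
print(format_mapping((100, 0)))
```

Keeping the arrow in both cases means the repr has the same layout before and after the shape is inferred, which is what the review comment asks for.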

@szha szha closed this Nov 3, 2017
@szha szha deleted the gluon_tuple branch December 15, 2017 20:23
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Fill optimizations

* Optimize IdentityCompute for CPU

* lint

* Fix unused type warning (apache#8316)

* remove unused variable

* CR comments

* CR comments

* Added _full operator

* Trigger build

* Trigger build

* Add _full to symbolic

* Merge conflict resolution fix

* lint

* Timing output for test_factorization_module when Verbose enabled (apache#8363)

* Timing output for test_factorization_module when Verbose enabled

* Trigger build

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.

* Use omp_get_max_threads() when OMP_NUM_THREADS environment variable is set (apache#8379)

* CPU optimization for ActivationOp (apache#8296)

* CPU optimization for ActivationOp

Significant improvement on CPU (roughly 6x to 10x on the larger shapes in the timings below).
Very slight improvement on GPU.

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (apache#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (apache#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (apache#8125)

* v0.12 regression: Fix registration of children for Block (apache#8277)

* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from apache#8152

* Add tests from apache#8152

* Revert "[CMAKE] Fix windows cmake build" (apache#8311)

* Revert "Added my code signing key (apache#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (apache#8227)"

This reverts commit 1c1c788.

* fixed broken links. https was pointing to http for mxnet.io (apache#8300)

* Update rnn.md (apache#8320)

* fluent methods for missed ops (apache#8329)

* update ps lite (apache#8327)

* Fix unused type warning (apache#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.

* Fix GPU copy

* Remove duplicate

* Trigger build
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Memory set/copy speed assertions

* Memory set/copy speed assertions

* ..

* ..

* ..

* ..

* bounce some cache

* lint

* Timing output for test_factorization_module when Verbose enabled (apache#8363)

* Timing output for test_factorization_module when Verbose enabled

* Trigger build

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.

* Use omp_get_max_threads() when OMP_NUM_THREADS environment variable is set (apache#8379)

* CPU optimization for ActivationOp (apache#8296)

* CPU optimization for ActivationOp

Significant improvement on CPU (roughly 6x to 10x on the larger shapes in the timings below).
Very slight improvement on GPU.

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes
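
The CPU figures above can be compared shape by shape. The following is a minimal sketch, not part of the original commit: it copies the per-pass averages (in ms) from the CPU timings listed above and divides old by new, so values above 1 mean the new approach was faster in that run. The variable and function names are hypothetical, and a single run per configuration says nothing about variance or warm-up effects (note that the first forward measurement is slower in the new run).

```python
# Per-pass averages in ms, copied from the CPU timings above.
# All names here are illustrative; they do not come from the MXNet code base.
shapes = ["[1,1,28,28]", "[1,3,28,28]", "[50,1,18,32]", "[50,3,18,32]", "[20,3,128,128]"]

old_forward = [0.037896, 0.115946, 1.40689, 4.21554, 48.2443]
new_forward = [0.161496, 0.015762, 0.22296, 0.666878, 6.85837]

old_backward = [0.003316, 0.009496, 0.11251, 0.336966, 3.8174]
new_backward = [0.002352, 0.004362, 0.010816, 0.042662, 0.572648]


def speedups(old, new):
    """Return old/new ratios; a ratio above 1 means the new timing is lower."""
    return [o / n for o, n in zip(old, new)]


forward_speedup = speedups(old_forward, new_forward)
backward_speedup = speedups(old_backward, new_backward)

for shape, fwd, bwd in zip(shapes, forward_speedup, backward_speedup):
    print(f"{shape:>16}  forward x{fwd:.2f}  backward x{bwd:.2f}")
```

The same ratio can be computed for the GPU figures, where the old and new timings are close to each other.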

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (apache#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (apache#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (apache#8125)

* v0.12 regression: Fix registration of children for Block (apache#8277)

* Fix Block not registering children

If the attribute was already set to something different than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from apache#8152

* Add tests from apache#8152

* Revert "[CMAKE] Fix windows cmake build" (apache#8311)

* Revert "Added my code signing key (apache#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (apache#8227)"

This reverts commit 1c1c788.

* fixed broken links. https was pointing to http for mxnet.io (apache#8300)

* Update rnn.md (apache#8320)

* fluent methods for missed ops (apache#8329)

* update ps lite (apache#8327)

* Fix unused type warning (apache#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.

* do gtest test

* add assert and do higher runs as performance test only (when performance test flag set)

* Trigger build

* lint

* Trigger build

* Sparse operator performance improvement (apache#8412)

* sparse rsprsp perf improvements

* Clean up

* dtype default to source_array.dtype for sparse ndarrays (apache#8403)

* derive default dtype/ctx from input for sparse ndarrays

* add gpu tests

* fix lint. add doc

* remove default_ctx code

* bug fix when passing dtype to array()

* update doc

* remove extra line

* also check ctx

* fix using default mean pixels (apache#8352)

* fix gluon.data.RecordFileDataset (apache#8353)

* upgrade MKL (apache#8378)

* Lint fix (apache#8402)

* Trigger build