CPU optimization for ActivationOp (apache#8296)
* CPU optimization for ActivationOp

Significant improvement on CPU (several orders of magnitude in some cases, especially on the backward pass).
Very slight improvement on GPU.
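
For context, a minimal standalone sketch of the pattern this change moves to: instead of evaluating an mshadow 2-D expression, the operator now launches a flat element-wise kernel. The names (Kernel::Launch, backward_grad) mirror the diff below, but the bodies here are simplified assumptions rather than the actual MXNet sources, and sigmoid stands in for an arbitrary ForwardOp/BackwardOp pair.

// Simplified sketch only; the real code is in src/operator/mxnet_op.h and src/operator/activation-inl.h.
#include <cmath>
#include <cstddef>
#include <vector>

struct sigmoid {              // stand-in for ForwardOp
  static float Map(float x) { return 1.0f / (1.0f + std::exp(-x)); }
};

struct sigmoid_grad {         // stand-in for BackwardOp: gradient written in terms of the saved output y
  static float Map(float y) { return y * (1.0f - y); }
};

// Mirrors mxnet_op::backward_grad: input grad = output grad * GRAD_OP(saved values...)
template <typename GRAD_OP>
struct backward_grad {
  template <typename... Args>
  static float Map(float ograd, Args... args) { return ograd * GRAD_OP::Map(args...); }
};

// Mirrors mxnet_op::Kernel<OP, cpu>::Launch: one parallel loop over all elements.
template <typename OP>
struct Kernel {
  template <typename... Args>
  static void Launch(std::size_t n, float* out, const Args*... args) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i) {
      out[i] = OP::Map(args[i]...);  // forward: sigmoid(in[i]); backward: ograd[i] * y[i] * (1 - y[i])
    }
  }
};

int main() {
  std::vector<float> in(1024, 0.5f), out(1024), ograd(1024, 1.0f), igrad(1024);
  Kernel<sigmoid>::Launch(in.size(), out.data(), in.data());                // forward pass
  Kernel<backward_grad<sigmoid_grad>>::Launch(in.size(), igrad.data(),
                                              ograd.data(), out.data());   // backward pass
  return 0;
}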

OLD MSHADOW APPROACH
--------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 18.948 ms, avg: 0.037896 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.658 ms, avg: 0.003316 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 57.973 ms, avg: 0.115946 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 4.748 ms, avg: 0.009496 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 703.446 ms, avg: 1.40689 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 56.255 ms, avg: 0.11251 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 2107.77 ms, avg: 4.21554 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 168.483 ms, avg: 0.336966 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 24122.2 ms, avg: 48.2443 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1908.7 ms, avg: 3.8174 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.637 ms, avg: 0.003274 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.665 ms, avg: 0.00333 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.562 ms, avg: 0.003124 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.661 ms, avg: 0.003322 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.635 ms, avg: 0.00327 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.702 ms, avg: 0.003404 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.83 ms, avg: 0.00366 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.041 ms, avg: 0.004082 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.08 ms, avg: 0.00416 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.688 ms, avg: 0.005376 ms X 500 passes

NEW MXNET_OP APPROACH
---------------------

CPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator CPU:  Timing [Forward] 80.748 ms, avg: 0.161496 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 1.176 ms, avg: 0.002352 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator CPU:  Timing [Forward] 7.881 ms, avg: 0.015762 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 2.181 ms, avg: 0.004362 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator CPU:  Timing [Forward] 111.48 ms, avg: 0.22296 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 5.408 ms, avg: 0.010816 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator CPU:  Timing [Forward] 333.439 ms, avg: 0.666878 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 21.331 ms, avg: 0.042662 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator CPU:  Timing [Forward] 3429.19 ms, avg: 6.85837 ms X 500 passes
Activation Operator CPU:  Timing [Backward] 286.324 ms, avg: 0.572648 ms X 500 passes

GPU
===

Timing: 50 iterations of 10 calls, shape = [1,1,28,28]
Activation Operator GPU:  Timing [Forward] 1.618 ms, avg: 0.003236 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.671 ms, avg: 0.003342 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [1,3,28,28]
Activation Operator GPU:  Timing [Forward] 1.629 ms, avg: 0.003258 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.728 ms, avg: 0.003456 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,1,18,32]
Activation Operator GPU:  Timing [Forward] 1.753 ms, avg: 0.003506 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.756 ms, avg: 0.003512 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [50,3,18,32]
Activation Operator GPU:  Timing [Forward] 1.704 ms, avg: 0.003408 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 1.791 ms, avg: 0.003582 ms X 500 passes

Timing: 50 iterations of 10 calls, shape = [20,3,128,128]
Activation Operator GPU:  Timing [Forward] 2.032 ms, avg: 0.004064 ms X 500 passes
Activation Operator GPU:  Timing [Backward] 2.143 ms, avg: 0.004286 ms X 500 passes

* lint

* Trigger build

* Trigger build

* Negative begin and end support for csr slice (apache#8241)

* negative index support for sparse slice

* fix lint

* getitem(int) for csr ndarray, support a[-1]

* remove unnecessary argument

* unittest and doc update

* Preparing for 0.12.0.rc0: Final changes before RC (apache#8301)

* Final changes before RC

* Updates to NEWS.md

* Updates

* Enable smoothing in softmax operator (apache#8125)

* v0.12 regression: Fix registration of children for Block (apache#8277)

* Fix Block not registering children

If the attribute was already set to something other than Block (e.g. None),
it was not being registered.

* fix if / elif for block children registration

* trigger test

* Add fix from apache#8152

* Add tests from apache#8152

* Revert "[CMAKE] Fix windows cmake build" (apache#8311)

* Revert "Added my code signing key (apache#8293)"

This reverts commit 22ab185.

* Revert "[CMAKE] Fix windows cmake build (apache#8227)"

This reverts commit 1c1c788.

* fixed broken links. https was pointing to http for mxnet.io (apache#8300)

* Update rnn.md (apache#8320)

* fluent methods for missed ops (apache#8329)

* update ps lite (apache#8327)

* Fix unused type warning (apache#8316)

* Trigger build

* Trigger build

* Misc fixes for sparse distributed training (apache#8345)

* remove mshadow::range in init_op.h

* add unit test

* remove pass by ptr, add unit test for pull empty weights

* fix range in key partition

* remove wrong comment

* remove change for partition

* remove unused var

* add int64 to arange. add checkpointing example

* Fix the Readme (apache#8369)

* Allow test to converge (apache#8351)

* Allow test to converge

* Trigger build

* Trigger build

* Trigger build

* Update cudnn_algoreg-inl.h (apache#7988)

* [Perl] emulate Python zip() for Perl (apache#8192)

* [Perl] emulate Python zip() for Perl

* [Perl] retool zip() uses away from the callback form

* add profile option for frontend profiling to image script (apache#8171)

* add profile option for frontend profiling to image script

* Update image_classification.py

* Update image_classification.py

* Fix Typo (classification) (apache#8376)

Fix a typo in the example readme.
cjolivier01 authored and crazy-cat committed Oct 26, 2017
1 parent 2455b4f commit 00a0819
Showing 9 changed files with 557 additions and 29 deletions.
35 changes: 27 additions & 8 deletions src/operator/activation-inl.h
@@ -22,6 +22,7 @@
* \brief Activation operator
* \author Bing Xu
*/

#ifndef MXNET_OPERATOR_ACTIVATION_INL_H_
#define MXNET_OPERATOR_ACTIVATION_INL_H_

@@ -34,6 +35,7 @@
#include <vector>
#include <utility>
#include "./operator_common.h"
#include "./mxnet_op.h"

namespace mxnet {
namespace op {
@@ -75,9 +77,16 @@ class ActivationOp : public Operator {
CHECK_EQ(in_data.size(), 1U);
CHECK_EQ(out_data.size(), 1U);
Stream<xpu> *s = ctx.get_stream<xpu>();
Tensor<xpu, 2, DType> data = in_data[activation::kData].FlatTo2D<xpu, DType>(s);
Tensor<xpu, 2, DType> out = out_data[activation::kOut].FlatTo2D<xpu, DType>(s);
Assign(out, req[activation::kOut], F<ForwardOp>(data));
const TBlob& input = in_data[activation::kData];
const size_t sz = input.shape_.Size();
if (sz) {
MXNET_ASSIGN_REQ_SWITCH(req[activation::kOut], Req, {
mxnet_op::Kernel<mxnet_op::op_with_req<ForwardOp, Req>, xpu>::Launch(
s, sz,
out_data[activation::kOut].dptr<DType>(),
input.dptr<DType>());
});
}
}

virtual void Backward(const OpContext &ctx,
@@ -93,14 +102,24 @@
CHECK(in_data.size() == 1 && in_grad.size() == 1);
CHECK_EQ(req.size(), 1U);
Stream<xpu> *s = ctx.get_stream<xpu>();
Tensor<xpu, 2, DType> m_out_grad = out_grad[activation::kOut].FlatTo2D<xpu, DType>(s);
Tensor<xpu, 2, DType> m_out_data = out_data[activation::kOut].FlatTo2D<xpu, DType>(s);
Tensor<xpu, 2, DType> m_in_grad = in_grad[activation::kData].FlatTo2D<xpu, DType>(s);
Assign(m_in_grad, req[activation::kData], F<BackwardOp>(m_out_data) * m_out_grad);
const TBlob& m_out_grad = out_grad[activation::kOut];
const TBlob& m_out_data = out_data[activation::kOut];
const TBlob& m_in_grad = in_grad[activation::kData];
const size_t sz = m_out_data.shape_.Size();
if (sz) {
MXNET_ASSIGN_REQ_SWITCH(req[activation::kData], Req, {
mxnet_op::Kernel<mxnet_op::op_with_req<
mxnet::op::mxnet_op::backward_grad<BackwardOp>, Req>, xpu>::Launch(
s, sz,
m_in_grad.dptr<DType>(),
m_out_grad.dptr<DType>(),
m_out_data.dptr<DType>());
});
}
}
}; // class ActivationOp

// Decalre Factory function, used for dispatch specialization
// Declare Factory function, used for dispatch specialization
template<typename xpu>
Operator* CreateOp(ActivationParam type, int dtype, const TShape& dshape);

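One detail worth calling out in the new Forward/Backward above: the runtime write request (req) is lifted to a compile-time template parameter via MXNET_ASSIGN_REQ_SWITCH before the kernel launch, so the store variant is known at compile time inside the kernel. A rough sketch of the per-element behaviour, assuming the enum values match MXNet's OpReqType (illustrative only; the real dispatch is the MXNET_ASSIGN_REQ_SWITCH macro used above together with op_with_req in mxnet_op.h):

// Illustrative sketch; not the actual MXNet source.
enum OpReqType { kNullOp, kWriteTo, kWriteInplace, kAddTo };

struct relu { static float Map(float x) { return x > 0.0f ? x : 0.0f; } };

template <typename OP, int Req>
struct op_with_req_sketch {
  static void Map(int i, float* out, const float* in) {
    if (Req == kAddTo)        out[i] += OP::Map(in[i]);  // accumulate into an existing buffer
    else if (Req != kNullOp)  out[i]  = OP::Map(in[i]);  // kWriteTo / kWriteInplace: plain overwrite
    // kNullOp: no write at all (the branch folds away since Req is a compile-time constant)
  }
};

int main() {
  float in[4] = {-1.0f, 0.0f, 1.0f, 2.0f}, out[4] = {0.0f, 0.0f, 0.0f, 0.0f};
  for (int i = 0; i < 4; ++i) op_with_req_sketch<relu, kWriteTo>::Map(i, out, in);
  return 0;
}

In the kernels above, Req comes from req[activation::kOut] in Forward and req[activation::kData] in Backward.
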
14 changes: 14 additions & 0 deletions src/operator/mxnet_op.h
@@ -215,6 +215,20 @@ struct set_zero {
}
};

/*! \brief Binary op backward gradient OP wrapper */
template<typename GRAD_OP>
struct backward_grad {
/* \brief Backward calc with grad
* \param a - output grad
* \param args... - data to grad calculation op (what this is -- input, output, etc. -- varies)
* \return input grad
*/
template<typename DType, typename ...Args>
MSHADOW_XINLINE static DType Map(DType a, Args... args) {
return DType(a * GRAD_OP::Map(args...));
}
};

/*! \brief Select assignment operation based upon the req value
* Also useful for mapping mshadow Compute (F<OP>) to Kernel<OP>::Launch
*/
48 changes: 38 additions & 10 deletions tests/cpp/include/test_op.h
@@ -100,7 +100,8 @@ class BasicOperatorData {
#endif
, initializeForward_(0) // unit testing may call inits in any order based
, initializeBackward_(0) // upon its use-case (ie may not want to run forward pass first)
, initializeCallback_(0) {
, initializeCallback_(0)
, generator_(new std::mt19937()) {
opContext_.is_train = true;
opContext_.run_ctx.stream = nullptr;

@@ -123,10 +124,14 @@
shape_input_vec_.resize(opProp.ListArguments().size());
op_.reset(opProp.CreateOperatorEx(getContext(), &shape_input_vec_, in_type));
if (op_) {
const size_t output_count = opProp.ListOutputs().size();
const size_t aux_count = opProp.ListAuxiliaryStates().size();
// Figure out what sort of blobs we need to allocate
std::vector<TShape> out_shape, aux_shape;
out_shape.resize(output_count);
aux_shape.resize(aux_count);
opProp.InferShape(&shape_input_vec_, &out_shape, &aux_shape);
std::vector<int> out_type, aux_type;
std::vector<int> out_type(output_count, -1), aux_type(aux_count, -1);
opProp.InferType(in_type, &out_type, &aux_type);

// Allocate top blobs (input)
Expand Down Expand Up @@ -174,9 +179,9 @@ class BasicOperatorData {
initForward(opProp, in_type);
if (!initializeBackward_++) {
for (size_t x = 0, n = static_cast<size_t>(opProp.NumVisibleOutputs()); x < n; ++x) {
CHECK_LT(x, c_.blob_input_vec_.size());
allocateBlob(&c_.blob_out_grad_, c_.blob_input_vec_[x].shape_,
false, c_.blob_input_vec_[x].type_flag_);
CHECK_LT(x, c_.blob_output_vec_.size());
allocateBlob(&c_.blob_out_grad_, c_.blob_output_vec_[x].shape_,
false, c_.blob_output_vec_[x].type_flag_);
}

for (size_t x = 0, n = c_.blob_input_vec_.size(); x < n; ++x) {
@@ -197,6 +202,7 @@

/*! \brief Run operator forward */
void forward(const size_t count = 1) {
const std::vector<OpReqType> req(c_.blob_output_vec_.size(), kWriteTo);
// Possibly move data to/from CPU and GPU (outside of timing scope)
MXNET_CUDA_ONLY(std::unique_ptr<GPUOpData> gpuData(isGPU_ ?
new GPUOpData(c_, &opContext_) : nullptr));
@@ -206,15 +212,15 @@
for (size_t x = 0; x < count; ++x) {
op()->Forward(opContext_,
c_.blob_input_vec_,
{kWriteTo, kWriteTo, kWriteTo},
req,
c_.blob_output_vec_,
c_.blob_aux_states_);
}
} else {
for (size_t x = 0; x < count; ++x) {
MXNET_CUDA_ONLY(op()->Forward(opContext_,
gpuData->blob_input_vec_,
{kWriteTo, kWriteTo, kWriteTo},
req,
gpuData->blob_output_vec_,
gpuData->blob_aux_states_));
}
@@ -223,6 +229,7 @@

/*! \brief Run operator backwards */
void backward(const size_t count = 1) {
const std::vector<OpReqType> req(c_.blob_output_vec_.size(), kWriteTo);
// Possibly move data to/from CPU and GPU (outside of timing scope)
MXNET_CUDA_ONLY(std::unique_ptr<GPUOpData> gpuData(isGPU_ ?
new GPUOpData(c_, &opContext_) : nullptr));
@@ -234,7 +241,7 @@
c_.blob_out_grad_,
c_.blob_input_vec_,
c_.blob_output_vec_,
{kWriteTo, kWriteTo, kWriteTo},
req,
c_.blob_in_grad_,
c_.blob_aux_states_);
}
@@ -244,7 +251,7 @@
gpuData->blob_out_grad_,
gpuData->blob_input_vec_,
gpuData->blob_output_vec_,
{kWriteTo, kWriteTo, kWriteTo},
req,
gpuData->blob_in_grad_,
gpuData->blob_aux_states_));
}
@@ -386,6 +393,21 @@ class BasicOperatorData {
copy(blob, sourceData, 0, sourceDataSize);
}

void FillRandom() {
std::uniform_real_distribution<DType> distribution(-1.0, 1.0);
for (size_t j = 0, jn = this->c_.all_blob_vects_.size(); j < jn; ++j) {
std::vector<TBlob> *data_vect = this->c_.all_blob_vects_[j];
if (data_vect) {
for (size_t i = 0, n = data_vect->size(); i < n; ++i) {
TBlob &blob = (*data_vect)[i];
test::patternFill<DType>(&blob, [this, &distribution]() -> DType {
return distribution(generator());
});
}
}
}
}

/*! \brief Input and output blobs */
OpContext opContext_;

@@ -520,6 +542,9 @@ class BasicOperatorData {
return allocateBlob(&standalone_blobs_, dest, shape, isGPU, dtype);
}

/*! \brief mt19937 generator for random number generator */
std::mt19937& generator() { return *generator_; }

/*! \brief Performance timing categories */
enum TimingId {
Forward,
@@ -539,6 +564,9 @@
/*! \brief scoped lifecycle management of allocated blobs */
std::list<std::unique_ptr<test::StandaloneBlob>> standalone_blobs_;

/*! \brief Per-test generator */
std::unique_ptr<std::mt19937> generator_;

public:
/*! Timing instrumentation */
test::perf::TimingInstrument timing_;
@@ -675,7 +703,7 @@ class Validator {
}
const TBlob& b1 = bv1[idx];
const TBlob& b2 = bv2[idx];
if (print && test::debugOutput) {
if (print && test::debug_output) {
test::print(RunContext(), &(std::cout << "Blob 1:"), b1, true, true);
test::print(RunContext(), &(std::cout << "Blob 2:"), b2, true, true);
}
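The test-harness changes above add a per-test std::mt19937 (generator_) and a FillRandom() helper that fills every blob with uniformly distributed values through test::patternFill. A standalone sketch of the same idiom, with a hypothetical fill_with helper standing in for patternFill (assumed names, not the harness API):

#include <functional>
#include <random>
#include <vector>

// Hypothetical stand-in for test::patternFill: call the generator once per element.
template <typename DType>
void fill_with(std::vector<DType>* buf, const std::function<DType()>& gen) {
  for (auto& v : *buf) v = gen();
}

int main() {
  std::mt19937 generator;                                          // per-test engine, like generator_
  std::uniform_real_distribution<float> distribution(-1.0f, 1.0f);
  std::vector<float> blob(28 * 28);
  fill_with<float>(&blob, [&]() -> float { return distribution(generator); });
  return 0;
}

In the harness itself, FillRandom() passes the same kind of lambda to test::patternFill for each blob in all_blob_vects_.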
