This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Parallelization for ROIpooling OP #9958

Merged: 5 commits merged into apache:master from xinyu-intel:roipooling on Mar 9, 2018

Conversation

xinyu-intel (Contributor)

Description

What's the problem?

ROIPooling is used in Faster-RCNN. It consumes a lot of time in the inference path because the current implementation (here) is not parallelized.

One profiling result is shown below:

OP                  Time (ms)   ms/call       Calls
ROIPooling          38718.288   1548.73152    25
Convolution         9963.628    3.724720748   2675
Reshape             0.677       0.00677       100
Activation          1089.6      0.427294118   2550
SoftmaxActivation   43.279      1.73116       25
add_n               1664.403    2.017458182   825
Pooling             68.531      1.37062       50
_contrib_Proposal   351.051     14.04204      25
softmax             0.658       0.02632       25
FullyConnected      29.328      0.58656       50
BatchNorm           2865.261    1.123631765   2550

What we have tried

We (@pengzhao-intel, @TaoLv) have parallelized this pooling algorithm with OpenMP directives and achieved a 20x+ performance improvement.

Faster-RCNN
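
For intuition, here is a minimal sketch of the approach (illustrative names and a simplified pooling body, not the exact PR code): each channel's base offset is computed directly from the loop index, so the iterations are independent and OpenMP can split them across threads.

#include <omp.h>

// Minimal sketch of the parallelization idea (illustrative, not the PR code).
void pool_channels(const float* batch_data, float* top_data,
                   int channels, int data_size_c, int out_size_c) {
  #pragma omp parallel for
  for (int c = 0; c < channels; ++c) {
    // Base offsets derived from c directly: no dependency between iterations.
    const float* in_c = batch_data + static_cast<long>(c) * data_size_c;
    float* out_c = top_data + static_cast<long>(c) * out_size_c;
    for (int i = 0; i < out_size_c; ++i) {
      out_c[i] = in_c[i];  // stand-in for the max-pooling over each ROI bin
    }
  }
}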

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Parallelization for ROIpooling

for (int c = 0; c < channels_; ++c) {
// Increment all data pointers
batch_data += c * data.size(2) * data.size(3);
Member:

please cache the xx.size(2) * xx.size(3) result outside of the loops since it is invariant, so as to perform fewer multiplies.

Member:

if you do that, you can just keep adding that number on each pass of c, and then you don’t need to do a multiply for these at all

top_data += out.size(2) * out.size(3);
argmax_data += max_idx.size(2) * max_idx.size(3);
// Decrement all data pointers
batch_data -= c * data.size(2) * data.size(3);
Member:

same here but subtract

@@ -74,7 +79,13 @@ inline void ROIPoolForward(const Tensor<cpu, 4, Dtype> &out,

const Dtype* batch_data = bottom_data + data_size * roi_batch_ind;

#pragma omp parallel for firstprivate(batch_data, top_data, argmax_data)
Member:

curious why it’s not “parallel for” pragma?

Member:

Do you mean why there is a firstprivate clause here?

Member:

Oh, I totally missed the 'for' in the pragma. Sorry.
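
For context on the clause itself: firstprivate(batch_data, top_data, argmax_data) gives each thread its own copy of those pointers, initialized to the values computed for the current ROI just before the parallel region. A minimal standalone illustration (hypothetical example, not code from this PR):

#include <omp.h>
#include <cstdio>

int main() {
  int data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  int* p = data + 2;  // advanced before the parallel region, like batch_data

  // firstprivate: each thread gets a private copy of p, initialized to
  // data + 2, so per-thread use of p cannot disturb other threads.
  #pragma omp parallel for firstprivate(p)
  for (int i = 0; i < 4; ++i) {
    std::printf("i=%d reads %d\n", i, p[i]);  // always reads data[2 + i]
  }
  return 0;
}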

// Increment ROI data pointer
bottom_rois += bbox.size(1);
// Increase data pointers by one outsize
top_data += channels_ * out_size_c;
Member:

Can these muls be taken out of the loop?

Contributor (Author):

Thanks for the great comment, I've redesigned this part :)

const Dtype* batch_data_c = batch_data + c * data_size_c;
Dtype* top_data_c = top_data + c * out_size_c;
Dtype* argmax_data_c = argmax_data + c * max_idx_size_c;

Member:

Nice work!
By the way, multiplies are a lot slower than adds, so you can get even faster by reducing the muls in the loop (replace the + and * with a single +):

const Dtype *batch_data_c = batch_data;
Dtype *top_data_c = top_data, *argmax_data_c = argmax_data;
for (int c = 0; c < channels_; ++c) {
  // Increment all data pointers
  batch_data_c  += data_size_c;
  top_data_c    += out_size_c;
  argmax_data_c += max_idx_size_c;
  // ...
}

Member:

In other places too...

Member:

Sorry, I mean like this:

const Dtype *batch_data_c = batch_data;
Dtype *top_data_c = top_data, *argmax_data_c = argmax_data;
for (int c = 0; c < channels_; ++c, 
                               batch_data_c  += data_size_c,
                               top_data_c    += out_size_c,
                               argmax_data_c += max_idx_size_c 
) {

}

Contributor:

@cjolivier01 Thanks for the great comments to Xinyu.
Regarding replacing multiplication (FMA) with incremental addition (+=), two points from my view:

  1. The incremental addition is more concise and logical than computing each index from the start pointer by multiplication. But it introduces a strong compute dependency: we CAN'T start iteration N+1 before iteration N has finished. Thus, we have to switch to the multiplication style for the parallelization (see the sketch after point 2).

  2. The efficiency of multiplication (FMA) and addition (ADD) is the same on the latest hardware (Intel Skylake).
    Take the SSE instructions as an example:
    ADD: latency 4, CPI 0.5
    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=add&techs=SSE&expand=127,3673,127
    FMA (MUL+ADD): latency 4, CPI 0.5
    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=127,3673,2395,3673,2395,2407&text=mul&techs=FMA,Other
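
To make point 1 concrete, a small illustrative sketch (float stands in for Dtype; the pooling body is reduced to a single read):

#include <omp.h>

// Incremental style: iteration c needs the pointer produced by
// iteration c-1, so the loop cannot be split across threads.
void serial_style(const float* batch_data, float* out,
                  int channels, int data_size_c) {
  const float* batch_data_c = batch_data;
  for (int c = 0; c < channels; ++c) {
    out[c] = batch_data_c[0];      // stand-in for pooling one channel
    batch_data_c += data_size_c;   // loop-carried dependency
  }
}

// Multiplication style: each iteration derives its offset from c alone,
// so iterations are independent and OpenMP can split them.
void parallel_style(const float* batch_data, float* out,
                    int channels, int data_size_c) {
  #pragma omp parallel for
  for (int c = 0; c < channels; ++c) {
    out[c] = batch_data[static_cast<long>(c) * data_size_c];
  }
}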

Member:

ok

Member:

try to keep in mind in the future that we support less-awesome CPUs than Intel Skylake :)

@piiswrong (Contributor):

@cjolivier01

@CodingCat (Contributor):

Hi, the community has voted to associate code changes with JIRA (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E)

We have updated the guidelines for contributors at https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA at https://issues.apache.org/jira/projects/MXNET/issues/ to describe your work in this pull request, and include the JIRA title in your PR as [MXNET-xxxx] your title, where MXNET-xxxx is the JIRA id.

Thanks!

@cjolivier01 (Member):

LGTM

@cjolivier01 cjolivier01 merged commit 884408a into apache:master Mar 9, 2018
@piiswrong (Contributor):

@cjolivier01
Shouldn't the omp pragma get the number of threads from OpenMP::Get()?

@cjolivier01 (Member):

It probably wouldn't hurt. I don't think it's critical, though, since channels tends to be a small number, and it doesn't look like the intent is to avoid OMP when a GPU is used; it's sort of like batch norm in this way.
But I suppose, for the sake of consistency, it probably should be fixed.
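
Such a fix might look like the sketch below (an assumption on my part: it relies on MXNet's engine::OpenMP helper from src/engine/openmp.h and its GetRecommendedOMPThreadCount() method; this is not part of the merged change):

// Sketch only: assumes mxnet::engine::OpenMP (src/engine/openmp.h);
// not part of the merged PR.
const int nthreads = engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
#pragma omp parallel for num_threads(nthreads) \
    firstprivate(batch_data, top_data, argmax_data)
for (int c = 0; c < channels_; ++c) {
  // per-channel pooling as in the merged code
}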

szha pushed a commit that referenced this pull request Mar 15, 2018
* [MXNET-67] Sync master with v1.1.0 branch (#10031)

* [REVIEW REQUIRED] Revert PR #9484 & add additional dependency licenses to LICENSE file (#9701)

* Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (#9484)"

This reverts commit 8930d96.

* Some more LICENSE fixes

* Adding some more packages to the LICENSE file

* Adding dependencies of dependencies

* update v1.1.0 change log to NEWS.md

* sync README.md from v1.1.0 branch

* revert to correct jenkins url in README

* Parallelization for ROIpooling OP (#9958)

* parallelization for roipooling

* remove some useless computation

* remove useless muls

* add author and retriggering

* retrigger again

* comments to copy and copyto are corrected (#10040)

* Bug Fix and performance optimized for rtc (#10018)

* Bug Fix and performance optimized for rtc

1. "super().__init__()" bug is fixed in python 2.
2. Kernel is initialized in the stage of operator init.

* Update custom_softmax_rtc.py

fix unnecessary format

* set embedding

* Code and test revised

* api implementation done

* license and news

* readme and cpp

* pylint disable

* Add API doc

* less pylint disable

* remove contrib

* move to gluon, revise api doc

* fix import order

* re-test

* relative imports

* re-run test

* revise implementation, test case, and api doc

* re-test
szha pushed a commit to szha/mxnet that referenced this pull request Mar 15, 2018
@xinyu-intel xinyu-intel deleted the roipooling branch March 25, 2018 02:19
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
@xinyu-intel xinyu-intel mentioned this pull request May 9, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018