This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Parallelization for ROIpooling OP #9958

Merged: 5 commits merged into apache:master from xinyu-intel:roipooling on Mar 9, 2018

Conversation

xinyu-intel (Contributor)

Description

What's the problem?

ROIPooling is used in Faster-RCNN. It consumes a lot of time in the inference path because the current implementation (here) is not parallelized.

One profiling result is shown below:

OP                  Time (ms)   ms/call       Calls
ROIPooling          38718.288   1548.73152    25
Convolution         9963.628    3.724720748   2675
Reshape             0.677       0.00677       100
Activation          1089.6      0.427294118   2550
SoftmaxActivation   43.279      1.73116       25
add_n               1664.403    2.017458182   825
Pooling             68.531      1.37062       50
_contrib_Proposal   351.051     14.04204      25
softmax             0.658       0.02632       25
FullyConnected      29.328      0.58656       50
BatchNorm           2865.261    1.123631765   2550

What we have tried

We (@pengzhao-intel, @TaoLv) have parallelized this pooling algorithm with OpenMP directives and achieved a 20x+ performance improvement.

Faster-RCNN
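
For intuition, here is a minimal sketch of the approach (illustrative names and a simplified pooling body, not the exact PR code): each channel's base offset is computed directly from the loop index, so the iterations are independent and OpenMP can split them across threads.

#include <omp.h>

// Minimal sketch of the parallelization idea (illustrative, not the PR code).
void pool_channels(const float* batch_data, float* top_data,
                   int channels, int data_size_c, int out_size_c) {
  #pragma omp parallel for
  for (int c = 0; c < channels; ++c) {
    // Base offsets derived from c directly: no dependency between iterations.
    const float* in_c = batch_data + static_cast<long>(c) * data_size_c;
    float* out_c = top_data + static_cast<long>(c) * out_size_c;
    for (int i = 0; i < out_size_c; ++i) {
      out_c[i] = in_c[i];  // stand-in for the max-pooling over each ROI bin
    }
  }
}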

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Parallelization for ROIpooling

for (int c = 0; c < channels_; ++c) {
// Increment all data pointers
batch_data += c * data.size(2) * data.size(3);
Member:

please cache the xx.size(2) * xx.size(3) result outside of the loops since it is invariant, so as to perform fewer multiplies.

Member:

if you do that, you can just keep adding that number on each pass of c, and then you don’t need to do a multiply for these at all

top_data += out.size(2) * out.size(3);
argmax_data += max_idx.size(2) * max_idx.size(3);
// Decrement all data pointers
batch_data -= c * data.size(2) * data.size(3);
Member:

same here but subtract

@@ -74,7 +79,13 @@ inline void ROIPoolForward(const Tensor<cpu, 4, Dtype> &out,

const Dtype* batch_data = bottom_data + data_size * roi_batch_ind;

#pragma omp parallel for firstprivate(batch_data, top_data, argmax_data)
Member:

curious why it’s not “parallel for” pragma?

Member:

Do you mean why there is a firstprivate clause here?

Member:

Oh, I totally missed the 'for' in the pragma. Sorry.
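
For context on the clause itself: firstprivate(batch_data, top_data, argmax_data) gives each thread its own copy of those pointers, initialized to the values computed for the current ROI just before the parallel region. A minimal standalone illustration (hypothetical example, not code from this PR):

#include <omp.h>
#include <cstdio>

int main() {
  int data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
  int* p = data + 2;  // advanced before the parallel region, like batch_data

  // firstprivate: each thread gets a private copy of p, initialized to
  // data + 2, so per-thread use of p cannot disturb other threads.
  #pragma omp parallel for firstprivate(p)
  for (int i = 0; i < 4; ++i) {
    std::printf("i=%d reads %d\n", i, p[i]);  // always reads data[2 + i]
  }
  return 0;
}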

// Increment ROI data pointer
bottom_rois += bbox.size(1);
// Increase data pointers by one outsize
top_data += channels_ * out_size_c;
Member:

Can these muls be taken out of the loop?

Contributor (Author):

Thanks for the great comment, I've redesigned this part :)

const Dtype* batch_data_c = batch_data + c * data_size_c;
Dtype* top_data_c = top_data + c * out_size_c;
Dtype* argmax_data_c = argmax_data + c * max_idx_size_c;

Member:

Nice work!
By the way, multiplies are a lot slower than adds, so you can get even faster by reducing the muls in the loop (replace the + and * with a single +):

const Dtype *batch_data_c = batch_data;
Dtype *top_data_c = top_data, *argmax_data_c = argmax_data;
for (int c = 0; c < channels_; ++c) {
  // Increment all data pointers
  batch_data_c  += data_size_c;
  top_data_c    += out_size_c;
  argmax_data_c += max_idx_size_c;
  // ...
}

Member:

In other places too...

Member:

Sorry, I mean like this:

const Dtype *batch_data_c = batch_data;
Dtype *top_data_c = top_data, *argmax_data_c = argmax_data;
for (int c = 0; c < channels_; ++c, 
                               batch_data_c  += data_size_c,
                               top_data_c    += out_size_c,
                               argmax_data_c += max_idx_size_c 
) {

}

Contributor:

@cjolivier01 Thanks for the great comments to Xinyu.
Regarding replacing multiplication (FMA) with incremental addition (+=), two points from my view:

  1. The incremental addition is more concise and logical than computing each index from the start pointer by multiplication. But it introduces a strong compute dependency: we CAN'T start iteration N+1 before iteration N has finished. Thus, we have to switch to the multiplication style for the parallelization (see the sketch after point 2).

  2. The efficiency of multiplication (FMA) and addition (ADD) is the same on the latest hardware (Intel Skylake).
    Take the SSE instructions as an example:
    ADD: latency 4, CPI 0.5
    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=add&techs=SSE&expand=127,3673,127
    FMA (MUL+ADD): latency 4, CPI 0.5
    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=127,3673,2395,3673,2395,2407&text=mul&techs=FMA,Other
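
To make point 1 concrete, a small illustrative sketch (float stands in for Dtype; the pooling body is reduced to a single read):

#include <omp.h>

// Incremental style: iteration c needs the pointer produced by
// iteration c-1, so the loop cannot be split across threads.
void serial_style(const float* batch_data, float* out,
                  int channels, int data_size_c) {
  const float* batch_data_c = batch_data;
  for (int c = 0; c < channels; ++c) {
    out[c] = batch_data_c[0];      // stand-in for pooling one channel
    batch_data_c += data_size_c;   // loop-carried dependency
  }
}

// Multiplication style: each iteration derives its offset from c alone,
// so iterations are independent and OpenMP can split them.
void parallel_style(const float* batch_data, float* out,
                    int channels, int data_size_c) {
  #pragma omp parallel for
  for (int c = 0; c < channels; ++c) {
    out[c] = batch_data[static_cast<long>(c) * data_size_c];
  }
}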

Member:

ok

Member:

try to keep in mind in the future that we support less-awesome CPUs than Intel Skylake :)

@piiswrong (Contributor):

@cjolivier01

@CodingCat (Contributor):

Hi, the community has voted to associate code changes with JIRA (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E)

We have updated the guidelines for contributors at https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA at https://issues.apache.org/jira/projects/MXNET/issues/ to describe your work in this pull request, and include the JIRA title in your PR as [MXNET-xxxx] your title, where MXNET-xxxx is the JIRA id.

Thanks!

@cjolivier01 (Member):

LGTM

@cjolivier01 cjolivier01 merged commit 884408a into apache:master Mar 9, 2018
@piiswrong (Contributor):

@cjolivier01
Shouldn't the omp pragma get the number of threads from OpenMP::Get()?

@cjolivier01 (Member):

It probably wouldn't hurt. I don't think it's critical, though, since channels tends to be a small number, and it doesn't look like the intent is to avoid OMP when a GPU is used; it's sort of like batch norm in this way.
But I suppose, for the sake of consistency, it probably should be fixed.
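
Such a fix might look like the sketch below (an assumption on my part: it relies on MXNet's engine::OpenMP helper from src/engine/openmp.h and its GetRecommendedOMPThreadCount() method; this is not part of the merged change):

// Sketch only: assumes mxnet::engine::OpenMP (src/engine/openmp.h);
// not part of the merged PR.
const int nthreads = engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
#pragma omp parallel for num_threads(nthreads) \
    firstprivate(batch_data, top_data, argmax_data)
for (int c = 0; c < channels_; ++c) {
  // per-channel pooling as in the merged code
}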

szha pushed a commit that referenced this pull request Mar 15, 2018
* [MXNET-67] Sync master with v1.1.0 branch (#10031)

* [REVIEW REQUIRED] Revert PR #9484 & add additional dependency licenses to LICENSE file (#9701)

* Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (#9484)"

This reverts commit 8930d96.

* Some more LICENSE fixes

* Adding some more packages to the LICENSE file

* Adding dependencies of dependencies

* update v1.1.0 change log to NEWS.md

* sync README.md from v1.1.0 branch

* revert to correct jenkins url in README

* Parallelization for ROIpooling OP (#9958)

* parallelization for roipooling

* remove some useless computation

* remove useless muls

* add author and retriggering

* retrigger again

* comments to copy and copyto are corrected (#10040)

* Bug Fix and performance optimized for rtc (#10018)

* Bug Fix and performance optimized for rtc

1. "super().__init__()" bug is fixed in python 2.
2. Kernel is initialized in the stage of operator init.

* Update custom_softmax_rtc.py

fix unnecessary format

* set embedding

* Code and test revised

* api implementation done

* license and news

* readme and cpp

* pylint disable

* Add API doc

* less pylint disable

* remove contrib

* move to gluon, revise api doc

* fix import order

* re-test

* relative imports

* re-run test

* revise implementation, test case, and api doc

* re-test
szha pushed a commit to szha/mxnet that referenced this pull request Mar 15, 2018
@xinyu-intel xinyu-intel deleted the roipooling branch March 25, 2018 02:19
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
@xinyu-intel xinyu-intel mentioned this pull request May 9, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018