Conversation
src/operator/roi_pooling.cc
Outdated
for (int c = 0; c < channels_; ++c) {
  // Increment all data pointers
  batch_data += c * data.size(2) * data.size(3);
Please cache the xx.size(2) * xx.size(3) result outside of the loops, since it is invariant, so as to perform fewer multiplies.
If you do that, you can just keep adding that number on each pass of c, and then you don't need to do a multiply for these at all.
src/operator/roi_pooling.cc
Outdated
top_data += out.size(2) * out.size(3);
argmax_data += max_idx.size(2) * max_idx.size(3);
// Decrement all data pointers
batch_data -= c * data.size(2) * data.size(3);
same here but subtract
src/operator/roi_pooling.cc
Outdated
@@ -74,7 +79,13 @@ inline void ROIPoolForward(const Tensor<cpu, 4, Dtype> &out,

  const Dtype* batch_data = bottom_data + data_size * roi_batch_ind;

#pragma omp parallel for firstprivate(batch_data, top_data, argmax_data)
Curious why it's not a "parallel for" pragma?
Do you mean why there is a firstprivate clause here?
Oh, I totally missed the 'for' in the pragma. Sorry.
src/operator/roi_pooling.cc
Outdated
// Increment ROI data pointer
bottom_rois += bbox.size(1);
// Increase data pointers by one outsize
top_data += channels_ * out_size_c;
Can these muls be taken out of the loop?
Thanks for the great comment, I've redesigned this part :)
src/operator/roi_pooling.cc
Outdated
const Dtype* batch_data_c = batch_data + c * data_size_c;
Dtype* top_data_c = top_data + c * out_size_c;
Dtype* argmax_data_c = argmax_data + c * max_idx_size_c;
Nice work!
By the way, multiplies are a lot slower than adds, so you can get even faster by reducing the muls in the loop (the + and * are replaced by a single +):

const Dtype *batch_data_c = batch_data;
Dtype *top_data_c = top_data, *argmax_data_c = argmax_data;
for (int c = 0; c < channels_; ++c) {
  // Increment all data pointers
  batch_data_c += data_size_c;
  top_data_c += out_size_c;
  argmax_data_c += max_idx_size_c;
  ...
}
In other places too...
Sorry, I mean like this:

const Dtype *batch_data_c = batch_data;
Dtype *top_data_c = top_data, *argmax_data_c = argmax_data;
for (int c = 0; c < channels_; ++c,
         batch_data_c += data_size_c,
         top_data_c += out_size_c,
         argmax_data_c += max_idx_size_c) {
  ...
}
@cjolivier01 Thanks for the great comments to Xinyu.
Regarding replacing multiplication (FMA) with incremental addition (+=), two points from my view:
- The incremental addition is more concise and logical than computing the index from the start point each time by multiplication. But it creates a strong compute dependency: we CAN'T start loop pass N+1 before pass N finishes. Thus, we have to change to the multiply style for the parallelization.
- The efficiency of multiplication (FMA) and addition (ADD) is the same on the latest HW (Intel Skylake). Taking SSE instructions as an example:
  ADD: latency 4, CPI 0.5
  https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=add&techs=SSE&expand=127,3673,127
  FMA (MUL+ADD): latency 4, CPI 0.5
  https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=127,3673,2395,3673,2395,2407&text=mul&techs=FMA,Other
ok
try to keep in mind in the future that we support less-awesome CPUs than Intel Skylake :)
Hi, the community has voted to associate code changes with JIRA (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E). We have updated the guidelines for contributors at https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA at https://issues.apache.org/jira/projects/MXNET/issues/ describing your work in this pull request, and include the JIRA title in your PR as [MXNET-xxxx] Your title, where MXNET-xxxx is the JIRA id. Thanks!
LGTM
@cjolivier01
It probably wouldn't hurt. I don't think it's critical, though, since channels tends to be a small number, and it doesn't look like the intent is to skip OMP when a GPU is used. It's sort of like batch norm in this way.
* [MXNET-67] Sync master with v1.1.0 branch (#10031)
* [REVIEW REQUIRED] Revert PR #9484 & add additional dependency licenses to LICENSE file (#9701)
* Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (#9484)" This reverts commit 8930d96.
* Some more LICENSE fixes
* Adding some more packages to the LICENSE file
* Adding dependencies of dependencies
* update v1.1.0 change log to NEWS.md
* sync README.md from v1.1.0 branch
* revert to correct jenkins url in README
* Parallelization for ROIpooling OP (#9958)
* parallelization for roipooling
* remove some useless computation
* remove useless muls
* add author and retriggering
* retrigger again
* comments to copy and copyto are corrected (#10040)
* Bug Fix and performance optimized for rtc (#10018): 1. "super().__init__()" bug is fixed in python 2. 2. Kernel is initialized in the stage of operator init.
* Update custom_softmax_rtc.py: fix unnecessary format
* set embedding
* Code and test revised
* api implementation done
* license and news
* readme and cpp
* pylint disable
* Add API doc
* less pylint disable
* remove contrib
* move to gluon, revise api doc
* fix import order
* re-test
* relative imports
* re-run test
* revise implementation, test case, and api doc
* re-test
* parallelization for roipooling
* remove some useless computation
* remove useless muls
* add author and retriggering
* retrigger again
Description
What's the problem?
ROIPooling is used in Faster-RCNN. It consumes a lot of time in the inference path because the current implementation (here) is not parallelized.
One profiling result is shown below:
What we have tried
We (@pengzhao-intel @TaoLv) have parallelized this pooling algorithm and achieved a 20+X performance improvement with OpenMP directives.
Checklist
Essentials
- Passed code style checking (make lint)
Changes