[MXNET-105] Fix CuDNN performance after code refactor #10116
Conversation
for (uint32_t i = 0; i < out_data.size(); ++i) {
  out_data[i] = nnvm::NodeEntry{n, i, 0};
}
std::vector<nnvm::NodeEntry> heads;
Please use reserve()
This code runs to build the computation graph. It only runs once. Do we still need to call reserve()?
yes, please
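For context, reserve() pre-allocates the vector's storage so that subsequent push_back calls don't trigger reallocations. A minimal sketch against the diff above (the capacity expression is illustrative, not taken from the PR):

```cpp
std::vector<nnvm::NodeEntry> heads;
// Illustrative capacity: enough for every output entry plus the node's inputs.
heads.reserve(out_data.size() + n->inputs.size());
for (const auto& e : out_data) {
  heads.push_back(e);  // no reallocation while size stays within capacity
}
```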
src/operator/nn/batch_norm.cc
Outdated
// add all the auxiliary data
//for (uint32_t i = 0; i < prop.aux_states.size(); ++i) {
//  inputs.emplace_back(ptr->inputs[i + prop.arguments.size()]);
//}
?
src/operator/nn/batch_norm.cu
Outdated
std::vector<TBlob> in_data(3);
in_data[batchnorm::kData] = inputs[3];
in_data[batchnorm::kGamma] = inputs[4];
std::vector<TBlob> aux_states(2);
What happens to aux states (running mean and variance)?
The CuDNN version doesn't need aux_states, but the CUDA version does, so aux_states is set properly to run the CUDA code.
So, this fix only improves the CUDNN operator? Wouldn't we expect the other non-CUDNN operators to also be slower by the same amount?
This needs a JIRA ticket
src/operator/nn/batch_norm.cc
Outdated
                              inputs.begin() + out_data_start);
std::vector<NDArray> out_data(inputs.begin() + out_data_start, inputs.end());
std::vector<NDArray> in_grad(outputs.begin(), outputs.begin() + 3);
static thread_local std::vector<NDArray> out_grad(1);
How does thread_local help here?
Won't these hold a reference to the NDArray's Chunk data indefinitely?
Here I'm trying to avoid memory allocations for the std::vector.
But you're right, it could potentially cause a memory leak.
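To illustrate the concern (a minimal sketch with illustrative names, not the PR's exact code): a function-local static thread_local vector outlives each call, so whatever it holds on return stays referenced until the next call on that thread overwrites it:

```cpp
void SomeBackward(const std::vector<NDArray>& outputs) {
  // Lives for the whole thread's lifetime; constructed once, reused across calls.
  static thread_local std::vector<NDArray> out_grad(1);
  out_grad[0] = outputs[0];
  // ... launch the backward kernel using out_grad ...
  // On return, out_grad[0] still holds a reference to the NDArray's
  // underlying Chunk, so that memory cannot be freed until the next
  // call on this thread replaces the entry (or the thread exits).
}
```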
src/operator/nn/batch_norm-inl.h
Outdated
                            inputs.begin() + out_data_start);
std::vector<TBlob> out_data(inputs.begin() + out_data_start, inputs.end());
std::vector<TBlob> in_grad(outputs.begin(), outputs.begin() + 3);
static thread_local std::vector<TBlob> out_grad(1);
This is probably too many thread_locals.
Why are we copying the vectors in the first place?
Why not change the interface of the operator's Forward/Backward?
good question. I'll do that instead.
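A sketch of the direction suggested here (the signature is hypothetical, not the interface the PR ends up with): instead of copying slices of `inputs` into fresh vectors on every call, Forward/Backward could take iterator ranges over the original array:

```cpp
// Hypothetical interface sketch: pass ranges rather than copied vectors.
using TBlobIter = std::vector<TBlob>::const_iterator;

void BatchNormBackwardRange(TBlobIter out_grad_begin, TBlobIter out_grad_end,
                            TBlobIter in_data_begin, TBlobIter in_data_end,
                            const std::vector<TBlob>& outputs) {
  // Read the TBlobs directly through the iterators; no per-call
  // std::vector allocation or element copies are needed.
}
```

The caller would then pass, e.g., `inputs.begin()` and `inputs.begin() + out_data_start` directly.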
I measured the performance with (opts) and without (no opts) this PR, and compared against the commit before #8302 (original) in the master branch. I ran each version 20 times and calculated the average and standard deviation. Performance is measured in images/second. With this fix, performance is close to the original version; it's unclear where the remaining loss comes from.
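For reference, the reported mean and standard deviation can be computed from the 20 per-run throughputs with a plain sample estimate (a generic sketch, not the script used for these numbers):

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Sample mean of per-run throughputs (images/second).
double Mean(const std::vector<double>& v) {
  return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Sample standard deviation with Bessel's correction.
double StdDev(const std::vector<double>& v) {
  const double m = Mean(v);
  double sq = 0.0;
  for (double x : v) sq += (x - m) * (x - m);
  return std::sqrt(sq / (v.size() - 1));
}
```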
The command to run the test:
  })
}
#else
aux_states[batchnorm::kMovingMean] = inputs[6];
aux_states[batchnorm::kMovingVar] = inputs[7];
@zheng-da aux_states is not defined if USE_CUDNN is not enabled. @marcoabreu it seems there is no pure CUDA CI environment, i.e. one built without cuDNN.
I see. I'll update it.
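One way the guard could be restructured so the #else branch compiles (a hedged sketch; the PR's actual fix may differ) is to declare aux_states outside the conditional:

```cpp
std::vector<TBlob> aux_states(2);  // declared unconditionally so both branches see it
#if MXNET_USE_CUDNN == 1
// The cuDNN branch maintains the running statistics internally and never
// reads aux_states, so the empty TBlobs are simply unused here.
#else
aux_states[batchnorm::kMovingMean] = inputs[6];
aux_states[batchnorm::kMovingVar] = inputs[7];
#endif
```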
Agreed. @marcoabreu could you add a CI job with CUDA only?
Sure, no problem at all! Compilation only or do we need tests as well?
I think it's better to run the code at least once. We probably don't need to try both Python2 and Python3, something like that.
Done: #10281
* Reduce #inputs/outputs of batchnorm backward.
* Pass more arrays to BN.
* Make std::vector thread local.
* Set inputs of BN backward for other cases.
* Fix for other cases.
* Remove commented code.
* Fix a potential mem leak.
* Fix a compile error in mkldnn.
* Fix an error.
* Reserve space for std::vector.
* Fix alignment.
* Fix cpp unit test.
* Fix BN CPP unit tests.
* Fix a compile error.
* Fix compilation error.
* Move Op signature.
* Cache CuDNN conv op.
* Fix compile error.
* Remove thread_local.
* Reduce mem alloc when caching cudnn conv.
* Fix a lint error.
* Cache CuDNN deconv.
* Fix lint error.
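Several commits above cache the CuDNN convolution/deconvolution operator keyed by an op signature. A hedged sketch of that caching pattern (class and function names are illustrative, not MXNet's actual ones):

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative signature: a serialized key built from the op's parameters,
// input shapes, and dtypes, so one operator instance is reused whenever the
// same configuration recurs (avoiding repeated cuDNN descriptor setup and
// algorithm selection).
struct OpSignature {
  std::string key;
  bool operator==(const OpSignature& other) const { return key == other.key; }
};

struct OpSignatureHash {
  size_t operator()(const OpSignature& s) const {
    return std::hash<std::string>()(s.key);
  }
};

template <typename Op>
Op& GetCachedOp(const OpSignature& sig) {
  // Per-thread cache avoids locking; a global map guarded by a mutex
  // is an equally valid design choice.
  static thread_local std::unordered_map<OpSignature, std::shared_ptr<Op>,
                                         OpSignatureHash> cache;
  auto it = cache.find(sig);
  if (it == cache.end()) {
    it = cache.emplace(sig, std::make_shared<Op>()).first;
  }
  return *it->second;
}
```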
Description
This PR tries to fix the performance degradation reported in #9874.
We observed about a 4% performance decrease, caused by several factors.
This PR tries to reduce these overheads.
Checklist
Essentials
Passes code style checking (make lint)