CUB Memory Manager + cuDNN v4 and v5 support #3919
Conversation
Right now GitHub refuses to render the entire diff (too many changes), so the only thing that one can see are the added
OK, thank you. Split into 2 commits.
d41bfbe to f061e51
@@ -316,6 +317,21 @@ class Layer {
    param_propagate_down_[param_id] = value;
  }

  bool IsForwardPassed() const {
Is it possible to not have these as a part of Layer? AFAICT you can have exactly the same effect by just putting these forward_passed/backward_passed as instance variables in CuDNNBatchNormalizationLayer, and you don't add (essentially) completely unused base functions to a core class.
IMO it's preferable to keep the core classes like Layer, Blob as small as possible.
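A minimal sketch of that alternative (hypothetical code, not from this PR): the flags would live on the cuDNN layer itself rather than on Layer.

// Sketch only (inside namespace caffe): keep the bookkeeping local to the
// cuDNN layer instead of adding accessors to the Layer base class.
template <typename Dtype>
class CuDNNBatchNormLayer : public BatchNormLayer<Dtype> {
 public:
  explicit CuDNNBatchNormLayer(const LayerParameter& param)
      : BatchNormLayer<Dtype>(param),
        forward_passed_(false), backward_passed_(false) {}

 protected:
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  // Replaces the IsForwardPassed()/ForwardPassed() pair proposed on Layer.
  bool forward_passed_;
  bool backward_passed_;
};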
This is part of the annoyance of Caffe's lazy allocation. There are allocations that happen late, specifically setup of cuRAND, so we need to be able to delay the final choice of algorithm until we know how much memory we really have. (There is an upcoming PR to move to the findEx paths in cuDNN v5 instead of get, which makes this even more challenging.) We think we need this more generally. Would love a better solution, but we start going down a "plan"-style path pretty quickly.
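For context, the kind of cuDNN v4/v5 call this concerns looks roughly like the sketch below; descriptor setup is omitted, and taking the workspace limit from the memory manager is an assumption about how the pieces fit together, not the PR's actual code.

// Sketch: pick the fastest forward algorithm that fits in the memory that is
// actually still free once the late (lazy) allocations have happened.
size_t workspace_limit = remaining_gpu_memory();  // hypothetical helper
cudnnConvolutionFwdAlgo_t algo;
CUDNN_CHECK(cudnnGetConvolutionForwardAlgorithm(
    handle_, bottom_desc_, filter_desc_, conv_desc_, top_desc_,
    CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT,
    workspace_limit, &algo));
// cuDNN v5 also offers cudnnFindConvolutionForwardAlgorithmEx (the "findEx"
// path mentioned above), which benchmarks algorithms in caller-provided memory.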
Andrew, there are two reasons for keeping them in the base class:
- In Net::ForwardFromTo and Net::BackwardFromTo we call setters using pointers to base classes (i.e. pure virtuals would be even more expensive here).
- Most probably there will be more use cases like this in other cuDNN-based layer implementations.
Unless I've misunderstood, it's hard to believe that there is any real performance implication due to 1.
If I'm imagining it correctly, what @thatguymike is calling a "plan"-style path sounds right to me, though admittedly difficult to implement in current Caffe. [In fact, I have code of my own (not as a Caffe branch) that takes that path, so it might be nice to discuss that elsewhere.]
Currently, however, this is a hack. Note that in current usage, memory usage can change (potentially drastically) after the initial forward-backward pass due to reshaping. So it's better not to push such things into the core of Caffe. If it becomes useful to share this kind of code among cuDNN layers, that can be done with another (intermediate) class, or some other kind of helper. This is also a (minor?) violation of modularity; layers are not really supposed to know anything about other layers or nets or whoever is running them.
I agree with @ajtulloch here -- I see no compelling reason to add these functions to all layers.
@longjon Now please look at net.cpp:
for (int i = start; i <= end; ++i) {
  layers_[i]->ForwardPassed(true);
}
Here layers_[i] is a pointer to the base class Layer. How would we call ForwardPassed here?
You can't, of course. Instead, you'd need to set forward_passed_ in layer code, presumably at the end of Forward_*.
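A minimal sketch of that suggestion, assuming forward_passed_ is an ordinary member of the cuDNN layer (illustrative only, not the PR's code):

template <typename Dtype>
void CuDNNBatchNormLayer<Dtype>::Forward_gpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  // ... the actual cuDNN batch normalization forward call goes here ...
  // Record completion locally; Net never needs to call Layer::ForwardPassed().
  forward_passed_ = true;
}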
(Perhaps the intention here was to keep track not simply of whether an individual layer's forward has completed, but whether all layers' forwards have completed. But note that that's not what this code does anyway.)
f061e51 to 4044ed7
That's cool that we can do batch normalization using cuDNN. So when we use the CudnnBatchNorm layer, do we still need to append a Scale layer after it?
No, you don't have to add scale/shift layers. The cuDNN BN layer has both scale and shift inside.
4044ed7 to c9eda39
Removed unnecessary files from the 3rdparty directory. Just 5 of them are left; those are the only ones required.
Cool! How do I make changes to use this module? Do I need to add a new layer before any ReLU layer? Or will it automatically do BN when I specify a variable in the prototxt file?
Regarding BN layer usage: I've recently posted in NVIDIA/DIGITS#629 about some differences between NVIDIA Caffe and the BVLC version. Since this PR has the same BatchNormParameter message in caffe.proto as the current NVIDIA one, the following example should work in BVLC Caffe as well:
There's no need to specify use_global_stats, since Caffe automatically infers the correct state (TEST or TRAIN). The same layer definition is used for both the train_val and deploy prototxt.
Added "Returned shift and scale back to BN layer" commit to make BN layer implementation consistent with the paper, cuDNN and other frameworks. |
51bbfa6 to d3285ae
Today, if you want to use the BN layer without scale and shift, you can initialize these two parameters to 1 and 0 and set their lr and weight decay to 0 in train_val.prototxt. I can add a new parameter which can do this automatically.
void BatchNormLayer<Dtype>::compute_sum_per_channel_gpu(int N, int C, int S,
    const Dtype *x, Dtype *y) {
  // assume that x.shape(1)==C
  Blob<Dtype> temp_NC;
It looks like this was meant to be temp_NC_? Do you want to s/temp_NC/temp_NC_/g in these files?
Originally it was temp_NC_, but then I decided that it is safer to allocate temp_NC inside these 2 functions to avoid a potential hidden overwrite (for example, if I later use temp_NC_ outside). I will re-check whether this has a noticeable effect on time.
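A rough sketch of the locally-allocated variant being described; the two-step reduction and the ones_S_/ones_N_ multiplier blobs (1-filled vectors of length S and N) are illustrative assumptions, not the PR's exact code.

template <typename Dtype>
void BatchNormLayer<Dtype>::compute_sum_per_channel_gpu(int N, int C, int S,
    const Dtype* x, Dtype* y) {
  // Local scratch blob: allocated per call so no other code path can clobber it.
  Blob<Dtype> temp_NC;
  temp_NC.Reshape(vector<int>(1, N * C));
  // Step 1: reduce over the spatial dimension S -> temp_NC[n*C + c] = sum_s x[n,c,s].
  caffe_gpu_gemv<Dtype>(CblasNoTrans, N * C, S, Dtype(1), x,
      ones_S_.gpu_data(), Dtype(0), temp_NC.mutable_gpu_data());
  // Step 2: reduce over the batch dimension N -> y[c] = sum_n temp_NC[n*C + c].
  caffe_gpu_gemv<Dtype>(CblasTrans, N, C, Dtype(1), temp_NC.gpu_data(),
      ones_N_.gpu_data(), Dtype(0), y);
}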
d3285ae to 2c4cfbd
@@ -22,17 +23,29 @@ void BatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
  if (this->blobs_.size() > 0) {
    LOG(INFO) << "Skipping parameter initialization";
  } else {
    this->blobs_.resize(3);
    this->blobs_.resize(5);
Why 5 blobs? blobs_[4] is not used in the current code, and according to the usage of the BatchNorm layer in the given prototxt file, there should be 4 blobs.
blobs_[4] is referenced in a few places, but I agree that it looks like it's not used anywhere.
@borisgin can you comment on this?
If this BN method is used, will I be able to load a caffemodel (for finetuning) trained using the old BN method? I tried this in the nvidia/caffe repo and loading the caffemodel exited, complaining that the number of blobs doesn't match. This could be a critical need, as there are several older caffemodels out there that we need to use.
No. The old BN has 2 blobs: global mean and variance. The new one has 5: scale, bias, global_mean, global_variance, and global_counter.
This could be a problem, since several popular models become unusable (e.g. https://github.com/KaimingHe/deep-residual-networks). Do we really have to break backward compatibility this way? In the old BN we used to have a separate scaling layer, and that was fine. Kindly reconsider this and try to keep the compatibility.
Alternatively, you could provide an upgrade method as well: for example, while loading the older model for finetuning, the scale and bias blobs could be forced to one and zero respectively. I also saw that the shapes of the blobs (global_mean and global_variance) were different; although they were of the same size, this also creates a problem.
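A sketch of that upgrade-on-load idea (a purely hypothetical helper, not part of this PR; the 5-blob order follows the description above):

// Pad a legacy 2-blob BatchNorm parameter set out to the new 5-blob layout,
// with a neutral scale/bias so old caffemodels stay loadable.
template <typename Dtype>
void UpgradeLegacyBatchNormBlobs(vector<shared_ptr<Blob<Dtype> > >* blobs,
    int channels) {
  CHECK_EQ(blobs->size(), 2) << "expected legacy BatchNorm: mean + variance";
  vector<shared_ptr<Blob<Dtype> > > upgraded(5);
  vector<int> shape(1, channels);
  upgraded[0].reset(new Blob<Dtype>(shape));   // blobs_[0]: scale -> all ones
  upgraded[1].reset(new Blob<Dtype>(shape));   // blobs_[1]: bias  -> all zeros
  caffe_set(channels, Dtype(1), upgraded[0]->mutable_cpu_data());
  caffe_set(channels, Dtype(0), upgraded[1]->mutable_cpu_data());
  upgraded[2] = (*blobs)[0];                   // blobs_[2]: global mean (reshape if needed)
  upgraded[3] = (*blobs)[1];                   // blobs_[3]: global variance
  upgraded[4].reset(new Blob<Dtype>(vector<int>(1, 1)));  // blobs_[4]: counter
  upgraded[4]->mutable_cpu_data()[0] = Dtype(1);
  blobs->swap(upgraded);
}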
I agree that there is merit in what you did by combining the normalization and scaling. Yet another easy fix is to make this a different layer, say "BatchNormScale", and keep the older layer as it is for backward compatibility. This is so far the simplest (minimal code changes) solution that I could come up with.
Agreed, adding a new layer would be the simplest way.
So do you think you can do the name change (and keep the old layer) right away? I would love to use this new layer with cuDNN, as it gives me a 2x boost in speed.
Do you have any convergence issues with the cuDNN BatchNorm used in this PR? In nvidia/caffe, I had to set the engine to CAFFE for BatchNorm to get convergence. I was struggling with the convergence issue, but finally the following worked for me. Specifying the engine as CAFFE is important; cuDNN BatchNorm doesn't converge for me. The following is the configuration that I used in the nvidia/caffe version. I am posting it here because I think the underlying implementation is the same.
layer {
If the cuDNN BatchNorm converged, that would have given me an overall 4x boost in speed, but with the CAFFE engine for BatchNorm I get only a 2x boost overall!
I have a couple more comments:
These changes will help someone who is trying to use this layer for the first time, apart from saving some space in the prototxt.
Need urgent help: any suggestions to solve the issue of non-convergence with cuDNN BatchNorm? Are the cuDNN BatchNorm implementations the same in nvcaffe-0.15, nvcaffe-0.16, and this PR? (I tried nvcaffe-0.16.) Any suggestions would help.
I will re-check the convergence issue with cuDNN_BN.
The idea to change the parameter order is very good. I will do this.
Thanks. How about the fillers? Can you provide default values for them too?
When you check the convergence issue with cuDNN_BN, kindly do so with a network that has many BN layers, for example ResNet-18.
Can you send me your solver.prototxt and train_test.prototxt files for the network?
I couldn't attach for some reason. I have copied the combined (solver + train) prototxt below. If you change the BatchNorm engine to CAFFE, it will start to converge.
[combined solver + train prototxt for "ResNet-18(1024)"; full listing not preserved in this thread]
Were you able to reproduce the behavior?
I was able to reproduce the problem: the caffe engine is converging, but cudnn is not.
Thank you so much for confirming. I'll wait for further information from you.
Btw, I can see that the additional parameters that you mentioned have default values, so technically I don't need to specify them. Do the values suggested by you produce better accuracy?
// How much does the moving average decay each iteration?
Is there a quick fix or a workaround that can solve this issue?
Yes. CUDNN_BN requires different top and bottom blobs since it needs the bottom for the backward pass. I attached the fixed train_val.prototxt (I also made a few additional minor changes: changed the last layer from a 1x1 conv to an IP layer with num_of_outputs=1000).
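That constraint could be made explicit with a check along these lines (hypothetical sketch, not the PR's code):

template <typename Dtype>
void CuDNNBatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  // Backward needs the unmodified bottom data, so in-place use is rejected.
  CHECK_NE(bottom[0], top[0])
      << "CUDNN_BN does not support in-place computation; use a distinct top, "
      << "e.g. bottom: \"conv1\", top: \"conv1/bn\".";
  // ... cuDNN descriptor and parameter setup would follow here ...
}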
Thanks. This helped a lot. Btw, I have an observation: I have a network trained with the CAFFE BN engine. When I tried to TEST it using the CUDNN BN engine, Caffe exited saying that the shapes of the BN blobs mismatch. But since the blobs are the same size (just different shapes), I was able to forcefully reshape the blobs and run the TEST, and it gave correct results!
My bug :(
Don't worry, cuDNN is great: it gives me a 4x speed boost, and that makes a huge difference. You just need to make slight changes and do some testing: a cross-compatibility test against the CAFFE engine and a backward-compatibility test.
See another thread where a similar issue is being reported: NVIDIA/DIGITS#629
@borisgin Out of curiosity, have you tried networks larger than ResNet-18 on NVIDIA Caffe? Your ResNet-18 is the only one that converges for me, and I can't find a single example of a larger ResNet that does.
This PR adds two inseparable features to Caffe: a high-performance CUB memory manager and the long-awaited upgrade from the cuDNN v3 to the cuDNN v4 (and upcoming v5) libraries.