BatchNorm after ReLU #5

Open
ducha-aiki opened this issue Jan 31, 2016 · 8 comments

@ducha-aiki

Hi,

I am running a somewhat similar benchmark, but with caffenet128 (and now moving to ResNets) on ImageNet.
One thing I have found: the best position for BN in a non-residual net is after ReLU and without the scale+bias layer (https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md):

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| Before | 0.474 | 2.35 | As in paper |
| Before + scale&bias layer | 0.478 | 2.33 | As in paper |
| After | 0.499 | 2.21 | |
| After + scale&bias layer | 0.493 | 2.24 | |

Maybe it is worth testing too.
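As a rough illustration of the four configurations in the table, here is a minimal pure-Python sketch. It is not the Caffe setup used in the benchmark: a small fully connected layer stands in for the convolutions, and the names `batch_norm`, `relu`, `linear`, and `block` are made up for this example.

```python
import math
import random


def batch_norm(batch, gamma=None, beta=None, eps=1e-5):
    """Normalize each feature over the batch; optionally apply scale (gamma) and bias (beta)."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            x_hat = (col[i] - mean) / math.sqrt(var + eps)
            if gamma is not None:
                # The separate "scale&bias layer" from the table.
                x_hat = gamma[j] * x_hat + beta[j]
            out[i][j] = x_hat
    return out


def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]


def linear(batch, weights, bias):
    """Hypothetical stand-in for a conv layer: one weight row per output feature."""
    return [[sum(w * v for w, v in zip(w_row, row)) + b
             for w_row, b in zip(weights, bias)] for row in batch]


def block(batch, weights, bias, bn_position, gamma=None, beta=None):
    """'before': linear -> BN -> ReLU; 'after': linear -> ReLU -> BN."""
    z = linear(batch, weights, bias)
    if bn_position == "before":
        return relu(batch_norm(z, gamma, beta))
    if bn_position == "after":
        return batch_norm(relu(z), gamma, beta)
    raise ValueError("bn_position must be 'before' or 'after'")


rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(64)]
weights = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
bias = [0.1, -0.2, 0.3]
gamma, beta = [1.0] * 3, [0.0] * 3  # assumed initial values, not trained

outputs = {
    "Before": block(x, weights, bias, "before"),
    "Before + scale&bias layer": block(x, weights, bias, "before", gamma, beta),
    "After": block(x, weights, bias, "after"),
    "After + scale&bias layer": block(x, weights, bias, "after", gamma, beta),
}
for name, out in outputs.items():
    mean0 = sum(row[0] for row in out) / len(out)
    print(f"{name}: mean of feature 0 = {mean0:.3f}")
```

With γ fixed at 1 and β at 0, as assumed here, the scale&bias variants compute the same output as the plain ones; any difference between them only appears once those parameters are trained.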

Second, results on CIFAR-10 often contradict results on ImageNet. For example, leaky ReLU is better than ReLU on CIFAR, but worse on ImageNet.

P.S. We could cooperate on ImageNet testing, if you agree.

@gcr
Owner

gcr commented Feb 1, 2016

Oh, interesting! I'll add a link to this issue in the README, if you don't mind.

What is the 'scale&bias layer'? In Torch, batch normalization layers have learnable weight and bias parameters that correspond to γ and β in the Batch Norm paper. Is that what you mean?

@ducha-aiki
Author

Yes, β and γ. In Caffe, batch normalization is split into a batchnorm layer and a separate layer holding the learnable affine parameters.
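For reference, the transform from the Batch Norm paper can be written as two steps, which is roughly how the split maps onto the math. Here $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance and $\epsilon$ is a small constant for numerical stability:

```latex
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}},
\qquad
y_i = \gamma \hat{x}_i + \beta
```

The first step is the normalization itself; the second is the learnable affine part. Leaving out the scale&bias layer corresponds to fixing $\gamma = 1$ and $\beta = 0$.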

@ducha-aiki
Author

> On Imagenet, @ducha-aiki found the opposite effect from the CIFAR results above. Putting batch normalization after the residual layer seems to improve results on Imagenet.

That is not correct: so far I have run the batchnorm experiments only on plain, non-residual nets :) The batchnorm ResNets are still training. Also, the "ThinResNet-101" from my benchmark does not use batchnorm at all; it serves as a baseline.

@gcr
Owner

gcr commented Feb 1, 2016

Oh, I guess I misunderstood, pardon. So this experiment was on an ordinary Caffenet, not a residual network?

@ducha-aiki
Author

Yes.

@gcr
Owner

gcr commented Feb 1, 2016

Thanks, that makes sense. It's interesting because it challenges the commonly-held assumption that batch norm before ReLU is better than after. I'd be interested to see how much of an impact the residual network architecture has on ImageNet. The harder the task, the more of an effect different architectures seem to have.

@ducha-aiki
Author

> commonly-held assumption that batch norm before ReLU is better than after.

I never understood this from the original paper, because the point of data whitening is to normalize a layer's input, and the ReLU output is usually the input to the next layer.

@cgarciae

cgarciae commented Jun 27, 2017

@ducha-aiki The paper reads:

> In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian", normalizing it is likely to produce activation with a stable distribution.

I get from this that it's better to batch-normalize the linear function, since it's more likely to behave like a normal distribution (from which the method is derived), especially in cases like the ReLU function, which is asymmetric.
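A quick way to see the symmetric vs. asymmetric point is a toy simulation. This is only a sketch under strong assumptions that are mine, not the paper's: the inputs `u` are drawn independently from Uniform(-1, 1), the weights `w` are fixed random values, and `b` is 0. The helper names `skewness` and `zero_fraction` are made up for the example.

```python
import random


def skewness(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def zero_fraction(values):
    """Fraction of entries that are exactly zero (a simple sparsity measure)."""
    return sum(1 for v in values if v == 0.0) / len(values)


rng = random.Random(0)
num_inputs, num_samples = 32, 5000

# Assumed toy layer: fixed random weights, zero bias.
w = [rng.gauss(0.0, 1.0) for _ in range(num_inputs)]
b = 0.0

pre = []  # samples of the linear output Wu + b
for _ in range(num_samples):
    u = [rng.uniform(-1.0, 1.0) for _ in range(num_inputs)]
    pre.append(sum(wi * ui for wi, ui in zip(w, u)) + b)

post = [max(0.0, v) for v in pre]  # the same samples after ReLU

print("Wu + b : skewness", round(skewness(pre), 2),
      "zero fraction", round(zero_fraction(pre), 2))
print("ReLU   : skewness", round(skewness(post), 2),
      "zero fraction", round(zero_fraction(post), 2))
```

In this toy setting the linear output is close to symmetric and has essentially no exact zeros, while the ReLU output is clearly right-skewed with about half of its entries at zero. It says nothing about which BN position trains better, only about the shape of the distribution being normalized.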
