BatchNorm after ReLU #5
Oh, interesting! I'll add a link to this issue in the README, if you don't mind. What is the 'scale&bias layer'? In Torch, batch normalization layers have learnable parameters built in.
Yes, β and γ. In Caffe, BatchNorm is split into a batch-normalization layer and a separate layer holding the learnable affine parameters.
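For readers unfamiliar with that split, here is a minimal NumPy sketch (not actual Caffe or Torch code; the function and variable names are illustrative only) of how the same computation decomposes into a whitening stage and a learnable affine stage. In Caffe these correspond to the BatchNorm and Scale layers; in Torch both stages live inside one batch-normalization module.

```python
import numpy as np

def batchnorm_two_stage(x, gamma, beta, eps=1e-5):
    """Batch normalization written as two explicit stages.

    Stage 1 corresponds to Caffe's BatchNorm layer (whitening only),
    stage 2 to Caffe's Scale layer (learnable gamma, plus a bias).
    Torch keeps both stages in a single module.
    """
    # Stage 1: normalize each channel over the batch dimension.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Stage 2: learnable affine transform (scale & bias).
    return gamma * x_hat + beta

# Toy usage: a batch of 128 samples with 64 channels.
x = np.random.randn(128, 64)
gamma, beta = np.ones(64), np.zeros(64)
y = batchnorm_two_stage(x, gamma, beta)
```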
That is not correct: I have done batchnorm experiments only on plain, non-residual nets so far :) The batchnorm ResNets are still training. And the "ThinResNet-101" from my benchmark does not use batchnorm at all, as a baseline.
Oh, I guess I misunderstood, pardon. So this experiment was on an ordinary CaffeNet, not a residual network?
Yes.
Thanks, that makes sense. It's interesting because it challenges the commonly held assumption that batch norm before ReLU is better than after. I'd be interested to see how much of an impact the residual network architecture has on ImageNet; the harder the task, the more of an effect different architectures seem to have.
I never understood this from the original paper, because the point of data whitening is to normalize a layer's input, and the ReLU output is usually the input to the next layer.
@ducha-aiki The paper reads:
"We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is 'more Gaussian'; normalizing it is likely to produce activations with a stable distribution."
I get from this that it's better to batch-normalize the output of the linear function, since it is more likely to behave like a normal distribution (from which the method is derived), especially compared with an asymmetric nonlinearity like ReLU.
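As a rough illustration of the two placements under discussion, here is a hedged NumPy sketch (not code from either benchmark; W, b, and the shapes are made up): the paper normalizes the pre-activation x = Wu + b, while the alternative normalizes the ReLU output that actually feeds the next layer.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bn(x, eps=1e-5):
    # Plain per-channel normalization over the batch; gamma/beta omitted for brevity.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
b = np.zeros(64)
u = rng.standard_normal((128, 64))   # output of the previous layer

# Placement from the paper: normalize the (roughly symmetric) pre-activation
# x = W u + b, then apply the nonlinearity.
y_pre = relu(bn(u @ W.T + b))

# Placement reported better in the caffenet128 benchmark: apply ReLU first,
# then normalize its asymmetric, half-rectified output before the next layer.
y_post = bn(relu(u @ W.T + b))
```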
Hi,
I am running a somewhat similar benchmark, but on caffenet128 (and moving to ResNets now) on ImageNet.
One thing that I have found: the best position of BN in a non-residual net is after ReLU and without the scale+bias layer (https://github.com/ducha-aiki/caffenet-benchmark/blob/master/batchnorm.md); see the sketch at the end of this comment.
Maybe it is worth testing too.
Second, results on CIFAR-10 often contradict results on ImageNet. For example, leaky ReLU > ReLU on CIFAR-10, but worse on ImageNet.
P.S. We could cooperate on ImageNet testing, if you agree.
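To make that configuration concrete, here is a hedged NumPy sketch of a small stack in the "linear → ReLU → BN without scale+bias" arrangement; the layer sizes and names are invented, and this is not the benchmark's actual Caffe prototxt.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bn_no_affine(x, eps=1e-5):
    # Whitening only: no learnable gamma/beta, i.e. no separate Scale layer.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) * 0.1 for _ in range(3)]

# Reported-best arrangement for a plain (non-residual) net:
# each layer's ReLU output is normalized before it feeds the next layer.
h = rng.standard_normal((128, 64))
for W in weights:
    h = bn_no_affine(relu(h @ W))
```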