This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Why LayerNorm before conv in downsampling layers? #132

Open

F-Barto opened this issue Nov 6, 2022 · 0 comments

Thanks for your awesome work!

The stem is consistent with the Blocks, which use the ordering conv->norm, but in the downsampling layers you put LayerNorm before the convolution.

The full path is:

conv2d 4x4, stride 4
layernorm
residual stage 1

layernorm
conv2d 2x2, stride 2
residual stage 2

layernorm
conv2d 2x2, stride 2
residual stage 3

layernorm
conv2d 2x2, stride 2
residual stage 4
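
To make the ordering easier to check, here is a small dependency-free sketch that walks the path above and tracks the spatial resolution. The 224-pixel input, the no-padding assumption, and the names `PATH` and `trace` are mine for illustration only; they are not taken from the repository.

```python
# Layer ordering copied from the list above: (name, kernel, stride).
# kernel/stride of None marks a layer assumed to preserve spatial size.
PATH = [
    ("conv2d 4x4, stride 4", 4, 4),
    ("layernorm", None, None),
    ("residual stage 1", None, None),
    ("layernorm", None, None),
    ("conv2d 2x2, stride 2", 2, 2),
    ("residual stage 2", None, None),
    ("layernorm", None, None),
    ("conv2d 2x2, stride 2", 2, 2),
    ("residual stage 3", None, None),
    ("layernorm", None, None),
    ("conv2d 2x2, stride 2", 2, 2),
    ("residual stage 4", None, None),
]


def trace(size, path=PATH):
    """Return (layer name, output side length) per layer, assuming no padding."""
    out = []
    for name, kernel, stride in path:
        if kernel is not None:
            # Standard output size of an unpadded strided convolution.
            size = (size - kernel) // stride + 1
        out.append((name, size))
    return out


for name, size in trace(224):  # 224 is an assumed input resolution
    print(f"{name:24s} -> {size}x{size}")
```

In this listing the stem is the only place where the norm follows the convolution; every later downsampling step has the norm first.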

This means that if residual stage 1 converges to the identity, the stem's layernorm feeds directly into the first downsampling layernorm, which seems weird to me:

conv2d 4x4, stride 4
layernorm
layernorm
conv2d 2x2, stride 2
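
To illustrate why the back-to-back norms look redundant to me, here is a minimal plain-Python sketch of layer normalization over the channel values at one spatial position. The function `layer_norm`, the sample values, and `eps=1e-6` are assumptions for illustration, not the repository's code. It suggests that the second norm is nearly a no-op only when the first norm's affine parameters are at their identity values; with a non-uniform learned scale, the second norm does change the result.

```python
import math


def layer_norm(x, weight=None, bias=None, eps=1e-6):
    """Normalize x to zero mean and unit variance, then apply an optional affine."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    x_hat = [(v - mean) / math.sqrt(var + eps) for v in x]
    weight = [1.0] * n if weight is None else weight
    bias = [0.0] * n if bias is None else bias
    return [w * v + b for w, v, b in zip(weight, x_hat, bias)]


x = [0.5, -1.2, 3.0, 0.7]  # hypothetical channel values

# Case 1: identity affine on the first norm. The second norm changes almost nothing.
once = layer_norm(x)
twice = layer_norm(once)
max_gap = max(abs(a - b) for a, b in zip(once, twice))

# Case 2: non-uniform scale on the first norm (hypothetical learned weights).
# The second norm now re-centers and re-scales, so the outputs differ.
scaled = layer_norm(x, weight=[2.0, 0.5, 1.0, 1.0])
rescaled = layer_norm(scaled)
max_gap_affine = max(abs(a - b) for a, b in zip(scaled, rescaled))

print(f"identity affine gap:    {max_gap:.2e}")
print(f"non-uniform affine gap: {max_gap_affine:.2e}")
```

So the double norm is only strictly redundant if the stem's layernorm keeps a uniform scale and zero bias, which is part of why I am asking about the intent.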

Can you explain this design choice?
