This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Why LayerNorm before conv in downsampling layers? #132

Open
F-Barto opened this issue Nov 6, 2022 · 0 comments

F-Barto commented Nov 6, 2022

Thanks for your awesome work!

While the stem is consistent with the blocks, which use the ordering conv->norm, the downsampling layers place LayerNorm before the convolution.

The full path is:

  • conv2d 4x4, stride 4
  • layernorm
  • residual stage 1
  • layernorm
  • conv2d 2x2, stride 2
  • residual stage 2
  • layernorm
  • conv2d 2x2, stride 2
  • residual stage 3
  • layernorm
  • conv2d 2x2, stride 2
  • residual stage 4
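
To make the ordering explicit, here is a minimal pure-Python sketch of the path above. It is not the repository's code. The 224x224 input size, the unpadded convolutions, and the assumption that LayerNorm and the residual stages keep the spatial size unchanged are mine; the `PATH` table and the helper names are hypothetical.

```python
# Each entry: (layer name, kernel size, stride). Non-conv layers use None.
PATH = [
    ("conv2d 4x4, stride 4", 4, 4),
    ("layernorm", None, None),
    ("residual stage 1", None, None),
    ("layernorm", None, None),
    ("conv2d 2x2, stride 2", 2, 2),
    ("residual stage 2", None, None),
    ("layernorm", None, None),
    ("conv2d 2x2, stride 2", 2, 2),
    ("residual stage 3", None, None),
    ("layernorm", None, None),
    ("conv2d 2x2, stride 2", 2, 2),
    ("residual stage 4", None, None),
]


def conv_out(size, kernel, stride):
    """Output size of a convolution without padding (assumption)."""
    return (size - kernel) // stride + 1


def trace(size):
    """Return (layer name, spatial size after the layer) for every layer."""
    rows = []
    for name, kernel, stride in PATH:
        if kernel is not None:
            size = conv_out(size, kernel, stride)
        rows.append((name, size))
    return rows


def count_orderings():
    """Count adjacent conv->norm and norm->conv pairs in PATH."""
    names = [name for name, _, _ in PATH]
    conv_then_norm = 0
    norm_then_conv = 0
    for first, second in zip(names, names[1:]):
        if first.startswith("conv2d") and second == "layernorm":
            conv_then_norm += 1
        if first == "layernorm" and second.startswith("conv2d"):
            norm_then_conv += 1
    return conv_then_norm, norm_then_conv


if __name__ == "__main__":
    for name, size in trace(224):
        print(f"{name:24s} -> {size}x{size}")
    print("conv->norm, norm->conv:", count_orderings())
```

Only the stem uses conv->norm; each of the three downsampling layers uses norm->conv.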

This means that if residual stage 1 converges to the identity, one LayerNorm feeds directly into another, which seems weird to me:

  • conv2d 4x4, stride 4
  • layernorm
  • layernorm
  • conv2d 2x2, stride 2
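
A small sketch of why the back-to-back LayerNorms look redundant to me. It is a simplification, not the repository's implementation: it normalizes a single channel vector, uses `eps=1e-6` as an assumed value, and leaves out the learnable weight and bias. With those affine parameters included, the two outputs can differ.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (almost) unit variance.

    No learnable weight/bias here; that is a simplifying assumption.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


x = [0.5, -1.2, 3.0, 0.7]   # arbitrary example channel vector
once = layer_norm(x)         # stem conv -> layernorm
twice = layer_norm(once)     # identity stage -> second layernorm

# Largest element-wise change introduced by the second normalization.
max_diff = max(abs(a - b) for a, b in zip(once, twice))

if __name__ == "__main__":
    print("after one LayerNorm :", once)
    print("after two LayerNorms:", twice)
    print("max difference      :", max_diff)
```

Under these assumptions the second normalization changes the values only through the `eps` term, so it is close to a no-op.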

Can you explain this design choice?
