Question 1x1 conv vs linear #18
Update: I retested the comparison, and the information I gave before was not exactly accurate (I remembered wrong, so I deleted it to avoid misleading anyone). Sorry about the confusion. My latest test observations on V100 GPU inference throughput are below:
Looking at 2 and 3, the ultimate reason why (NCHW -> permute to NHWC -> PyTorch LN -> linear layers -> layer scale -> permute back to NCHW) is slightly faster than (NCHW -> custom LN -> 1x1 convs -> layer scale) seems to be that our custom LN operating on NCHW tensors is much slower than PyTorch's LN, which only supports normalizing over the last (channel) dimension, i.e. NHWC tensors. So we need the permutation to NHWC anyway to use PyTorch's LN, and given the observation in 1 (without permutation, linear is faster than 1x1 convs), we use linear layers for the "MLP" part before permuting back.
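For concreteness, here is a minimal sketch of the two orderings being compared for the "MLP" part of the block. This is not the repository's exact Block code; the layer names, dimensions, and the simple channels-first LN below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelsFirstLN(nn.Module):
    """A simple channels-first LayerNorm: normalizes over C at each (h, w) position."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):                                  # x: (N, C, H, W)
        mu = x.mean(1, keepdim=True)
        var = x.var(1, keepdim=True, unbiased=False)
        x = (x - mu) / torch.sqrt(var + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]

class MLPBlockNHWC(nn.Module):
    """NCHW -> permute to NHWC -> PyTorch LN -> Linear layers -> layer scale -> permute back."""
    def __init__(self, dim, expansion=4, ls_init=1e-6):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                      # normalizes the last (channel) dim
        self.fc1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(expansion * dim, dim)
        self.gamma = nn.Parameter(ls_init * torch.ones(dim))   # layer scale

    def forward(self, x):                                  # x: (N, C, H, W)
        y = x.permute(0, 2, 3, 1)                          # -> (N, H, W, C)
        y = self.gamma * self.fc2(self.act(self.fc1(self.norm(y))))
        return x + y.permute(0, 3, 1, 2)                   # back to (N, C, H, W)

class MLPBlockNCHW(nn.Module):
    """NCHW -> custom channels-first LN -> 1x1 convs -> layer scale, no permutes."""
    def __init__(self, dim, expansion=4, ls_init=1e-6):
        super().__init__()
        self.norm = ChannelsFirstLN(dim)
        self.pw1 = nn.Conv2d(dim, expansion * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pw2 = nn.Conv2d(expansion * dim, dim, kernel_size=1)
        self.gamma = nn.Parameter(ls_init * torch.ones(dim, 1, 1))

    def forward(self, x):                                  # x: (N, C, H, W)
        return x + self.gamma * self.pw2(self.act(self.pw1(self.norm(x))))
```

Both variants compute the same function; any throughput difference comes from the memory layout, the permutations, and which LN kernel is used.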
Thank you for your swift response and the very detailed account of the architectural design; indeed, I only noticed your comment on line 30 :). One last question, if you don't mind, before I close this issue: was the choice of GELU over ReLU due to some observation of dying neurons, or was it chosen solely based on the related Transformer papers (BERT, GPT-2) as mentioned? Is there a case where you experimented with alternatives like Swish? Many thanks again, all the best for your future work.
The choice of GELU over ReLU is partly due to imitating Transformers. Another interesting observation is that if we stick to ReLU, the training curve in the next step, "Fewer activations", becomes a bit strange, although it can still converge to a reasonable level in the end. We didn't try activations other than ReLU and GELU.
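To make the "Fewer activations" step concrete, here is a hedged sketch (not the repository's code; layer shapes and names are illustrative) contrasting a ResNet-style stack that applies an activation after every conv with a ConvNeXt-style block that keeps a single activation between the two pointwise layers. Swapping `act=nn.ReLU` reproduces the ReLU-vs-GELU comparison described above.

```python
import torch.nn as nn

def resnet_style(dim):
    # activation after each conv (three in total)
    return nn.Sequential(
        nn.Conv2d(dim, dim, 1), nn.ReLU(),
        nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        nn.Conv2d(dim, dim, 1), nn.ReLU(),
    )

def convnext_style(dim, act=nn.GELU):
    # single activation, placed between the two pointwise (1x1) layers
    return nn.Sequential(
        nn.Conv2d(dim, dim, 7, padding=3, groups=dim),  # depthwise conv
        nn.Conv2d(dim, 4 * dim, 1),
        act(),
        nn.Conv2d(4 * dim, dim, 1),
    )
```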
Congratulations on your work and thanks for sharing! I'd like to naively ask: what is the reason behind implementing the 1x1 convs with fully connected layers? I know they are equivalent, but I had thought the latter was less efficient.
Thanks in advance!
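As a quick sanity check of the equivalence mentioned in the question, the following snippet (shapes are arbitrary examples) copies a 1x1 conv's weights into a Linear layer and verifies that both produce the same output; only the tensor layout differs.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 14, 14)                           # (N, C, H, W)

conv = nn.Conv2d(64, 128, kernel_size=1)
lin = nn.Linear(64, 128)
lin.weight.data.copy_(conv.weight.data.view(128, 64))    # reuse the conv's weights
lin.bias.data.copy_(conv.bias.data)

out_conv = conv(x)                                        # (N, 128, H, W)
out_lin = lin(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # NHWC -> Linear -> NCHW

print(torch.allclose(out_conv, out_lin, atol=1e-5))       # True, up to float tolerance
```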