
Sik-Ho Tang | Review -- MoCo v3: An Empirical Study of Training Self-Supervised Vision Transformers. #140

Open
NorbertZheng opened this issue Dec 18, 2023 · 5 comments

Comments

@NorbertZheng (Owner)

Sik-Ho Tang. Review — MoCo v3: An Empirical Study of Training Self-Supervised Vision Transformers.

@NorbertZheng (Owner, Author)

Overview

Instability Study of ViT for Self-Supervised Learning.
image
Vision Transformer (ViT) Network Architecture (Figure from ViT).

An Empirical Study of Training Self-Supervised Vision Transformers
MoCo v3, by Facebook AI Research (FAIR)
2021 ICCV, Over 100 Citations
Self-Supervised Learning, Unsupervised Learning, Contrastive Learning, Representation Learning, Image Classification, Vision Transformer (ViT)

MoCo v3 is an incremental improvement of MoCo v1/MoCo v2, studying the instability issue when ViT is used for self-supervised learning.

@NorbertZheng (Owner, Author)

MoCo v3 Using ResNet (Before Using ViT)

image
MoCo v3: PyTorch-like Pseudocode.

  • As in MoCo v2, the InfoNCE loss introduced in CPC is used for training. A larger batch is used to include more negative samples.
  • Different from MoCo v2, in MoCo v3 the keys naturally co-exist in the same batch, and the memory queue (memory bank) is abandoned. In this respect, the setting is the same as SimCLR.
  • The encoder $f_{q}$ consists of a backbone (e.g., ResNet, ViT), a projection head [10], and an extra prediction head [18].
  • The encoder $f_{k}$ has the backbone and projection head, but not the prediction head. $f_{k}$ is updated by the moving-average of $f_{q}$ as in MoCo, excluding the prediction head.
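The pseudocode referenced above is only an image; a minimal NumPy sketch of the two ingredients described in the bullets (in-batch InfoNCE, with keys co-existing in the batch, and the moving-average update of $f_{k}$) might look like this. Shapes and the helper names are hypothetical; the real implementation is PyTorch and symmetrizes the loss over two augmented crops.

```python
import numpy as np

def info_nce(q, k, tau=0.2):
    """Contrastive loss: the positive key for q[i] is k[i], all other
    keys in the batch serve as negatives (no memory queue)."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                       # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob.diagonal().mean()           # cross-entropy on positives

def momentum_update(theta_q, theta_k, m=0.99):
    """f_k tracks f_q by moving average (prediction head excluded)."""
    return [m * wk + (1 - m) * wq for wq, wk in zip(theta_q, theta_k)]
```

When `q` and `k` agree, the diagonal of the similarity matrix dominates and the loss is small; mismatched pairs raise it.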

image
The linear probing accuracy with ResNet-50 (R50) on ImageNet.

With ResNet-50, the improvement is mainly due to the extra prediction head and large-batch (4096) training.

@NorbertZheng (Owner, Author)

Stability Study for Basic Factors When Using ViT

It is straightforward to replace a ResNet backbone with a ViT backbone. But in practice, a main challenge is the instability of training.

Batch Size

image
Training curves of different batch sizes.

A larger batch is also beneficial for accuracy. Batches of 1k and 2k produce reasonably smooth curves, reaching 71.5% and 72.6% linear probing accuracy respectively.

The curve for a 4k batch becomes noticeably unstable: see the "dips". The curve for a 6k batch shows even worse failure patterns.

Learning Rate

image
Training curves of different learning rates.

When $lr$ is smaller, the training is more stable, but it is prone to under-fitting.

A larger $lr$ of 1.5e-4 produces more dips in the curve, and its accuracy is lower: in this regime, accuracy is determined mainly by stability.
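The $lr$ values quoted here are base learning rates: the MoCo v3 paper sets the actual rate by the linear scaling rule, $lr \times \text{BatchSize}/256$ (this rule is from the paper, not stated above). As a one-liner:

```python
def scaled_lr(base_lr: float, batch_size: int) -> float:
    """Linear scaling rule: actual lr = base lr * BatchSize / 256."""
    return base_lr * batch_size / 256
```

So a base $lr$ of 1.5e-4 at batch 4096 corresponds to an actual rate of 2.4e-3.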

Optimizer

image
Training curves of LAMB optimizer.

  • AdamW is the default optimizer; LAMB is studied as an alternative.
  • Although LAMB can avoid sudden changes in the gradients, the negative impact of unreliable gradients accumulates.

As a result, the authors opt to use AdamW.
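For reference, AdamW differs from Adam by decoupling weight decay from the gradient-based update. A hand-rolled single-parameter sketch (not the PyTorch implementation, and hyperparameters here are illustrative):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.1):
    """One AdamW step: Adam moment estimates plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)  # decay applied to w directly
    return w, m, v
```

With a zero gradient the parameter still shrinks by `lr * wd * w`, which is exactly the "decoupled" decay.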

@NorbertZheng (Owner, Author)

Tricks for Improving Stability

Random Patch Projection

image
Gradient magnitude, shown as relative values for the layer.

It is found that a sudden change of gradients (a "spike") causes a "dip" in the training curve.

The gradient spikes happen earlier in the first layer (patch projection), and are delayed by a few iterations in the last layers.
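This observation comes from monitoring per-layer gradient norms over training. A crude, hypothetical spike detector (not from the paper) could flag iterations whose norm jumps well above the recent trailing median:

```python
import numpy as np

def spike_indices(grad_norms, window=10, factor=5.0):
    """Flag iterations whose gradient norm exceeds `factor` times the
    median of the trailing `window` -- a rough proxy for the 'spikes'."""
    out = []
    for i in range(window, len(grad_norms)):
        med = np.median(grad_norms[i - window:i])
        if grad_norms[i] > factor * med:
            out.append(i)
    return out
```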

Would a residual connection around the patch projection help here?

image
Random vs. learned patch projection.

The instability happens earlier in the shallower layers.

  • The patch projection layer is frozen during training; in other words, a fixed random patch projection layer is used to embed the patches.
  • Random patch projection stabilizes training, yielding a smoother and better training curve. This stability benefits the final accuracy, boosting it by 1.7% to 73.4%.
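A minimal NumPy sketch of the trick: the projection matrix is sampled once and never updated. Shapes assume ViT-B/16-style inputs (16×16×3 patches embedded to 768-d); the helper names are hypothetical.

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W, C) image into flattened p x p patches."""
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)        # (num_patches, p*p*C)

rng = np.random.default_rng(0)
W_proj = rng.normal(size=(16 * 16 * 3, 768))     # sampled once, then frozen

def embed(img):
    # W_proj receives no gradient updates: a fixed random projection
    return patchify(img) @ W_proj
```

For a 224×224×3 image this yields 196 patch embeddings of dimension 768.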

Random Patch Projection on SimCLR and BYOL

image
Random vs. learned patch projection on SimCLR and BYOL.

Random patch projection improves stability in both SimCLR and BYOL, increasing accuracy by 0.8% and 1.3% respectively.

@NorbertZheng (Owner, Author)

Experimental Results

Models

image
Configurations of ViT models.

Training Time

image
Training ViT-B for 100 epochs takes 2.1 hours; ViT-H takes 9.8 hours per 100 epochs, using 512 TPUs.

Self-Supervised Learning Frameworks

image
ViT-S/16 and ViT-B/16 in different self-supervised learning frameworks (ImageNet, linear probing).

  • MoCo v3 has better accuracy on ViT than other frameworks.

image
Different backbones in different frameworks.

  • MoCo v3 and SimCLR are more favorable for ViT-B than R50.
