Sik-Ho Tang | Review -- BEiT: BERT Pre-Training of Image Transformers. #79

NorbertZheng opened this issue Apr 4, 2023 · 7 comments

Overview

BEiT, Pretraining ViT, Using Masked Image Modeling (MIM).

BEiT: BERT Pre-Training of Image Transformers. BEiT, by Microsoft Research. 2022 ICLR, Over 300 Citations.

Self-Supervised Learning, BERT, Transformer, Vision Transformer, ViT, DALL·E.

  • Bidirectional Encoder representation from Image Transformers (BEiT) is proposed, where a masked image modeling (MIM) task is used to pretrain Vision Transformers.
  • BEiT first “tokenizes” the original image into visual tokens. Then some image patches are randomly masked and fed into the backbone Transformer.
  • The pre-training objective is to recover the original visual tokens based on the corrupted image.


BEiT Architecture

Figure: Overview of BEiT pre-training.

Overall Approach

  • Inspired by BERT, a pre-training task is proposed, namely, masked image modeling (MIM).
  • MIM uses two views for each image, i.e., image patches and visual tokens.
  • The image is split into a grid of patches, which serve as the input representation of the backbone Transformer.
  • The image is also “tokenized” into discrete visual tokens by the latent codes of a discrete VAE (the discrete VAE from DALL·E).

During pre-training, some proportion of the image patches is randomly masked, and the corrupted input is fed to the Transformer.

  • The model learns to recover the visual tokens of the original image, instead of the raw pixels of masked patches.


Image Representation

During pre-training, each image has two views of representation, namely,

  • image patches: the input representation,
  • visual tokens: the output (prediction-target) representation.

Image Patches

Figure: Image patches (cut from the first figure).

  • The 2D image of size $H\times W\times C$ is split into a sequence of $N=\frac{HW}{P^{2}}$ patches $x_{i}^{p}$ ($i=1,\dots,N$), each of size $P\times P$.
  • The image patches are flattened into vectors and linearly projected, which is similar to word embeddings in BERT.

Particularly, BEiT splits each $224\times 224$ image into a $14\times 14$ grid of image patches, where each patch is $16\times 16$.
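As a rough illustration of this splitting (a minimal PyTorch sketch, not the official BEiT code; shapes assume the 224×224 image and 16×16 patch setting above):

```python
# Minimal sketch: split a 224x224 RGB image into 196 flattened 16x16 patches
# and linearly project them, analogous to word embeddings in BERT.
import torch

H = W = 224                      # image resolution
P = 16                           # patch size
C = 3                            # channels
N = (H // P) * (W // P)          # 14 * 14 = 196 patches

image = torch.randn(C, H, W)     # dummy image tensor

# extract non-overlapping PxP blocks, then flatten each block into a vector
patches = image.unfold(1, P, P).unfold(2, P, P)                  # (C, 14, 14, P, P)
patches = patches.permute(1, 2, 0, 3, 4).reshape(N, C * P * P)   # (196, 768)

# linear projection of flattened patches into patch embeddings
embed = torch.nn.Linear(C * P * P, 768)
patch_embeddings = embed(patches)                                # (196, 768)
```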

Visual Tokens

Figure: Visual tokens (cut from the first figure).

  • The image is represented as a sequence of discrete tokens obtained by an “image tokenizer”, instead of raw pixels.

Specifically, the image of size $H\times W\times C$ is tokenized into $z=[z_{1},\dots,z_{N}]$, where the vocabulary $V=\{1,\dots,|V|\}$ contains discrete token indices.

  • The image tokenizer learned by a discrete variational autoencoder (dVAE), from DALL·E, is used directly.
  • There are two modules during visual token learning, namely, the tokenizer and the decoder.
  • The tokenizer $q(z|x)$ maps image pixels $x$ into discrete tokens $z$ according to a visual codebook (i.e., vocabulary); a conceptual sketch follows this list.
  • The decoder $p(x|z)$ learns to reconstruct the input image $x$ based on the visual tokens $z$.
  • The vocabulary size is set to $|V| = 8192$.
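BEiT reuses the publicly released DALL·E dVAE as its tokenizer; as a purely conceptual sketch (the encoder below is a toy stand-in, not DALL·E's actual network), $q(z|x)$ can be pictured as predicting, per spatial location, a distribution over the 8192-entry vocabulary and taking the most likely index:

```python
# Conceptual sketch of an image tokenizer q(z|x); the encoder here is a
# toy stand-in, not the actual DALL·E dVAE network that BEiT uses.
import torch
import torch.nn as nn

VOCAB_SIZE = 8192        # |V| in the paper
GRID = 14                # BEiT uses one visual token per image patch

class ToyTokenizer(nn.Module):
    def __init__(self):
        super().__init__()
        # a tiny convolutional encoder that downsamples 224x224 -> 14x14
        # and outputs a distribution over the visual vocabulary per location
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(64, 256, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(256, VOCAB_SIZE, kernel_size=1),
        )

    def forward(self, x):                    # x: (B, 3, 224, 224)
        logits = self.encoder(x)             # (B, 8192, 14, 14)
        tokens = logits.argmax(dim=1)        # discrete ids in {0, ..., 8191}
        return tokens.flatten(1)             # (B, 196) visual tokens z

tokens = ToyTokenizer()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 196])
```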


ViT Backbone

  • Following ViT, the standard Transformer is used as the backbone network.
  • The input of the Transformer is a sequence of image patches $x_{i}^{p}$.
  • The patches are then linearly projected to obtain patch embeddings $Ex_{i}^{p}$.
  • The standard learnable 1D position embeddings $E_{pos}$ are added to the patch embeddings:
    $$H_{0}=[e_{[CLS]},Ex_{1}^{p},\dots,Ex_{N}^{p}]+E_{pos}$$
  • The encoder contains $L$ layers of Transformer blocks:
    $$H_{l}=\mathrm{Transformer}(H_{l-1}),\quad l=1,\dots,L$$
  • The output vectors of the last layer are:
    $$H_{L}=[h_{[CLS]}^{L},h_{1}^{L},\dots,h_{N}^{L}]$$
    which are used as the encoded representations of the image patches, where $h_{i}^{L}$ is the vector of the $i$-th image patch.
  • ViT-Base is used, which is a 12-layer Transformer with 768 hidden size and 12 attention heads; the intermediate size of the feed-forward networks is 3072. A sketch of this backbone follows the list.
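A minimal sketch of this backbone (PyTorch's built-in `TransformerEncoder` is used here as a stand-in for BEiT's actual block implementation; the `ViTBackbone` name and interface are illustrative, with hyperparameters following the ViT-Base setting above):

```python
# Minimal ViT-style backbone sketch following the equations above:
# patch embedding E, [CLS] token, learnable 1D position embeddings E_pos,
# and L Transformer blocks (ViT-Base hyperparameters).
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    def __init__(self, num_patches=196, patch_dim=768, dim=768,
                 depth=12, heads=12, mlp_dim=3072):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)                           # E
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                 # e_[CLS]
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))   # E_pos
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)    # L layers

    def forward(self, patches):               # patches: (B, 196, 768)
        x = self.proj(patches)
        cls = self.cls.expand(x.size(0), -1, -1)
        h0 = torch.cat([cls, x], dim=1) + self.pos   # H_0
        return self.blocks(h0)                # H_L: (B, 197, 768)

out = ViTBackbone()(torch.randn(2, 196, 768))
print(out.shape)    # torch.Size([2, 197, 768])
```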


BEiT Pretraining: Masked Image Modeling (MIM)

Masked Image Modeling (MIM)

Figure: BEiT masked image modeling (MIM) (cut from the first figure).

  • After splitting the image into image patches, as described above, approximately 40% of the image patches are randomly masked; the set of masked positions is denoted as $M$. The masked patches are replaced with a learnable embedding $e_{[M]}$. In BEiT, at most 75 patches are masked.
  • Then, the unmasked and masked image patches are fed into the $L$-layer Transformer.
  • A softmax classifier is used to predict the corresponding visual tokens:
    $$p_{MIM}(z'|x^{M})=\mathrm{softmax}_{z'}(W_{c}h_{i}^{L}+b_{c})$$
    where $x^{M}$ is the corrupted image.

The pre-training objective is to maximize the log-likelihood of the correct visual tokens $z_{i}$ given the corrupted image:
$$\max\sum_{x\in\mathcal{D}}\mathbb{E}_{M}\Big[\sum_{i\in M}\log p_{MIM}(z_{i}|x^{M})\Big]$$
where $\mathcal{D}$ is the training corpus and $M$ denotes the randomly masked positions.
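A sketch of this masking-and-prediction step (reusing the illustrative `ViTBackbone` from the backbone sketch above; `mim_loss`, `mask_embed`, and `head` are illustrative names, and in a real setup they would live inside a single trainable module):

```python
# Sketch of a single MIM pre-training step: replace masked patch embeddings
# with a learnable e_[M], run the backbone on the corrupted input, and apply
# cross-entropy only at the masked positions (the negative MIM log-likelihood).
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = ViTBackbone()
mask_embed = nn.Parameter(torch.zeros(768))    # learnable e_[M]
head = nn.Linear(768, 8192)                    # softmax classifier (W_c, b_c)

def mim_loss(patches, visual_tokens, mask):
    # patches:       (B, 196, 768) flattened image patches
    # visual_tokens: (B, 196)      target tokens z_i from the image tokenizer
    # mask:          (B, 196)      boolean, True at the masked positions M
    x = backbone.proj(patches)                               # patch embeddings
    x = torch.where(mask.unsqueeze(-1), mask_embed, x)       # corrupted input x^M
    cls = backbone.cls.expand(x.size(0), -1, -1)
    h = backbone.blocks(torch.cat([cls, x], dim=1) + backbone.pos)
    logits = head(h[:, 1:])                                  # drop [CLS]: (B, 196, 8192)
    return F.cross_entropy(logits[mask], visual_tokens[mask])
```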

  • BEiT is pretrained on the training set of ImageNet-1K.
  • The pre-training runs for about 500k steps (i.e., 800 epochs) with 2k batch size. The 500k training steps take about five days using 16 Nvidia Tesla V100 32GB GPU cards.


Blockwise Masking

Figure: Blockwise masking algorithm.

  • Blocks of patches are masked randomly, as shown in the figure and algorithm above, instead of masking each patch independently at random (which is less stable); a rough sketch is given below.
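A rough sketch of this blockwise sampling on the 14×14 patch grid (the minimum block size of 16 patches and the 0.3 to 1/0.3 aspect-ratio range follow the paper's algorithm as described; other details may differ from the official code):

```python
# Rough sketch of blockwise masking: repeatedly sample rectangular blocks of
# patches (bounded size and aspect ratio) until roughly 40% of the grid is
# masked. Not the official BEiT implementation.
import math
import random
import torch

def blockwise_mask(grid=14, mask_ratio=0.4, min_block=16):
    num_patches = grid * grid
    target = int(num_patches * mask_ratio)   # ~78 here; BEiT caps at 75 in practice
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    while mask.sum() < target:
        # block size s and aspect ratio r, then derive block height a and width b
        s = random.randint(min_block, max(min_block, target - int(mask.sum())))
        r = random.uniform(0.3, 1 / 0.3)
        a = min(grid, max(1, int(round(math.sqrt(s * r)))))
        b = min(grid, max(1, int(round(math.sqrt(s / r)))))
        # random top-left corner, then mask the whole block
        t = random.randint(0, grid - a)
        l = random.randint(0, grid - b)
        mask[t:t + a, l:l + b] = True
    return mask.flatten()                    # (196,) boolean mask

print(blockwise_mask().sum())   # roughly 75-80 masked patches (may overshoot slightly)
```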


From VAE Perspective

  • The BEiT pre-training can be viewed as variational autoencoder training:
    $$\sum_{x_{i}\in\mathcal{D}}\log p(x_{i}|\tilde{x}_{i})\geq\sum_{x_{i}\in\mathcal{D}}\Big(\mathbb{E}_{z_{i}\sim q_{\phi}(z|x_{i})}\big[\log p_{\psi}(x_{i}|z_{i})\big]-D_{KL}\big[q_{\phi}(z|x_{i}),p_{\theta}(z|\tilde{x}_{i})\big]\Big)$$
    where $\tilde{x}_{i}$ is the corrupted (masked) image, $q_{\phi}(z|x)$ is the image tokenizer, $p_{\psi}(x|z)$ is the decoder, and $p_{\theta}(z|\tilde{x})$ recovers the visual tokens from the corrupted image (the MIM task).
  • In the first stage, the image tokenizer is obtained as a discrete variational autoencoder. Specifically, the first stage minimizes the reconstruction loss with a uniform prior.
  • In the second stage, the prior $p_{\theta}$ is learnt while keeping $q_{\phi}$ and $p_{\psi}$ fixed.
  • Treating $q_{\phi}(z|x_{i})$ as a one-point distribution at the most likely visual tokens $\hat{z}_{i}=\arg\max_{z}q_{\phi}(z|x_{i})$, the above equation is re-written as:
    $$\sum_{x_{i}\in\mathcal{D}}\Big(\mathbb{E}_{z_{i}\sim q_{\phi}(z|x_{i})}\big[\log p_{\psi}(x_{i}|z_{i})\big]+\log p_{\theta}(\hat{z}_{i}|\tilde{x}_{i})\Big)$$
    where the second term is the proposed BEiT pre-training objective (masked image modeling).
