Sik-Ho Tang | Review -- BEiT: BERT Pre-Training of Image Transformers. #79
Overview
BEiT: pretraining ViT using Masked Image Modeling (MIM).
BEiT: BERT Pre-Training of Image Transformers, by Microsoft Research. 2022 ICLR, over 300 citations.
Self-Supervised Learning, BERT, Transformer, Vision Transformer, ViT, DALL·E.
BEiT Architecture
Overall Approach
During pre-training, some proportion of image patches is randomly masked, and the corrupted input is fed to the Transformer. The model learns to recover the original visual tokens of the masked patches, rather than their raw pixels.
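A minimal sketch of this corruption step, assuming PyTorch and patch embeddings as input; the function and tensor names are illustrative only. BEiT itself uses blockwise masking of roughly 40% of the patches, which the plain uniform masking below only approximates.

```python
import torch

def corrupt_patches(patch_embeddings: torch.Tensor,
                    mask_embedding: torch.Tensor,
                    mask_ratio: float = 0.4):
    """Replace a random subset of patch embeddings with a shared [MASK] embedding.

    patch_embeddings: (batch, num_patches, dim)
    mask_embedding:   (dim,) learnable vector
    Returns the corrupted sequence and the boolean mask of corrupted positions.
    """
    b, n, d = patch_embeddings.shape
    num_masked = int(mask_ratio * n)
    # Rank each position by a random score; the lowest-ranked `num_masked` get masked.
    # (BEiT uses blockwise masking; uniform random masking is a simplification.)
    ranks = torch.rand(b, n, device=patch_embeddings.device).argsort(dim=1).argsort(dim=1)
    masked_positions = ranks < num_masked                      # (b, n) bool
    corrupted = torch.where(masked_positions.unsqueeze(-1),
                            mask_embedding.expand(b, n, d),
                            patch_embeddings)
    return corrupted, masked_positions
```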
Image Representation
During pre-training, the images have two views of representation, namely, image patches and visual tokens.

Image Patches
The 2D image is split into a sequence of flattened patches so that it can be fed into the standard Transformer. Particularly, BEiT splits each 224×224 image into a 14×14 grid of image patches, where each patch is 16×16 pixels.

Visual Tokens
The image is also represented as a sequence of discrete tokens produced by the image tokenizer of DALL·E (a discrete VAE), rather than raw pixels. Specifically, the image of size 224×224 is tokenized into a 14×14 grid of visual tokens, drawn from a vocabulary of 8192 tokens.
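A minimal sketch of the image-patch view, assuming PyTorch; the helper name to_patches is just for illustration. The visual-token view is produced separately by the pre-trained DALL·E image tokenizer and is not sketched here.

```python
import torch

def to_patches(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split images of shape (batch, 3, 224, 224) into a sequence of flattened patches.

    Returns a tensor of shape (batch, 196, 768): a 14x14 = 196 patch grid,
    each patch flattened to 16*16*3 = 768 values.
    """
    b, c, h, w = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)   # (b, c, 14, 14, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5)        # (b, 14, 14, c, p, p)
    return patches.reshape(b, (h // p) * (w // p), c * p * p)

x = torch.randn(2, 3, 224, 224)
print(to_patches(x).shape)   # torch.Size([2, 196, 768])
```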
ViT Backbone
Following ViT, the flattened image patches are linearly projected to patch embeddings, a special classification token is prepended, learnable 1D position embeddings are added, and the resulting sequence is fed into a standard Transformer encoder.
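A minimal sketch of such a backbone in PyTorch, assuming the ViT-Base configuration (768-dim embeddings, 12 layers, 12 heads); the class name and the use of nn.TransformerEncoder are simplifications, not the paper's exact implementation (which, for example, uses pre-norm blocks).

```python
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    def __init__(self, num_patches=196, patch_dim=768, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, dim)             # linear projection of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # special classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable 1D positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim), e.g. the output of to_patches above
        x = self.patch_proj(patches)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        return self.encoder(x)    # (batch, num_patches + 1, dim)
```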
BEiT Pretraining: Masked Image Modeling (MIM)
The pre-training objective is to maximize the log-likelihood of the correct visual tokens given the corrupted image.
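Concretely, a softmax classifier on top of the final hidden states predicts, at each masked position, the visual token of the original image, and the objective is

$$\max \sum_{x \in \mathcal{D}} \mathbb{E}_{\mathcal{M}}\left[\, \sum_{i \in \mathcal{M}} \log p_{\mathrm{MIM}}\!\left(z_i \mid x^{\mathcal{M}}\right) \right],$$

where $\mathcal{D}$ is the training corpus, $\mathcal{M}$ the set of masked positions, $x^{\mathcal{M}}$ the corrupted image, and $z_i$ the visual token at position $i$. In practice this is a cross-entropy loss over the 8192-token vocabulary, computed only at masked positions; a minimal sketch, assuming PyTorch and the tensor layout described in the docstring:

```python
import torch.nn.functional as F

def mim_loss(logits, visual_tokens, masked_positions):
    """Cross-entropy over the visual-token vocabulary, only at masked patch positions.

    logits:           (batch, num_patches, vocab_size) classifier outputs per patch position
    visual_tokens:    (batch, num_patches) ground-truth token ids from the image tokenizer
    masked_positions: (batch, num_patches) boolean mask of corrupted positions
    """
    return F.cross_entropy(logits[masked_positions], visual_tokens[masked_positions])
```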
Sik-Ho Tang. Review — BEiT: BERT Pre-Training of Image Transformers.