
Question on Masking ratio #47

Open
sukun1045 opened this issue Dec 15, 2023 · 5 comments

Comments

@sukun1045

Hi, thanks for sharing the PyTorch implementation! I am curious how you chose the statistics for the varied masking ratios. In the paper, you mention 'a truncated Gaussian distribution centered at 0.55, left truncated by 0.5, and right truncated by 1.' What is the motivation for using such a distribution? Why not use the cosine schedule as in MaskGIT? Thank you!

@LTH14
Owner

LTH14 commented Dec 16, 2023

Thanks for your interest! The masking ratio is left truncated at 0.5 so that we can always drop 50% of the input tokens in the ViT encoder, which saves a large amount of computation (a similar idea to MAE). In Table 5 of the paper, we show ablations on the center and std of the Gaussian distribution. We also tried a cosine masking-ratio schedule similar to MaskGIT, and the performance is slightly worse.
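For reference, the sampling described above can be sketched with `scipy.stats.truncnorm`. This is a minimal illustration, not the repository's exact code; the function name and defaults are mine, chosen to match the paper's stated parameters (center 0.55, std 0.25, truncated to [0.5, 1]):

```python
import numpy as np
from scipy import stats

def sample_mask_ratio(mu=0.55, std=0.25, lo=0.5, hi=1.0, rng=None):
    """Sample a masking ratio from a Gaussian truncated to [lo, hi].

    scipy's truncnorm expects the truncation bounds expressed in
    standard-deviation units relative to the mean.
    """
    a, b = (lo - mu) / std, (hi - mu) / std
    return float(stats.truncnorm(a, b, loc=mu, scale=std).rvs(random_state=rng))

rng = np.random.default_rng(0)
ratios = [sample_mask_ratio(rng=rng) for _ in range(1000)]
# every sampled ratio is at least 0.5, so the encoder can always drop >= 50% of tokens
assert all(0.5 <= r <= 1.0 for r in ratios)
```

Because the left tail below 0.5 is cut off, the effective mean of the sampled ratios sits noticeably above the 0.55 center (around 0.69 for these parameters).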

@sukun1045
Author

Thanks for your reply! I am also curious about training convergence and selecting the best model among variants. I think the best eval loss could vary across masking strategies. How can I find the best masking strategy when I run these experiments? For example, if I choose the truncnorm with mu=0.55 and std=0.25, should I run the training until it converges, check the FID score, and then run another experiment?

@LTH14
Owner

LTH14 commented Dec 19, 2023

Our evaluation protocol is based on both FID and linear probing accuracy -- once we train a model with certain hyper-parameters, we evaluate it on ImageNet and pick the best hyper-parameters based on both metrics.

@sukun1045
Author

Thanks again for your reply. Regarding linear probing, have you tried using the CLS token output instead of average-pooling the rest of the encoder output features? I saw the latter in your code, but I wondered how the choice might affect performance.

@LTH14
Owner

LTH14 commented Jan 2, 2024

We tried using the CLS token. However, the performance is not very stable -- it normally achieves similar performance to the average-pooled features, but occasionally it gets very poor accuracy (~10%). Therefore we chose the global average pooling feature for stability.
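The two feature choices discussed above can be sketched as follows. This is an illustrative snippet assuming a typical ViT token layout (CLS token first, then patch tokens) with made-up shapes, not the repository's actual probing code:

```python
import torch

# encoder output: (batch, 1 + num_patches, dim); CLS token at index 0
# (a common ViT layout -- shapes here are illustrative)
tokens = torch.randn(8, 197, 768)

cls_feat = tokens[:, 0]                # CLS-token feature (less stable, per the thread)
avg_feat = tokens[:, 1:].mean(dim=1)   # global average pool over patch tokens

# a linear probe is a single trainable linear layer on top of frozen features
probe = torch.nn.Linear(768, 1000)     # 1000 = ImageNet classes
logits = probe(avg_feat)
```

Either `cls_feat` or `avg_feat` can be fed to the probe; the thread reports that the average-pooled feature gives the more stable linear-probing accuracy.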
