Question about DenseCLIP for Any Visual Backbone #47
Congratulations on your great work! @raoyongming
If I use Swin Transformer-T as the image encoder, the output image feature is [B, 768, 16, 12]. Is the attention pooling layer used to map the image features into the embedding space ([B, 512, 16, 12]), which are then compared against the text features by similarity? Can I replace it with a linear layer?
Yes, we use a randomly initialized attention pooling layer to map the image features into the embedding space. It might be okay to use a simpler linear layer, but we haven't tried it in our experiments.
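A minimal sketch of what such an attention pooling layer could look like, in the spirit of CLIP's AttentionPool2d but returning dense per-pixel embeddings. The module name, the spatial size (16, 12), and the dimensions 768 / 512 are illustrative assumptions for the Swin-T case discussed above, not the repository's exact configuration:

```python
import torch
import torch.nn as nn


class AttentionPool2d(nn.Module):
    """Sketch: map a [B, C, H, W] feature map into the text embedding space.

    A randomly initialized attention pooling layer; the per-pixel outputs form
    a [B, embed_dim, H, W] map that can be compared with text embeddings.
    """

    def __init__(self, spatial_dim=(16, 12), in_dim=768, embed_dim=512, num_heads=8):
        super().__init__()
        num_tokens = spatial_dim[0] * spatial_dim[1]
        # learnable positional embedding for the global token + spatial tokens
        self.pos_embed = nn.Parameter(torch.randn(num_tokens + 1, 1, in_dim) * 0.02)
        self.attn = nn.MultiheadAttention(in_dim, num_heads)
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.flatten(2).permute(2, 0, 1)                # [H*W, B, C]
        x = torch.cat([x.mean(0, keepdim=True), x], 0)   # prepend mean-pooled global token
        x = x + self.pos_embed
        x, _ = self.attn(x, x, x, need_weights=False)    # self-attention pooling
        x = self.proj(x)                                  # project to the embedding space
        global_feat = x[0]                                # [B, embed_dim]
        dense_feat = x[1:].permute(1, 2, 0).reshape(b, -1, h, w)  # [B, embed_dim, H, W]
        return global_feat, dense_feat


# usage with the shapes mentioned in the question
pool = AttentionPool2d()
feat = torch.randn(2, 768, 16, 12)
g, d = pool(feat)  # g: [2, 512], d: [2, 512, 16, 12]
```

The simpler linear alternative mentioned above would correspond to something like a 1x1 convolution, e.g. `nn.Conv2d(768, 512, kernel_size=1)`, applied to the same feature map; whether it matches the attention pooling layer in accuracy is untested per the reply above.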
@raoyongming Hi, when you ran the any-visual-backbone experiments, did you also try an ImageNet pre-trained ViT? I tried using an ImageNet pre-trained ViT, but the results did not improve. What do you think could be the reason?
Hi, we have only run experiments on the ResNet and Swin backbones reported in the paper.