Here: if I understand your code correctly, you use the FC7 layer output of a pretrained VGG net as input to your model. However, your model has another trainable layer to compute the embedding from FC7. Is that correct? Can't you just use FC7 as the embedding layer?
I am not sure what the authors were thinking, but from my point of view, there could be two reasons:

1. In Show and Tell, you can consider the image to be like the first word of a sentence. So you may want to embed the image feature vector once more so that it lives in the same semantic (text) space as the word embeddings.
2. The dimension of FC7 (4,096) is too large. Since the image vector and the semantic vectors need to have the same dimension, you would have to make the semantic (word-embedding) vectors 4,096-D as well, which leads to far too many weight parameters to train (see the sketch below).
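To make this concrete, here is a minimal PyTorch sketch of the idea (not taken from this repo; all names, dimensions, and the vocabulary size are hypothetical). It shows a trainable linear layer projecting the FC7 feature into the same space as the word embeddings, so the image can be fed to the decoder as the "first word" while the embedding dimension stays much smaller than 4,096:

```python
import torch
import torch.nn as nn

class CaptionInput(nn.Module):
    """Sketch: project FC7 features into the word-embedding space."""

    def __init__(self, fc7_dim=4096, embed_dim=512, vocab_size=10000):
        super().__init__()
        # Trainable projection: maps the 4,096-D image feature into the
        # same embed_dim space as the word embeddings (reason 1).
        self.img_embed = nn.Linear(fc7_dim, embed_dim)
        # Word embeddings stay embed_dim wide; without the projection they
        # would each have to be 4,096-D (reason 2).
        self.word_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, fc7_feats, captions):
        # fc7_feats: (batch, 4096); captions: (batch, seq_len) token ids
        img_token = self.img_embed(fc7_feats).unsqueeze(1)  # (batch, 1, embed_dim)
        word_tokens = self.word_embed(captions)             # (batch, seq_len, embed_dim)
        # Prepend the image "token" so the decoder sees it as step 0.
        return torch.cat([img_token, word_tokens], dim=1)
```

With `embed_dim=512`, the word-embedding table is `10,000 x 512` instead of `10,000 x 4,096`, which is roughly an eight-fold saving in parameters for that table alone.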