Here: if I understand your code correctly, you use the FC7 layer output of a pretrained VGG net as input to your model. However, your model has another trainable layer to compute the embedding from FC7. Is that correct? Can't you just use FC7 as the embedding layer?
I am not sure what the authors were thinking, but from my point of view, there could be two reasons:

1. In Show and Tell, you can consider the image to be like the first word of a sentence. So you may want to embed the image feature vector once more so that it lives in the same semantic (text) space as the word embeddings.
2. The dimension of FC7 (4,096) is too large. Since the image vector and the semantic vectors need to have the same dimension, you would have to make the semantic (word-embedding) vectors 4,096-D as well, which leads to far too many weight parameters to train (see the sketch below).
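To make this concrete, here is a minimal PyTorch sketch of the idea (not taken from this repo; all names, dimensions, and the vocabulary size are hypothetical). It shows a trainable linear layer projecting the FC7 feature into the same space as the word embeddings, so the image can be fed to the decoder as the "first word" while the embedding dimension stays much smaller than 4,096:

```python
import torch
import torch.nn as nn

class CaptionInput(nn.Module):
    """Sketch: project FC7 features into the word-embedding space."""

    def __init__(self, fc7_dim=4096, embed_dim=512, vocab_size=10000):
        super().__init__()
        # Trainable projection: maps the 4,096-D image feature into the
        # same embed_dim space as the word embeddings (reason 1).
        self.img_embed = nn.Linear(fc7_dim, embed_dim)
        # Word embeddings stay embed_dim wide; without the projection they
        # would each have to be 4,096-D (reason 2).
        self.word_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, fc7_feats, captions):
        # fc7_feats: (batch, 4096); captions: (batch, seq_len) token ids
        img_token = self.img_embed(fc7_feats).unsqueeze(1)  # (batch, 1, embed_dim)
        word_tokens = self.word_embed(captions)             # (batch, seq_len, embed_dim)
        # Prepend the image "token" so the decoder sees it as step 0.
        return torch.cat([img_token, word_tokens], dim=1)
```

With `embed_dim=512`, the word-embedding table is `10,000 x 512` instead of `10,000 x 4,096`, which is roughly an eight-fold saving in parameters for that table alone.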