We sincerely thank the reviewers for their valuable comments.
R1: The model requires additional supervision for training the attribute and relation proposal models.
First, although other state-of-the-art models do not explicitly model attributes, the bottom-up features they use still require attribute supervision to train the object detector.
Second, although our semantic relationships indeed require additional supervision, our geometry relationships are computed from region positions and require no supervision. Note that even without the semantic relationships, our OAR^G model still significantly outperforms the Base model (improving CIDEr from 120.8 to 127.2).
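To make the "no supervision" point concrete, below is a minimal sketch of how pairwise geometry features can be computed purely from bounding-box coordinates. The particular feature set (relative center offsets, log size ratios, IoU) is an illustrative assumption, not necessarily the exact formulation in our paper.

```python
# Illustrative sketch: pairwise geometry features from two boxes given as
# (x1, y1, x2, y2). The chosen features (relative offsets, log size ratios,
# IoU) are an assumption for illustration; the point is that they depend
# only on region positions, so no extra annotation is needed.
import numpy as np

def geometry_features(box_a, box_b):
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    wa, ha = xa2 - xa1, ya2 - ya1
    wb, hb = xb2 - xb1, yb2 - yb1
    cxa, cya = xa1 + wa / 2, ya1 + ha / 2
    cxb, cyb = xb1 + wb / 2, yb1 + hb / 2

    # Relative center offsets, normalized by the first box's size.
    dx = (cxb - cxa) / wa
    dy = (cyb - cya) / ha
    # Log ratios of widths and heights.
    dw = np.log(wb / wa)
    dh = np.log(hb / ha)

    # Intersection-over-union of the two boxes.
    ix = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    iy = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    iou = inter / union

    return np.array([dx, dy, dw, dh, iou], dtype=np.float32)

# Example: two overlapping regions.
print(geometry_features((10, 10, 60, 60), (30, 20, 90, 80)))
```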
R1: Which network does "CNN is kept fixed and not finetuned" refer to?
It refers to Faster R-CNN. As stated in our paper, "We use the pre-trained Faster R-CNN model provided by [2]", and we keep it fixed during training.
R1: The accuracy of attribute and relationship detectors
On the Visual Genome dataset, our attribute classifier achieves a top-1 accuracy of 80.5%, and our relationship detector achieves a recall@20 of 51.6%.
R2: How to define better or worse results given the 7(4) criteria.
Among the five metrics, BLEU, METEOR, and ROUGE come from machine translation and document summarization, while CIDEr and SPICE are designed specifically for image captioning. CIDEr is also the default sorting metric on the COCO leaderboard. Accordingly, we consider all metrics together when judging a model's performance and pay more attention to CIDEr when two models are otherwise comparable.
In Table-1(d) of our paper, removing geometry cues from the object features (while keeping the geometry relationships) increases ROUGE-L by 0.2 points but decreases CIDEr by 0.5 points. We regard this as slightly worse overall, given the larger drop in CIDEr.
R3: The results of training with cross-entropy loss.
The results of training Base and VSUA with cross-entropy loss (Base-XE and VSUA-XE) are shown in Table 1.
Table 1: Results on the MS-COCO Karpathy test split.
| Metrics | BLEU@4 | METEOR | ROUGE-L | CIDEr | SPICE |
| --- | --- | --- | --- | --- | --- |
| Base-XE | 36.1 | 27.0 | 56.5 | 113.3 | 20.3 |
| Base | 36.7 | 27.9 | 57.1 | 120.8 | 20.9 |
| VSUA-XE | 36.7 | 27.8 | 57.0 | 115.8 | 20.8 |
| VSUA | 38.4 | 28.5 | 58.4 | 128.6 | 22.0 |
| VSUA+Res152 | 38.6 | 28.5 | 58.4 | 129.3 | 22.1 |
R3, R4: Releasing code?
We will release the code and pre-processed data upon paper acceptance.
R4: How much slower does it run compared to the Base model?
The mean inference times of Base and VSUA are 13.4 ms and 19.7 ms, respectively.
R4: Ensemble results for Base and VSUA.
For both Base and VSUA, we train five models with different random seeds and sort them by CIDEr in descending order. The CIDEr scores of ensembling the top 1->2->3->4->5 models are 120.8->121.2->121.4->121.8->122.1 for Base and 128.7->129.0->129.4->129.6->129.6 for VSUA, respectively.
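For reference, one common way to ensemble caption models is to average their per-step word distributions during decoding; the sketch below shows this scheme in isolation. It is an illustrative assumption about how such an ensemble can be formed, not a verbatim description of our implementation.

```python
# Hedged sketch: combine several caption models by averaging their
# per-step word distributions. This is a common ensembling scheme,
# assumed here for illustration.
import torch

def ensemble_next_word(step_logits):
    """Pick the next word from averaged per-model distributions.

    step_logits: list of (vocab_size,) logit tensors, one per model,
    all computed for the same decoding step.
    """
    probs = torch.stack([torch.softmax(l, dim=-1) for l in step_logits])
    avg = probs.mean(dim=0)          # average the distributions
    return int(avg.argmax())         # greedy choice under the ensemble

# Toy example with three "models" over a vocabulary of 5 words.
logits = [torch.randn(5) for _ in range(3)]
print(ensemble_next_word(logits))
```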
R4: Related works that use GCNs...
We will cite and discuss them in the revision.
R4: Will adding additional grid-level features (e.g., ResNet-152) help?
We add ResNet-152 features as additional inputs to our Context Gated Attention Fusion module, denoted as VSUA+Res152. As shown in Table 1, VSUA+Res152 outperforms VSUA by 0.7 CIDEr.
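As a rough illustration of how an extra grid-level feature can be gated into the decoder, the snippet below fuses two attended context vectors with a learned sigmoid gate. It is a simplified stand-in with assumed names and dimensions (`GatedFusion`, `d_model`), not the exact Context Gated Attention Fusion module from the paper.

```python
# Simplified sketch of gating two attended context vectors, e.g. one from
# region features and one from grid-level ResNet-152 features. This is an
# illustrative module, not the paper's exact fusion design.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, d_model=1024):
        super().__init__()
        # Sigmoid gate computed from the concatenated contexts.
        self.gate = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, region_ctx, grid_ctx):
        # region_ctx, grid_ctx: (batch, d_model) attended context vectors.
        g = self.gate(torch.cat([region_ctx, grid_ctx], dim=-1))
        return g * region_ctx + (1.0 - g) * grid_ctx

fuse = GatedFusion(d_model=1024)
out = fuse(torch.randn(2, 1024), torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 1024])
```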
R4: Why is OA doing better than Base?
In the OA, O, and A models, the higher-level semantic information of each region (its objects and attributes) is explicitly considered, and this information is less noisy than raw image features.