This is the code for our paper "Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering". Bilaterally Slimmable Transformer (BST) is a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models to train a single model once and then obtain slimmed submodels of various widths and depths. In this repo, we integrate the proposed BST framework with the MCAN model and train it on the VQA-v2 and GQA datasets. The code is based on the OpenVQA repo.
Please follow the openvqa documentation to install the required environment.
- Image Features
Please follow the openvqa documentation to prepare the image features for VQA-v2 and GQA datasets.
- QA Annotations
Please download our re-split QA annotations for the VQA-v2 and GQA datasets and unzip them into the directories specified below.
```bash
# for VQA-v2 QA annotations
$ unzip -d ./data/vqa/raw/ vqav2-data.zip

# for GQA QA annotations
$ unzip -d ./data/gqa/raw/ gqa-data.zip
```
Please download the VQA-v2 and GQA teacher model weights for BST training and put them in the `./ckpts/teacher_model_weights/` directory.
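A minimal placement sketch, assuming the downloaded files keep the names shown in the directory tree below (adjust the filenames if yours differ):

```bash
# create the target directory and move the downloaded teacher weights into it
$ mkdir -p ./ckpts/teacher_model_weights
$ mv vqav2_teacher_epoch14.pkl gqa_teacher_epoch11.pkl ./ckpts/teacher_model_weights/
```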
After preparing the datasets and teacher model weights, the project directory structure should look like this:
```
|-- data
|   |-- vqa
|   |   |-- feats
|   |   |   |-- train2014
|   |   |   |-- val2014
|   |   |   |-- test2015
|   |   |-- raw
|   |-- gqa
|   |   |-- feats
|   |   |   |-- gqa-frcn
|   |   |-- raw
|-- ckpts
|   |-- teacher_model_weights
|   |   |-- vqav2_teacher_epoch14.pkl
|   |   |-- gqa_teacher_epoch11.pkl
```
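As an optional sanity check, you can list the prepared directories to confirm the layout:

```bash
# list the feature, annotation, and teacher-weight directories prepared above
$ ls ./data/vqa/feats ./data/vqa/raw ./data/gqa/feats ./data/gqa/raw ./ckpts/teacher_model_weights
```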
Note that if you only want to run experiments on one specific dataset, you can prepare only that dataset and skip the rest. For example, to run experiments on VQA-v2 only, you just need to prepare the VQA-v2 dataset and the VQA-v2 teacher model weights.
The following scripts start training an `mcan_bst` model:
```bash
# train on the VQA-v2 dataset
$ python3 run.py --RUN='train' --DATASET='vqa' --SPLIT='train+vg' --GPU=<str> --VERSION=<str>

# train on the GQA dataset
$ python3 run.py --RUN='train' --DATASET='gqa' --SPLIT='train+val' --GPU=<str> --VERSION=<str>
```
- `--GPU`, e.g., `--GPU='0'`, to train the model on the specified GPU device.
- `--VERSION`, e.g., `--VERSION='your_version_name'`, to assign a name to this experiment.
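For example, a concrete VQA-v2 training run on GPU 0 might look like this (the version name below is just an illustrative label):

```bash
# example VQA-v2 training run; 'mcan_bst_vqa' is an arbitrary experiment name
$ python3 run.py --RUN='train' --DATASET='vqa' --SPLIT='train+vg' --GPU='0' --VERSION='mcan_bst_vqa'
```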
All checkpoint files will be saved to `ckpts/ckpt_<VERSION>/epoch<EPOCH_NUMBER>.pkl`, and the training log file will be placed at `results/log/log_run_<VERSION>.txt`.
For VQA-v2, if you want to evaluate on the test-dev set, please run the following script first to generate the result file:
```bash
$ python3 run.py --RUN='test' --MODEL='mcan_bst' --DATASET='vqa' --CKPT_V='your_vqa_version_name' --GPU='0' --CKPT_E=<int> --WIDTH=<str> --DEPTH=<str>
```
- `--CKPT_E`, e.g., `--CKPT_E=15`, to specify the epoch number (usually the last epoch is the best).
- `--WIDTH`, e.g., `--WIDTH='1'`, to specify the submodel width for inference; the candidate width set is `{'1/4', '1/2', '3/4', '1'}`.
- `--DEPTH`, e.g., `--DEPTH='1'`, to specify the submodel depth for inference; the candidate depth set is `{'1/6', '1/3', '2/3', '1'}`.
If `--WIDTH` or `--DEPTH` is not specified, inference will be performed on all submodels. The result file will be saved at `results/result_test/result_run_<CKPT_V>_<WIDTH>_<DEPTH>_<CKPT_E>.json`; upload it to the VQA Challenge page to get the scores on the test-dev set.
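For example, to generate the result file for the (1/2 width, 1/3 depth) submodel (the version name and epoch number below are placeholders for your own run):

```bash
# example: evaluate the (1/2 width, 1/3 depth) submodel on VQA-v2 test;
# the version name and epoch number are placeholders for your own run
$ python3 run.py --RUN='test' --MODEL='mcan_bst' --DATASET='vqa' --CKPT_V='your_vqa_version_name' --GPU='0' --CKPT_E=13 --WIDTH='1/2' --DEPTH='1/3'
```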
For GQA, you can evaluate directly on your local machine to get the scores on the test-dev set; the script is:
```bash
$ python3 run.py --RUN='val' --MODEL='mcan_bst' --DATASET='gqa' --CKPT_V='your_gqa_version_name' --GPU='0' --CKPT_E=<int> --WIDTH=<str> --DEPTH=<str>
```
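For example, to evaluate the full-width, full-depth submodel (again, the version name and epoch number are placeholders for your own run):

```bash
# example: evaluate the full (width = 1, depth = 1) submodel on GQA test-dev;
# the version name and epoch number are placeholders for your own run
$ python3 run.py --RUN='val' --MODEL='mcan_bst' --DATASET='gqa' --CKPT_V='your_gqa_version_name' --GPU='0' --CKPT_E=11 --WIDTH='1' --DEPTH='1'
```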
We also provide checkpoint models trained on the VQA-v2 and GQA datasets to reproduce the following results on test-dev using the testing scripts above.
Model | VQA-v2 (ckpt) | GQA (ckpt)
---|---|---
MCAN_BST (D, L) | 71.04 | 58.37
MCAN_BST (1/2D, L) | 70.48 | 57.78
MCAN_BST (1/2D, 1/3L) | 69.53 | 57.34
MCAN_BST (1/4D, 1/3L) | 68.13 | 56.69
This project is released under the Apache 2.0 license.
If you use this code in your research, please cite our paper:
```
@article{yu2023bst,
  title={Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering},
  author={Yu, Zhou and Jin, Zitian and Yu, Jun and Xu, Mingliang and Wang, Hongbo and Fan, Jianping},
  journal={IEEE Transactions on Multimedia},
  doi={10.1109/TMM.2023.3254205},
  year={2023}
}
```