Project of the 14th XingHuo Project Election
- python3
- torch
- torchvision
- nltk
- tqdm
- h5py
- transformers
usage: [-h] --data_root DATA_ROOT [--stage {train,test}] [--cuda] [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--lr LR] [--load LOAD] [--save_dir SAVE_DIR] [--save_freq SAVE_FREQ]
optional arguments:
  -h, --help            show this help message and exit
  --data_root DATA_ROOT
                        root directory of dataset
  --stage {train,test}  model stage
  --cuda                enable cuda
  --epochs EPOCHS       total epochs to train
  --batch_size BATCH_SIZE
                        mini-batch size (default: 64)
  --lr LR               learning rate
  --load LOAD           loading specific model checkpoint
  --save_dir SAVE_DIR   directory for saving model checkpoints
  --save_freq SAVE_FREQ
                        number of iterations between two saving actions
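For readers who want to see how such an interface might be declared, here is a minimal `argparse` sketch that reproduces the options listed above. It is not the project's actual code: the function name `build_parser` is hypothetical, and every default except `--batch_size` (documented as 64) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented CLI; defaults other than batch_size are assumed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_root", required=True,
                        help="root directory of dataset")
    parser.add_argument("--stage", choices=["train", "test"], default="train",
                        help="model stage")  # default is an assumption
    parser.add_argument("--cuda", action="store_true", help="enable cuda")
    parser.add_argument("--epochs", type=int, default=10,  # assumed default
                        help="total epochs to train")
    parser.add_argument("--batch_size", type=int, default=64,
                        help="mini-batch size (default: 64)")
    parser.add_argument("--lr", type=float, default=1e-3,  # assumed default
                        help="learning rate")
    parser.add_argument("--load", default=None,
                        help="loading specific model checkpoint")
    parser.add_argument("--save_dir", default="./checkpoints",  # assumed default
                        help="directory for saving model checkpoints")
    parser.add_argument("--save_freq", type=int, default=1000,  # assumed default
                        help="number of iterations between two saving actions")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["--data_root", "/data/VisualGenome", "--cuda"])
print(args.stage, args.batch_size, args.cuda)
```

Because `--data_root` is declared with `required=True`, omitting it makes `argparse` print the usage line and exit, which matches the usage string where it is the only option outside square brackets.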
# Examples
# Start training
python --data_root=/data/VisualGenome --stage=train --cuda
# Resume from an existing checkpoint
python --data_root=/data/VisualGenome --stage=train --cuda --load=./checkpoints/
This project uses the Visual Genome dataset, which we reorganized before use.
We split the data into training / development / test sets at a ratio of 80% / 10% / 10%.
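The split can be sketched as follows. This is only an illustration: the README does not say whether the split is made per image or per QA pair, nor how items are ordered, so the function name `split_ids`, the per-image split, and the seeded shuffle are all assumptions.

```python
import random
from typing import Dict, List, Sequence


def split_ids(image_ids: Sequence[str], seed: int = 0) -> Dict[str, List[str]]:
    """Shuffle ids reproducibly and cut them into 80% / 10% / 10% parts."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split stable
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(ids) * 8 // 10
    n_dev = len(ids) // 10
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],  # remainder goes to the test set
    }


splits = split_ids([str(i) for i in range(100)])
print({name: len(part) for name, part in splits.items()})
```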
The key data file, data.json, is structured as follows:
    {
      "train": {
        "images": {
          "[image_id]": {
            "index": "[index_in_image_data.json]",
            "path": "[local_image_path]",
            "desc": [
              "[description sentence 1]",
              "[description sentence 2]",
              ...
            ]
          },
          ...
        },
        "qas": [
          {
            "image_id": "[number]",
            "question": "[question?]",
            "answer": "[answer]"
          },
          ...
        ]
      },
      "dev": {...},
      "test": {...}
    }
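To make the layout concrete, here is a small sketch that reads data.json and pairs each question with its image path. It assumes that every split holds an "images" mapping keyed by image id and a sibling "qas" list whose entries carry an "image_id"; the helper names `load_data` and `iter_qa` are hypothetical and not part of the project.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def load_data(path: str) -> Dict[str, Any]:
    """Read data.json into a plain dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_qa(data: Dict[str, Any], split: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (image_path, question, answer) for every QA pair in a split."""
    part = data[split]
    images = part["images"]
    for qa in part["qas"]:
        # JSON object keys are always strings, so normalize the id first.
        image = images[str(qa["image_id"])]
        yield image["path"], qa["question"], qa["answer"]


# Tiny in-memory stand-in for data.json, for illustration only.
sample = {
    "train": {
        "images": {"1": {"index": "0", "path": "images/1.jpg", "desc": ["a dog"]}},
        "qas": [{"image_id": 1, "question": "What animal is it?", "answer": "dog"}],
    }
}
for path, question, answer in iter_qa(sample, "train"):
    print(path, question, answer)
```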
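The CNN we use for image feature extraction (ResNet152) consumes a large amount of GPU memory during training (which limits batch_size) and adds extra computation time. We therefore pre-extract the image features of the dataset (Visual Genome) and save them in HDF5 format under the data directory.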
# run `python scripts/ --help` for more information
python scripts/ --root=/path/to/your_data_root --cuda
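The idea behind the pre-extraction step can be sketched as below. This is not the project's script: it assumes torchvision 0.13 or later (for the `ResNet152_Weights` API), stores one 2048-dimensional pooled vector per image, and uses a hypothetical dataset name `features`. The actual script may keep spatial feature maps or use a different file layout.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def extract_features(image_paths: Sequence[str], out_path: str,
                     batch_size: int = 32, cuda: bool = False) -> None:
    """Write one pooled ResNet-152 feature vector per image to an HDF5 file."""
    # Heavy dependencies are imported here so `batches` stays usable without them.
    import h5py
    import torch
    from PIL import Image
    from torchvision import models

    weights = models.ResNet152_Weights.DEFAULT
    model = models.resnet152(weights=weights)
    model.fc = torch.nn.Identity()  # drop the classifier, keep 2048-d features
    device = torch.device("cuda" if cuda else "cpu")
    model = model.to(device).eval()
    preprocess = weights.transforms()  # resize, crop, normalize for these weights

    with h5py.File(out_path, "w") as f:
        # "features" is a hypothetical dataset name; row i belongs to image_paths[i].
        dset = f.create_dataset("features", shape=(len(image_paths), 2048),
                                dtype="float32")
        offset = 0
        with torch.no_grad():
            for chunk in batches(image_paths, batch_size):
                images = [preprocess(Image.open(p).convert("RGB")) for p in chunk]
                feats = model(torch.stack(images).to(device))
                dset[offset:offset + len(chunk)] = feats.cpu().numpy()
                offset += len(chunk)
```

With the features cached on disk, training only needs to read fixed-size vectors from the HDF5 file, so the CNN never has to be loaded onto the GPU alongside the VQA model.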
- Ben-Younes, Hedi, et al. "MUTAN: Multimodal Tucker fusion for visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2017.
- Antol, Stanislaw, et al. "VQA: Visual question answering." Proceedings of the IEEE International Conference on Computer Vision. 2015.
My advisor is Professor Xiaolin Hu.
This project is modified and extended from the "Introduction to Deep Learning" course project.
My teammates are Bohan Chen (@acyume) and Zhuoer Feng.