This code is mainly based on https://github.com/MILVLG/bottom-up-attention.pytorch; see the Requirements and Installation sections below. You need to install the specific Detectron2 version specified in that repository, as well as apex.
Additionally, you need to download the pre-trained model. The one used for the meme challenge is https://awma1-my.sharepoint.com/:u:/g/personal/yuz_l0_tn/EaXvCC3WjtlLvvEfLr3oa8UBLA21tcLh4L8YLbYXl6jgjg?e=SFMoeu. Place it in the same directory as this README.
Once everything is set up, run the following two commands to extract the features:
python extract_features.py --mode caffe --config-file configs/bua-caffe/extract-bua-caffe-r101-box-only.yaml --image-dir ../../data/img/ --out-dir ../../data/own_features_bbox/ --resume
python extract_features.py --mode caffe --config-file configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml --image-dir ../../data/img/ --gt-bbox-dir ../../data/own_features_bbox/ --out-dir ../../data/own_features_FasterRCNN/ --resume
The directory ../../data/own_features_bbox/ will contain the bounding boxes extracted by the Faster R-CNN model; it has to be created before running the job. The directory ../../data/own_features_FasterRCNN/ will contain the features extracted by the Faster R-CNN model and likewise has to be created beforehand.
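Both output directories must exist before launching the jobs; a minimal sketch for creating them (paths taken from the commands above):

```python
from pathlib import Path

# Create the output directories expected by the two extraction commands above.
for directory in ("../../data/own_features_bbox/", "../../data/own_features_FasterRCNN/"):
    Path(directory).mkdir(parents=True, exist_ok=True)
```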
Finally, after having extracted the features, you need to run convert_feature_export.py from the data folder of the meme challenge on the output directory:
python convert_feature_export.py --input_dir ../../data/own_features_FasterRCNN/ --output_dir ../../data/own_features/
This script converts the features extracted by the Faster R-CNN model into the same format as the MMF features. These final features should be used for training.
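If you want to sanity-check the converted features before training, a rough sketch such as the following can help. It assumes the converter writes one .npy file per image into the output directory; the actual layout is determined by convert_feature_export.py, so adjust accordingly:

```python
import glob

import numpy as np

# Hypothetical sanity check: list a few converted files and print their shapes.
# Assumes one .npy file per image in ../../data/own_features/; adjust if
# convert_feature_export.py uses a different layout.
for path in sorted(glob.glob("../../data/own_features/*.npy"))[:5]:
    arr = np.load(path, allow_pickle=True)
    print(path, getattr(arr, "shape", type(arr)))
```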
This repository contains a PyTorch reimplementation of the bottom-up-attention project based on Caffe.
We use Detectron2 as the backend to provide complete functionality, including training, testing, and feature extraction. Furthermore, we migrate the pre-trained Caffe-based model from the original repository, which extracts the same visual features as the original model (deviation < 0.01).
Some example object and attribute predictions for salient image regions are illustrated below. The script to obtain these visualizations can be found here.
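If you want to verify that features extracted with this reimplementation match a reference extraction (for example, to reproduce the deviation of less than 0.01 mentioned above), a minimal comparison sketch could look like this; it assumes both feature sets are available as NumPy arrays of the same shape, independent of how they were saved:

```python
import numpy as np

def max_deviation(ours: np.ndarray, reference: np.ndarray) -> float:
    """Largest absolute element-wise difference between two feature arrays."""
    assert ours.shape == reference.shape, "feature arrays must have the same shape"
    return float(np.abs(ours - reference).max())

# Hypothetical usage with two features saved as .npy files:
# ours = np.load("own_features/some_image.npy")
# reference = np.load("reference_features/some_image.npy")
# print("max deviation:", max_deviation(ours, reference))  # expected < 0.01
```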
Note that most of the requirements above are needed for Detectron2.
- Clone the project including the required version of Detectron2:
# clone the repository including Detectron2 (@5e2a6f6)
$ git clone --recursive https://github.com/MILVLG/bottom-up-attention.pytorch
- Install Detectron2:
$ cd detectron2
$ pip install -e .
Note that the latest version of Detectron2 is incompatible with our project and may result in runtime errors. Please use the recommended version of Detectron2 (@5e2a6f6), which is downloaded in step 1.
- Compile the rest of the tools using the following script:
# install apex
$ git clone https://github.com/NVIDIA/apex.git
$ cd apex
$ python setup.py install
$ cd ..
# install the remaining modules
$ python setup.py build develop
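After these steps, a quick import check (a minimal sketch, not part of the repository) can confirm that the pinned Detectron2 build, apex, and a CUDA-enabled PyTorch are visible from Python:

```python
# Minimal installation check (not part of the repository).
import torch
import detectron2
import apex  # importing is enough to confirm the apex build succeeded

print("torch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
print("detectron2:", detectron2.__version__)
```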
If you want to train or test the model, you need to download the images and annotation files of the Visual Genome (VG) dataset. If you only need to extract visual features using the pre-trained model, you can skip this part.
Download the original VG images (part1 and part2) and unzip them into the datasets folder.
The annotation files generated by the original repository need to be transformed into the COCO format required by Detectron2. The preprocessed annotation files can be downloaded here and unzipped to the datasets folder.
Finally, the datasets folder will have the following structure:
|-- datasets
|-- vg
| |-- image
| | |-- VG_100K
| | | |-- 2.jpg
| | | |-- ...
| | |-- VG_100K_2
| | | |-- 1.jpg
| | | |-- ...
| |-- annotations
| | |-- train.json
| | |-- val.json
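A small sketch (assuming the layout above) to verify that the dataset is in place before training:

```python
from pathlib import Path

# Check that the VG dataset matches the expected layout shown above.
root = Path("datasets/vg")
expected = [
    root / "image" / "VG_100K",
    root / "image" / "VG_100K_2",
    root / "annotations" / "train.json",
    root / "annotations" / "val.json",
]
for path in expected:
    print(f"{path}: {'OK' if path.exists() else 'MISSING'}")
```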
The following script will train a bottom-up-attention model on the train split of VG. We are still working on this part to reproduce the same results as the Caffe version.
$ python3 train_net.py --mode detectron2 \
--config-file configs/bua-caffe/train-bua-caffe-r101.yaml \
--resume
- `mode = {'caffe', 'detectron2'}` refers to the mode to use. Only the `detectron2` mode is supported for training, since we think it is unnecessary to train a new model using the `caffe` mode.
- `config-file` refers to all the configurations of the model.
- `resume` is a flag to resume training from a specific checkpoint.
Given the trained model, the following script will test the performance on the val split of VG:
$ python3 train_net.py --mode caffe \
--config-file configs/bua-caffe/test-bua-caffe-r101.yaml \
--eval-only --resume
- `mode = {'caffe', 'detectron2'}` refers to the mode to use. For the model converted from Caffe, use the `caffe` mode; for other models trained with Detectron2, use the `detectron2` mode.
- `config-file` refers to all the configurations of the model, including the path of the model weights.
- `eval-only` is a flag that declares the testing phase.
- `resume` is a flag that declares using the pre-trained model.
Similar to the testing stage, the following script will extract the bottom-up-attention visual features with the provided hyper-parameters:
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101.yaml \
--image-dir <image_dir> --gt-bbox-dir <out_dir> --out-dir <out_dir> --resume
- `mode = {'caffe', 'detectron2'}` refers to the mode to use. For the model converted from Caffe, use the `caffe` mode; for other models trained with Detectron2, use the `detectron2` mode.
- `config-file` refers to all the configurations of the model, including the path of the model weights.
- `image-dir` refers to the input image directory.
- `gt-bbox-dir` refers to the ground-truth bounding box directory.
- `out-dir` refers to the output feature directory (see the inspection sketch after this list).
- `resume` is a flag that declares using the pre-trained model.
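A quick way to sanity-check the extraction output is to open a few of the written files. The sketch below assumes extract_features.py saves one compressed .npz archive per image; the key names depend on the script and the chosen config, so it simply enumerates whatever is stored:

```python
import glob

import numpy as np

# Hypothetical inspection of the extraction output; assumes one .npz per image.
# Key names depend on extract_features.py and the chosen config, so we just
# enumerate whatever each archive contains.
for path in sorted(glob.glob("<out_dir>/*.npz"))[:3]:
    with np.load(path, allow_pickle=True) as data:
        print(path)
        for key in data.files:
            value = data[key]
            print(f"  {key}: shape={getattr(value, 'shape', None)}")
```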
Moreover, using the same pre-trained model, we provide a two-stage strategy for extracting visual features, which results in (slightly) more accurate visual features:
# extract bboxes only:
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml \
--image-dir <image_dir> --out-dir <out_dir> --resume
# extract visual features with the pre-extracted bboxes:
$ python3 extract_features.py --mode caffe \
--config-file configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml \
--image-dir <image_dir> --gt-bbox-dir <bbox_dir> --out-dir <out_dir> --resume
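Equivalently, the two stages can be chained from a small driver script. This is only a sketch, assuming it is run from the repository root with the placeholder directories filled in and already created:

```python
import subprocess

# Placeholder directories, as in the commands above.
image_dir, bbox_dir, out_dir = "<image_dir>", "<bbox_dir>", "<out_dir>"

# Stage 1: extract bounding boxes only.
subprocess.run([
    "python3", "extract_features.py", "--mode", "caffe",
    "--config-file", "configs/bua-caffe/extract-bua-caffe-r101-bbox-only.yaml",
    "--image-dir", image_dir, "--out-dir", bbox_dir, "--resume",
], check=True)

# Stage 2: extract visual features using the pre-extracted boxes.
subprocess.run([
    "python3", "extract_features.py", "--mode", "caffe",
    "--config-file", "configs/bua-caffe/extract-bua-caffe-r101-gt-bbox.yaml",
    "--image-dir", image_dir, "--gt-bbox-dir", bbox_dir,
    "--out-dir", out_dir, "--resume",
], check=True)
```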
We provide pre-trained models here. The evaluation metrics are exactly the same as those in the original Caffe project. More models will be added continuously.
| Model | Mode | Backbone | Objects mAP@0.5 | Objects weighted mAP@0.5 | Download |
| --- | --- | --- | --- | --- | --- |
| Faster R-CNN | Caffe, K=36 | ResNet-101 | 9.3% | 14.0% | model |
| Faster R-CNN | Caffe, K=[10,100] | ResNet-101 | 10.2% | 15.1% | model |
| Faster R-CNN | Caffe, K=100 | ResNet-152 | 11.1% | 15.7% | model |
This project is released under the Apache 2.0 license.
This repo is currently maintained by Jing Li (@J1mL3e_) and Zhou Yu (@yuzcccc).