This repository will hold the official code of SelTDA, the self-training framework introduced in our CVPR 2023 paper "Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!".
conda env create -f environment.yaml
- PathVQA
  - then use convert_pathvqa.py
- RSVQA
  - then use convert_rsvqa.py
- OK-VQA and A-OKVQA (use LAVIS)
  - LAVIS should automatically put them in the correct format, but if not, you can use convert_okvqa.py
- VQA Counterexamples
  - then use convert_vqa_ce.py
- AdVQA
  - then use convert_advqa.py
- VQA Rephrasings
  - then use convert_vqa_rephrasings.py
In general, the code expects each VQA dataset to be represented by a single JSON file whose top-level value is a list of dictionaries, one per example. In schemas.py, we provide Pydantic models that you can use to define your own datasets or to verify that your data is in the correct format.
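Before validating against the Pydantic models, it can help to confirm the top-level shape described above. The following is a minimal sketch using only the Python standard library. It is not part of this repository: the function name `check_vqa_json` is hypothetical, and it makes no assumption about which fields the models in schemas.py require. It only checks that the file holds a list of dictionaries and reports how often each key appears, so that records with missing keys are easy to spot.

```python
import json
from collections import Counter
from pathlib import Path


def check_vqa_json(path):
    """Check that `path` holds a JSON list of dictionaries.

    Returns (number_of_records, {key: number_of_records_containing_key}).
    Hypothetical helper; field-level validation is left to schemas.py.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)

    if not isinstance(data, list):
        raise ValueError(
            f"expected a top-level list, got {type(data).__name__}"
        )

    key_counts = Counter()
    for i, record in enumerate(data):
        if not isinstance(record, dict):
            raise ValueError(
                f"record {i} is a {type(record).__name__}, expected a dictionary"
            )
        # Count each key once per record it appears in.
        key_counts.update(record.keys())

    return len(data), dict(key_counts)
```

If a key's count is lower than the number of records, some records lack that key, which is worth fixing before running the converted dataset through the models in schemas.py.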
See the examples/ directory for examples of:
- training the teacher: examples/train_teacher.sh
- generating synthetic data with the teacher: examples/generate_synthetic_data.sh
- self-training with the synthetic data: examples/self_train_synthetic.sh
- evaluations: examples/evaluate.sh
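The four stages above are listed in the order in which they are normally run. As a rough sketch of that ordering, the Python driver below invokes each script in turn and stops at the first failure. It is not part of this repository and rests on assumptions: that each script can be launched with `bash` from the repository root without extra arguments, and that any paths inside the scripts have already been edited for your setup. The names `STAGES` and `run_pipeline` are hypothetical.

```python
import subprocess

# Stage order follows the list above; script paths are those in examples/.
STAGES = [
    ("train teacher", "examples/train_teacher.sh"),
    ("generate synthetic data", "examples/generate_synthetic_data.sh"),
    ("self-train", "examples/self_train_synthetic.sh"),
    ("evaluate", "examples/evaluate.sh"),
]


def run_pipeline(repo_root=".", dry_run=True):
    """Print each stage's command and, unless dry_run, execute it.

    Assumes (not confirmed by the repository) that the scripts take no
    arguments. Returns the list of commands in execution order.
    """
    commands = []
    for name, script in STAGES:
        cmd = ["bash", script]
        commands.append(cmd)
        print(f"[{name}] {' '.join(cmd)}")
        if not dry_run:
            # check=True raises CalledProcessError, so later stages do not
            # run on top of a failed earlier stage.
            subprocess.run(cmd, cwd=repo_root, check=True)
    return commands
```

Calling `run_pipeline()` with the default `dry_run=True` only prints the commands, which is a cheap way to review the order before starting a long training run.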
@InProceedings{Khan_2023_CVPR,
author = {Khan, Zaid and BG, Vijay Kumar and Schulter, Samuel and Yu, Xiang and Fu, Yun and Chandraker, Manmohan},
title = {Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {15005-15015}
}
This code is heavily based on salesforce/BLIP.