🛋️ Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities
This repository provides the code and instructions for the COnsistent Multilingual Frame Of Reference Test (COMFORT), an evaluation protocol for systematically assessing the spatial reasoning capabilities of vision-language models (VLMs). Follow the steps below to set up the environment, generate data (optional), and run experiments. Feel free to open an issue if you encounter any problems. We also welcome pull requests.
- Setup Environment
- Prepare Data
- Add API Credentials
- Run Experiments
- Run Evaluations
- Evaluate More Models
- Common Problems and Solutions
Clone the repository and create a conda environment using the provided environment.yml file:
git clone https://github.com/sled-group/COMFORT.git
cd comfort_utils
conda env create -f environment.yml
After creating the environment:
conda activate comfort
Then, install editable packages:
cd models/GLAMM
pip install -e .
cd models/llava
pip install -e .
cd models/InternVL/internvl_chat
pip install -e .
You can also use Poetry to set up the environment.
First, create a data directory, then download and unzip the dataset archives:
mkdir data
wget "https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_ball.zip?download=true" -O data/comfort_ball.zip
unzip data/comfort_ball.zip -d data/
wget "https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_car_ref_facing_left.zip?download=true" -O data/comfort_car_ref_facing_left.zip
unzip data/comfort_car_ref_facing_left.zip -d data/
wget "https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_car_ref_facing_right.zip?download=true" -O data/comfort_car_ref_facing_right.zip
unzip data/comfort_car_ref_facing_right.zip -d data/
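The download steps above can also be scripted. The following is a minimal sketch, assuming only the Python standard library; the helper names `archive_url`, `download_and_extract`, and `main` are hypothetical and not part of this repository.

```python
import urllib.request
import zipfile
from pathlib import Path

# Base URL and archive names are taken from the wget commands above.
BASE_URL = "https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main"
ARCHIVES = [
    "comfort_ball",
    "comfort_car_ref_facing_left",
    "comfort_car_ref_facing_right",
]


def archive_url(name: str) -> str:
    """Build the download URL for one dataset archive (hypothetical helper)."""
    return f"{BASE_URL}/{name}.zip?download=true"


def download_and_extract(name: str, data_dir: Path = Path("data")) -> Path:
    """Download one archive into data_dir and unzip it there."""
    data_dir.mkdir(parents=True, exist_ok=True)
    zip_path = data_dir / f"{name}.zip"
    urllib.request.urlretrieve(archive_url(name), zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
    return zip_path


def main() -> None:
    """Fetch all three archives; requires network access."""
    for name in ARCHIVES:
        print("fetched", download_and_extract(name))
```

Calling `main()` from the repository root should leave the same files under `data/` as the `wget` and `unzip` commands.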
Optionally, to generate the data yourself, download the assets and run the generation script:
pip install gdown
python download_assets.py
chmod +x generate_dataset.sh
./generate_dataset.sh
touch comfort_utils/model_utils/api_keys.py
- Prepare OpenAI and DeepL API keys and add the following to api_keys.py:
APIKEY_OPENAI = <YOUR_API_KEY>
APIKEY_DEEPL = <YOUR_API_KEY>
- Prepare Google Cloud Translate API credentials (.json).
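If you prefer not to hard-code keys, `api_keys.py` could read them from environment variables instead. This is only a sketch: the constant names `APIKEY_OPENAI` and `APIKEY_DEEPL` come from the instructions above, while the environment variable names `OPENAI_API_KEY` and `DEEPL_API_KEY` and the helpers `load_keys` and `missing_keys` are assumptions made for illustration.

```python
import os
from typing import Dict, List, Mapping

# Maps each constant expected in api_keys.py to an ASSUMED environment variable.
ENV_VARS = {
    "APIKEY_OPENAI": "OPENAI_API_KEY",
    "APIKEY_DEEPL": "DEEPL_API_KEY",
}


def load_keys(env: Mapping[str, str]) -> Dict[str, str]:
    """Read each key from env, using an empty string when it is unset."""
    return {const: env.get(var, "").strip() for const, var in ENV_VARS.items()}


def missing_keys(keys: Mapping[str, str]) -> List[str]:
    """Return the sorted names of constants whose value is empty."""
    return sorted(name for name, value in keys.items() if not value)


_keys = load_keys(os.environ)
APIKEY_OPENAI = _keys["APIKEY_OPENAI"]
APIKEY_DEEPL = _keys["APIKEY_DEEPL"]
```

Checking `missing_keys(_keys)` before starting a long run is one way to catch an unset key early.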
./run_english_ball_experiments.sh
./run_english_car_left_experiments.sh
./run_english_car_right_experiments.sh
export GOOGLE_APPLICATION_CREDENTIALS="your_google_application_credentials_path.json"
./run_multilingual_ball_experiments.sh
./run_multilingual_car_left_experiments.sh
./run_multilingual_car_right_experiments.sh
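The experiment scripts above can be chained from one Python driver. The sketch below assumes it is run from the directory containing the scripts; `plan_runs` and `run_all` are hypothetical helpers, and the only check performed is that `GOOGLE_APPLICATION_CREDENTIALS` is set before any multilingual script is scheduled.

```python
import subprocess
from typing import List, Mapping

# Script names are copied from the commands above.
ENGLISH = [
    "./run_english_ball_experiments.sh",
    "./run_english_car_left_experiments.sh",
    "./run_english_car_right_experiments.sh",
]
MULTILINGUAL = [
    "./run_multilingual_ball_experiments.sh",
    "./run_multilingual_car_left_experiments.sh",
    "./run_multilingual_car_right_experiments.sh",
]


def plan_runs(env: Mapping[str, str], multilingual: bool = True) -> List[str]:
    """Return the scripts to run, in order, without executing anything."""
    scripts = list(ENGLISH)
    if multilingual:
        if not env.get("GOOGLE_APPLICATION_CREDENTIALS"):
            raise RuntimeError(
                "GOOGLE_APPLICATION_CREDENTIALS must be set for multilingual runs"
            )
        scripts += MULTILINGUAL
    return scripts


def run_all(scripts: List[str]) -> None:
    """Run each script in turn, stopping at the first failure."""
    for script in scripts:
        subprocess.run([script], check=True)
```

For example, `run_all(plan_runs(os.environ))` (after `import os`) would run all six scripts, and `plan_runs(os.environ, multilingual=False)` restricts the plan to the English experiments.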
- Preferred Coordinate Transformation (Table 2 & Table 7):
python gather_results.py --mode cpp --cpp convention
- Preferred Frame of Reference (Table 3 & Table 8):
python gather_results.py --mode cpp --cpp preferredfor
- Perspective Taking (Table 4 & Table 9):
python gather_results.py --mode cpp --cpp perspective
- Comprehensive Evaluation (Table 5):
python gather_results.py --mode comprehensive
python gather_results_multilingual.py
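To run every evaluation listed above in one pass, the commands can be assembled programmatically. This is a sketch under the assumption that it is launched from the directory containing `gather_results.py`; the function names `evaluation_commands` and `run_evaluations` are hypothetical, and only the flags shown above are used.

```python
import subprocess
import sys
from typing import List

# Values accepted by --cpp, as listed in the commands above.
CPP_METRICS = ["convention", "preferredfor", "perspective"]


def evaluation_commands() -> List[List[str]]:
    """Build the argument lists for all evaluation commands, in README order."""
    commands = [
        [sys.executable, "gather_results.py", "--mode", "cpp", "--cpp", metric]
        for metric in CPP_METRICS
    ]
    commands.append([sys.executable, "gather_results.py", "--mode", "comprehensive"])
    commands.append([sys.executable, "gather_results_multilingual.py"])
    return commands


def run_evaluations() -> None:
    """Execute each evaluation command, stopping at the first failure."""
    for command in evaluation_commands():
        subprocess.run(command, check=True)
```

Using `sys.executable` keeps the evaluations in the same interpreter (and therefore the same conda environment) as the driver.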
After evaluation completes:
cd results/eval
python eval_multilingual_preferredfor_raw.py
To evaluate additional models, refer to the Model Wrapper.
- ImportError: libcupti.so.11.7: cannot open shared object file: No such file or directory
pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
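After reinstalling, it can help to confirm that the installed versions match the pins in the command above. The sketch below uses `importlib.metadata` from the standard library; the helper names are hypothetical, and treating a pin without a local tag (such as `2.0.1`) as matching any build of that version is an assumption of this sketch.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

# Pinned versions copied from the pip command above.
PINS = {
    "torch": "2.0.0+cu118",
    "torchvision": "0.15.1+cu118",
    "torchaudio": "2.0.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(installed: Optional[str], pinned: str) -> bool:
    """Compare exactly when the pin has a local tag, else ignore the local tag."""
    if installed is None:
        return False
    if "+" in pinned:
        return installed == pinned
    return installed.split("+")[0] == pinned


def report() -> Dict[str, bool]:
    """Map each pinned package to whether the installed version matches."""
    return {pkg: matches_pin(installed_version(pkg), pin) for pkg, pin in PINS.items()}
```

A `False` entry in `report()` points at the package to reinstall.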
@misc{zhang2024visionlanguagemodelsrepresentspace,
title={Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities},
author={Zheyuan Zhang and Fengyuan Hu and Jayjun Lee and Freda Shi and Parisa Kordjamshidi and Joyce Chai and Ziqiao Ma},
year={2024},
eprint={2410.17385},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.17385},
}