Repo for the paper "Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities"

sled-group/COMFORT

🛋️ Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities

Paper | Project Page | Hugging Face Dataset

COMFORT

This repository provides the code and instructions for COMFORT (COnsistent Multilingual Frame Of Reference Test), an evaluation protocol for systematically assessing the spatial reasoning capabilities of vision-language models (VLMs). Follow the steps below to set up the environment, generate data (optional), and run experiments. If you encounter any problems, feel free to open an issue. Pull requests are also welcome.

Table of Contents

  1. Setup Environment
  2. Prepare Data
  3. Add API Credentials
  4. Run Experiments
  5. Run Evaluations
  6. Evaluate More Models
  7. Common Problems and Solutions

Setup Environment

Clone the repository and create a conda environment using the provided environment.yml file:

git clone https://github.com/sled-group/COMFORT.git
cd COMFORT
conda env create -f environment.yml

After creating the environment:

conda activate comfort

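Then install the editable packages. The paths below are relative to the repository root, so each step returns to models/ before entering the next package:

cd models/GLAMM
pip install -e .
cd ../llava
pip install -e .
cd ../InternVL/internvl_chat
pip install -e .
cd ../../..

You can also use Poetry to set up the environment.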

Prepare Data

First, create a data directory:

mkdir data

Option 1: Download data from Hugging Face

wget "https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_ball.zip?download=true" -O data/comfort_ball.zip
unzip data/comfort_ball.zip -d data/
wget "https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_car_ref_facing_left.zip?download=true" -O data/comfort_car_ref_facing_left.zip
unzip data/comfort_car_ref_facing_left.zip -d data/
wget "https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/comfort_car_ref_facing_right.zip?download=true" -O data/comfort_car_ref_facing_right.zip
unzip data/comfort_car_ref_facing_right.zip -d data/
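
The three download-and-unzip pairs above follow the same pattern, so they can be wrapped in a loop. The sketch below is one way to do that. The helper names comfort_url and download_comfort are hypothetical and not part of this repository; the archive names and URL pattern are copied from the commands above.

```shell
# Archive names taken from the wget commands above.
COMFORT_ARCHIVES="comfort_ball comfort_car_ref_facing_left comfort_car_ref_facing_right"

# comfort_url NAME: print the download URL for one archive (hypothetical helper).
comfort_url() {
  printf '%s\n' "https://huggingface.co/datasets/sled-umich/COMFORT/resolve/main/$1.zip?download=true"
}

# download_comfort: fetch and unpack every archive into data/ (hypothetical helper).
# Stops at the first failed download or extraction.
download_comfort() {
  mkdir -p data
  for name in $COMFORT_ARCHIVES; do
    # The URL is quoted so the shell does not treat "?" as a glob character.
    wget "$(comfort_url "$name")" -O "data/$name.zip" || return 1
    unzip "data/$name.zip" -d data/ || return 1
  done
}
```

After sourcing these definitions in a POSIX-style shell such as bash, calling download_comfort from the repository root should leave the same files in data/ as the six commands above.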

Option 2: Generate the data

pip install gdown
python download_assets.py
chmod +x generate_dataset.sh
./generate_dataset.sh

Add API Credentials

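Create the API key file:

touch comfort_utils/model_utils/api_keys.py

  1. Prepare OpenAI and DeepL API keys and add the following to api_keys.py (the keys are Python strings, so keep the quotes):

    APIKEY_OPENAI = "<YOUR_API_KEY>"
    APIKEY_DEEPL = "<YOUR_API_KEY>"

  2. Prepare Google Cloud Translate API credentials (a .json file). Its path is exported as GOOGLE_APPLICATION_CREDENTIALS under Run Experiments.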

Run Experiments

./run_english_ball_experiments.sh
./run_english_car_left_experiments.sh
./run_english_car_right_experiments.sh

export GOOGLE_APPLICATION_CREDENTIALS="your_google_application_credentials_path.json"
./run_multilingual_ball_experiments.sh
./run_multilingual_car_left_experiments.sh
./run_multilingual_car_right_experiments.sh
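
Because the experiment scripts depend on the credentials described above, a quick check before launching them can save a failed run. The following is a minimal sketch, not part of this repository: check_comfort_credentials is a hypothetical helper, and it only verifies that the files exist, not that the keys are valid. It assumes it is run from the repository root.

```shell
# check_comfort_credentials: return 0 only if both credential files are in place
# (hypothetical helper; it does not validate the keys themselves).
check_comfort_credentials() {
  status=0
  # api_keys.py must exist and be non-empty.
  if [ ! -s comfort_utils/model_utils/api_keys.py ]; then
    echo "missing or empty: comfort_utils/model_utils/api_keys.py" >&2
    status=1
  fi
  # GOOGLE_APPLICATION_CREDENTIALS must be set and point to a readable file.
  if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ] || [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
    echo "GOOGLE_APPLICATION_CREDENTIALS does not point to a readable file" >&2
    status=1
  fi
  return "$status"
}
```

For example, check_comfort_credentials && ./run_multilingual_ball_experiments.sh starts the run only when both checks pass.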

Run Evaluations

English

  1. Preferred Coordinate Transformation (Table 2 & Table 7):
    python gather_results.py --mode cpp --cpp convention
  2. Preferred Frame of Reference (Table 3 & Table 8):
    python gather_results.py --mode cpp --cpp preferredfor
  3. Perspective Taking (Table 4 & Table 9):
    python gather_results.py --mode cpp --cpp perspective
  4. Comprehensive Evaluation (Table 5):
    python gather_results.py --mode comprehensive
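
The four English evaluations above can be run in one pass. This is a sketch under stated assumptions: run_english_evals is a hypothetical helper, not part of this repository, and the PYTHON variable is an illustrative convention for choosing the interpreter. The gather_results.py arguments are copied from the list above.

```shell
# run_english_evals: run the four English evaluations listed above in order,
# stopping at the first failure (hypothetical helper).
# Set PYTHON=echo to print the commands instead of running them.
run_english_evals() {
  py="${PYTHON:-python}"
  for cpp in convention preferredfor perspective; do
    "$py" gather_results.py --mode cpp --cpp "$cpp" || return 1
  done
  "$py" gather_results.py --mode comprehensive
}
```

Running PYTHON=echo run_english_evals first is a cheap way to confirm the commands before starting the real evaluation.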

Multilingual (Figure 8 & Table 10)

python gather_results_multilingual.py

After evaluation completes:

cd results/eval
python eval_multilingual_preferredfor_raw.py

Evaluate More Models

To evaluate additional models, refer to the Model Wrapper.

Common Problems and Solutions

  1. ImportError: libcupti.so.11.7: cannot open shared object file: No such file or directory
    pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118

Citation 🖋️

@misc{zhang2024visionlanguagemodelsrepresentspace,
       title={Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities}, 
       author={Zheyuan Zhang and Fengyuan Hu and Jayjun Lee and Freda Shi and Parisa Kordjamshidi and Joyce Chai and Ziqiao Ma},
       year={2024},
       eprint={2410.17385},
       archivePrefix={arXiv},
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2410.17385},
     }
