Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
Dyah Adila, Shuai Zhang, Boran Han, Bernie Wang
- We propose SteerFair, an unsupervised inference-time activation steering algorithm to mitigate foundation model bias.
- We demonstrate that SteerFair effectively mitigates the instability of question-answering performance with respect to option ordering. Furthermore, the bias directions identified by SteerFair generalize across datasets that share the same task.
- Extensive experiments on three instruction-tuned models show improved accuracy, reducing performance variability by 10.86% accuracy points across three datasets.
Required packages:
- baukit
- transformers, PyTorch, numpy, tqdm, sklearn, PIL
- einops
- To use the LLaVA model, follow the installation instructions in the original repository: LLaVA repo
Generate cyclic permutation versions of the options. For example, with n_option = 3:
Original options:
(A) Apple (B) Banana (C) Cherry
Cyclic permutations:
(A) Cherry (B) Apple (C) Banana
(A) Banana (B) Cherry (C) Apple
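For reference, a minimal Python sketch of this permutation logic (the function name and option format are illustrative, not the repo's actual implementation):

```python
def cyclic_permutations(options):
    """Yield all non-identity cyclic rotations of a list of option texts."""
    n = len(options)
    for shift in range(1, n):
        # Rotate the option texts while keeping the (A), (B), ... labels fixed.
        yield [options[(i - shift) % n] for i in range(n)]

options = ["Apple", "Banana", "Cherry"]
for perm in cyclic_permutations(options):
    print(" ".join(f"({chr(65 + i)}) {text}" for i, text in enumerate(perm)))
# (A) Cherry (B) Apple (C) Banana
# (A) Banana (B) Cherry (C) Apple
```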
Run the following to generate prompt files with permuted answers:
mkdir ~/ScienceQA/data/scienceqa/debias_baseline
cd llava/scripts
python debias_mcq_baseline.py baseline_attack --base-dir [YOUR_DATASET_PATH] --split train --n-option {n_option}
Convert the VG-Relation demonstration set into yes/no question format:
cd llava/vg_relation
python convert_vgr_to_yesno.py --file-dir [YOUR_DEMONSTRATION_SET_PATH] --split train
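For intuition, here is a hypothetical sketch of what such a conversion might look like; the actual prompt format produced by convert_vgr_to_yesno.py may differ:

```python
def relation_to_yesno(subject, predicate, obj):
    """Turn a VG-Relation triple into a pair of yes/no prompts (illustrative only)."""
    return [
        (f"Is the {subject} {predicate} the {obj}? Answer yes or no.", "yes"),
        # Swapping subject and object yields a statement whose answer is "no".
        (f"Is the {obj} {predicate} the {subject}? Answer yes or no.", "no"),
    ]

print(relation_to_yesno("dog", "chasing", "ball"))
```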
Save attention head values
cd llava/pca_editing/vgr
python get_head_values.py --base-dir [YOUR_DATASET_PATH] --split train
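Under the hood, this step caches per-head attention activations. Below is a hedged sketch using baukit's TraceDict; the module names assume a LLaMA-style language backbone (as in LLaVA), and the function name and signature are illustrative rather than the repo's actual code:

```python
import numpy as np
import torch
from baukit import TraceDict

def collect_head_values(model, tokenizer, prompts, num_heads, device="cuda"):
    layer_names = [f"model.layers.{i}.self_attn.o_proj"
                   for i in range(model.config.num_hidden_layers)]
    all_values = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        # retain_input captures the concatenated head outputs *before* o_proj.
        with torch.no_grad(), TraceDict(model, layer_names, retain_input=True) as td:
            model(**inputs)
        per_layer = [td[name].input[0, -1].reshape(num_heads, -1).cpu().numpy()
                     for name in layer_names]       # last-token activation per layer
        all_values.append(np.stack(per_layer))      # (layers, heads, head_dim)
    return np.stack(all_values)                     # (prompts, layers, heads, head_dim)
```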
Identify bias directions from bias demonstrations
cd llava/pca_editing
python get_pca_direction.py --head-values-dir [YOUR SAVED HEAD VALUES DIR] --save-dir [YOUR DESTINATION SAVE DIRECTORY]
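Conceptually, the bias direction of each head is taken as the top principal component of that head's activations over the bias demonstrations. A minimal sketch with scikit-learn (names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def bias_directions(head_values):
    """head_values: (n_prompts, n_layers, n_heads, head_dim) from the previous step."""
    _, n_layers, n_heads, head_dim = head_values.shape
    directions = np.zeros((n_layers, n_heads, head_dim))
    for l in range(n_layers):
        for h in range(n_heads):
            # The top principal component of this head's activations over the
            # bias demonstrations serves as its candidate bias direction.
            directions[l, h] = PCA(n_components=1).fit(head_values[:, l, h]).components_[0]
    return directions
```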
Apply steering
cd llava/pca_editing/vgr
bash run_steering.sh # for the no-tuning version
bash run_steering_selected_heads.sh # for the version with tuned head selection
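At a high level, steering removes the component of each head's activation along its bias direction during the forward pass. The sketch below uses plain PyTorch pre-hooks at the o_proj input (matching where the head values were captured in the earlier sketch); the hook placement and the scaling factor alpha are assumptions, so see the scripts above for the repo's actual settings:

```python
import torch

def register_steering_hooks(model, directions, num_heads, alpha=1.0):
    """directions: (n_layers, n_heads, head_dim) array of bias directions."""
    handles = []
    for i, layer in enumerate(model.model.layers):
        def pre_hook(module, args, i=i):
            (x,) = args                              # o_proj input: (b, t, hidden)
            b, t, hidden = x.shape
            heads = x.view(b, t, num_heads, -1)
            d = torch.as_tensor(directions[i], dtype=x.dtype, device=x.device)
            d = d / d.norm(dim=-1, keepdim=True)     # unit directions: (heads, head_dim)
            # Subtract each head's projection onto its bias direction.
            proj = torch.einsum("bthd,hd->bth", heads, d).unsqueeze(-1) * d
            return ((heads - alpha * proj).view(b, t, hidden),)
        handles.append(layer.self_attn.o_proj.register_forward_pre_hook(pre_hook))
    return handles  # call h.remove() on each handle to undo the steering
```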
To run the algorithm with the IDEFICS or InstructBLIP models, simply change the llava path to the idefics or instructBLIP directory, respectively.
A refactor of the code that consolidates everything into a single script is coming soon.
If you have any questions, please feel free to create an issue on this repository.
If you find this repo useful, please star (★) this repository or cite the following bibtex entry:
@article{adila2024discovering,
  title={Discovering Bias in Latent Space: An Unsupervised Debiasing Approach},
  author={Adila, Dyah and Zhang, Shuai and Han, Boran and Wang, Yuyang},
  journal={ICML},
  year={2024}
}
Our code is based on the LLaVA repository. We thank the authors for releasing their code.