This ongoing project aims to consolidate interesting efforts in the field of fairness in Large Language Models (LLMs), drawing on the proposed taxonomy and surveys dedicated to various aspects of fairness in LLMs.
Disclaimer: We may have missed some relevant papers in the list. If you have suggestions or want to add papers, please submit a pull request or email us—your contributions are greatly appreciated!
Tutorial: Fairness in Large Language Models in Three Hours
Thang Viet Doan, Zichong Wang, Nhat Hoang and Wenbin Zhang
Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), Boise, USA, 2024
Fairness in LLMs: Fairness in Large Language Models: A Taxonomic Survey
Zhibo Chu, Zichong Wang and Wenbin Zhang
ACM SIGKDD Explorations Newsletter, 2024
Introduction to LLMs: History, Development, and Principles of Large Language Models: An Introductory Survey
Zichong Wang, Zhibo Chu, Thang Viet Doan, Shiwen Ni, Min Yang and Wenbin Zhang
AI and Ethics, 2024
Fairness Definitions in LLMs: Fairness Definitions in Language Models Explained
Thang Viet Doan, Zhibo Chu, Zichong Wang and Wenbin Zhang
arXiv preprint arXiv:2407.18454, 2024
Email: [email protected] (Zichong Wang); [email protected] (Wenbin Zhang)
Mitigating Bias in LMs (Link to the paper)
- Social Bias Probing: Fairness Benchmarking for Language Models [EMNLP]
- Strategic Demonstration Selection for Improved Fairness in LLM In-Context Learning [EMNLP]
- Systematic Biases in LLM Simulations of Debates [EMNLP]
- Mitigating Language Bias of LMMs in Social Intelligence Understanding with Virtual Counterfactual Calibration [EMNLP]
- A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners [EMNLP]
- Mitigating Frequency Bias and Anisotropy in Language Model Pre-Training with Syntactic Smoothing [EMNLP]
- “You Gotta be a Doctor, Lin”: An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations [EMNLP]
- Humans or LLMs as the Judge? A Study on Judgement Bias [EMNLP]
- Walking in Others’ Shoes: How Perspective-Taking Guides Large Language Models in Reducing Toxicity and Bias [EMNLP]
- Decoding Matters: Addressing Amplification Bias and Homogeneity Issue in Recommendations for Large Language Models [EMNLP]
- Mitigate Extrinsic Social Bias in Pre-trained Language Models via Continuous Prompts Adjustment [EMNLP]
- Split and Merge: Aligning Position Biases in LLM-based Evaluators [EMNLP]
- Hidden Persuaders: How LLM Political Bias Could Sway Our Elections [EMNLP]
- Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble [EMNLP]
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [NeurIPS]
- Bias Amplification in Language Model Evolution: An Iterated Learning Perspective [NeurIPS]
- Probing Social Bias in Labor Market Text Generation by ChatGPT: A Masked Language Model Approach [NeurIPS]
- UniBias: Unveiling and Mitigating LLM Bias through Internal Attention and FFN Manipulation [NeurIPS]
- Allegator: Alleviating Attention Bias for Visual-Informed Text Generation [NeurIPS]
- A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks [NeurIPS]
- Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias [NeurIPS]
- Unveiling the Bias Impact on Symmetric Moral Consistency of Large Language Models [NeurIPS]
- Measuring and reducing gendered correlations in pre-trained models [arXiv]
- Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology [ACL]
- Deep Learning on a Healthy Data Diet: Finding Important Examples for Fairness [AAAI]
- Improving Gender Fairness of Pre-Trained Language Models without Catastrophic Forgetting [ACL]
- Data-Centric Explainable Debiasing for Improving Fairness in Pre-trained Language Models [ACL]
- Enhancing Model Robustness and Fairness with Causality: A Regularization Approach [ACL]
- Never Too Late to Learn: Regularizing Gender Bias in Coreference Resolution [WSDM]
- Sustainable Modular Debiasing of Language Models [ACL]
- Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection [ACL]
- Does Gender Matter? Towards Fairness in Dialogue Systems [COLING]
- Debiasing pretrained text encoders by paying attention to paying attention [EMNLP]
- Reducing Gender Bias in Word-Level Language Models with a Gender-Equalizing Loss Function [ACL]
- FineDeb: A Debiasing Framework for Language Models [arXiv]
- Debiasing algorithm through model adaptation [ICLR]
- DUnE: Dataset for Unified Editing [EMNLP]
- Reducing Sentiment Bias in Language Models via Counterfactual Evaluation [EMNLP]
- Using In-Context Learning to Improve Dialogue Safety [EMNLP]
- DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts [ACL]
- Evaluating Gender Bias in Large Language Models via Chain-of-Thought Prompting [arXiv]
- Queer People are People First: Deconstructing Sexual Identity Stereotypes in Large Language Models [ACL]
- Text Style Transfer for Bias Mitigation using Masked Language Modeling [NAACL]
- They, them, theirs: Rewriting with gender-neutral English [arXiv]
Quantifying Bias in LMs (Link to the paper)
Intrinsic bias: also known as upstream bias or representational bias, this refers to the inherent biases present in the output representations generated by a medium-sized LM (a minimal measurement sketch follows the paper list below).
- Similarity-based bias
- Probability-based bias
  - Measuring and reducing gendered correlations in pre-trained models [arXiv]
  - Measuring bias in contextualized word representations [arXiv]
  - Mitigating language-dependent ethnic bias in BERT [arXiv]
  - Masked language model scoring [arXiv]
  - StereoSet: Measuring stereotypical bias in pretrained language models [arXiv]
  - CrowS-Pairs: A challenge dataset for measuring social biases in masked language models [arXiv]
  - Unmasking the Mask – Evaluating Social Biases in Masked Language Models [AAAI]
  - Pro-Woman, Anti-Man? Identifying Gender Bias in Stance Detection [ACL]
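As a concrete illustration of the probability-based metrics above (e.g., the masked language model scoring used by CrowS-Pairs and StereoSet), the sketch below compares the pseudo-log-likelihood a masked LM assigns to a stereotypical sentence versus its anti-stereotypical counterpart. The model choice, the sentence pair, and the aggregation are illustrative assumptions, not the exact protocol of any single paper.

```python
# Minimal sketch of a probability-based intrinsic bias check: compare
# pseudo-log-likelihoods (cf. "Masked language model scoring") of a
# stereotypical vs. an anti-stereotypical sentence, CrowS-Pairs style.
# Model choice and the example pair are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when it is masked in turn."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip [CLS] (position 0) and [SEP] (last position).
    for i in range(1, input_ids.size(0) - 1):
        masked = input_ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        log_probs = torch.log_softmax(logits[0, i], dim=-1)
        total += log_probs[input_ids[i]].item()
    return total

stereo = "Women are bad at math."
anti_stereo = "Men are bad at math."
gap = pseudo_log_likelihood(stereo) - pseudo_log_likelihood(anti_stereo)
print(f"PLL gap (stereotypical - anti-stereotypical): {gap:.3f}")
# A consistently positive gap across many such pairs suggests the model
# assigns higher likelihood to the stereotypical phrasing.
```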
Extrinsic bias: also known as downstream bias or prediction bias, this refers to the disparity in an LM's performance across different downstream tasks (a minimal measurement sketch follows the task list below).
- Classification
  - Bias in Bios: A case study of semantic representation bias in a high-stakes setting [arXiv]
- Natural Language Inference
- Question Answering
  - BBQ: A hand-built bias benchmark for question answering [arXiv]
- Recommender Systems
  - UP5: Unbiased foundation model for fairness-aware recommendation [arXiv]
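For extrinsic bias, evaluation typically compares a downstream metric across demographic groups, for example occupation-classification accuracy per gender in the Bias in Bios setting. The sketch below computes a per-group accuracy gap and a true-positive-rate (equal-opportunity) gap; the labels, predictions, and group tags are hypothetical placeholders.

```python
# Minimal sketch of extrinsic (downstream) bias measurement on a classifier:
# compare accuracy and true-positive rate across demographic groups and report
# the gap. The toy predictions, labels, and group tags below are placeholders.
from collections import defaultdict

def group_metrics(y_true, y_pred, groups):
    """Per-group accuracy and true-positive rate for binary labels (1 = positive)."""
    stats = defaultdict(lambda: {"correct": 0, "n": 0, "tp": 0, "pos": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        s = stats[g]
        s["n"] += 1
        s["correct"] += int(t == p)
        if t == 1:
            s["pos"] += 1
            s["tp"] += int(p == 1)
    return {
        g: {
            "accuracy": s["correct"] / s["n"],
            "tpr": s["tp"] / s["pos"] if s["pos"] else float("nan"),
        }
        for g, s in stats.items()
    }

# Placeholder data: 1 = "predicted suitable for the role"; groups are self-reported gender.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
groups = ["F", "F", "F", "F", "M", "M", "M", "M"]

per_group = group_metrics(y_true, y_pred, groups)
acc_gap = abs(per_group["F"]["accuracy"] - per_group["M"]["accuracy"])
tpr_gap = abs(per_group["F"]["tpr"] - per_group["M"]["tpr"])  # equal-opportunity gap
print(per_group)
print(f"accuracy gap: {acc_gap:.2f}, TPR gap: {tpr_gap:.2f}")
```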
Fairness in these models is evaluated using the following strategies designed to quantify it:
- Demographic Representation: the systematic discrepancy in the frequency of mentions of different demographic groups within the generated text (a minimal sketch appears right after this list).
- Stereotypical Association: the systematic discrepancy in the model's associations between demographic groups and specific stereotypes, which reflects societal prejudice.
- Counterfactual Fairness: the model's sensitivity to demographic-specific terms, measuring how changes to these terms affect its output (a second sketch appears after the dataset list below).
- Performance Disparities: the systematic variation in accuracy or other performance metrics when the model is applied to tasks involving different demographic groups.
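A minimal sketch of the Demographic Representation strategy: sample continuations from neutral prompts and count how often terms associated with each demographic group appear. The model, prompts, and word lists are illustrative assumptions.

```python
# Minimal sketch of the Demographic Representation strategy: sample continuations
# from neutral prompts and count mentions of terms for each demographic group.
# The model, prompts, and word lists are illustrative assumptions.
import re
from collections import Counter
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = ["The doctor said that", "The nurse said that", "The engineer said that"]
group_terms = {
    "male": {"he", "him", "his", "man", "men"},
    "female": {"she", "her", "hers", "woman", "women"},
}

counts = Counter()
for prompt in prompts:
    outs = generator(prompt, max_new_tokens=30, num_return_sequences=10,
                     do_sample=True, pad_token_id=generator.tokenizer.eos_token_id)
    for out in outs:
        tokens = re.findall(r"[a-z']+", out["generated_text"].lower())
        for group, terms in group_terms.items():
            counts[group] += sum(tok in terms for tok in tokens)

total = sum(counts.values()) or 1
for group, c in counts.items():
    print(f"{group}: {c} mentions ({c / total:.1%} of group-term mentions)")
# A strong skew toward one group across many neutral prompts signals a
# demographic-representation imbalance in the generated text.
```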
Commonly used datasets and benchmarks for evaluating bias include:
- WinoBias
- WinoBias+
- WinoGender
- WinoQueer
- BEC-Pro
- BUG
- GAP
- StereoSet
- HONEST
- Bias-NLI
- CrowS-Pairs
- EEC
- PANDA
- RedditBias
- TrustGPT
- FairPrism
- BOLD
- RealToxicityPrompts
- HolisticBias
- BBQ
- UnQover
- CEB
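Building on the Counterfactual Fairness strategy described above (which several of the listed datasets, such as HolisticBias, operationalize at scale with templated prompts), the sketch below swaps a demographic term in a prompt template, samples continuations, and compares a simple downstream signal, here the sentiment of the continuation. The models, template, term pairs, and sentiment proxy are assumptions for illustration, not a benchmark protocol.

```python
# Minimal sketch of counterfactual-fairness probing: swap a demographic term in a
# prompt template, generate continuations, and compare a downstream score
# (here, sentiment of the continuation as a rough proxy). Models, template,
# and term pairs are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
sentiment = pipeline("sentiment-analysis")  # default English sentiment model

template = "The {group} engineer walked into the meeting and"
term_pairs = [("male", "female"), ("young", "elderly")]

def avg_sentiment(prompt: str, n: int = 5) -> float:
    """Mean signed sentiment score over n sampled continuations."""
    outs = generator(prompt, max_new_tokens=30, num_return_sequences=n,
                     do_sample=True, pad_token_id=generator.tokenizer.eos_token_id)
    texts = [o["generated_text"][len(prompt):] for o in outs]
    scores = sentiment(texts)
    return sum(s["score"] if s["label"] == "POSITIVE" else -s["score"] for s in scores) / n

for a, b in term_pairs:
    gap = avg_sentiment(template.format(group=a)) - avg_sentiment(template.format(group=b))
    print(f"{a} vs. {b}: sentiment gap = {gap:+.3f}")
# Large, consistent gaps across many templates indicate that the model's output
# changes systematically with the demographic term alone.
```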
If you find that our taxonomic survey helps your research, we would appreciate citations to the following paper:
@article{chu2024fairness,
title={Fairness in Large Language Models: A Taxonomic Survey},
author={Chu, Zhibo and Wang, Zichong and Zhang, Wenbin},
journal={ACM SIGKDD Explorations Newsletter},
volume={26},
number={1},
pages={34--48},
year={2024},
publisher={ACM New York, NY, USA}
}
If you find that our introduction survey helps your research, we would appreciate citations to the following paper:
@article{wang2024history,
title={History, Development, and Principles of Large Language Models: An Introductory Survey},
author={Wang, Zichong and Chu, Zhibo and Doan, Thang Viet and Ni, Shiwen and Yang, Min and Zhang, Wenbin},
journal={AI and Ethics},
year={2024},
publisher={Springer}
}
If you find that our definition survey helps your research, we would appreciate citations to the following paper:
@misc{doan2024fairnessdefinitionslanguagemodels,
title={Fairness Definitions in Language Models Explained},
author={Thang Viet Doan and Zhibo Chu and Zichong Wang and Wenbin Zhang},
year={2024},
eprint={2407.18454},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.18454},
}