Code for the paper Disentangling Dialect from Social Bias via Multitask Learning to Improve Fairness.
For details on the approach, architecture and idea, please see the published paper.
```bibtex
@inproceedings{spliethover-etal-2024-disentangling,
    title = "Disentangling Dialect from Social Bias via Multitask Learning to Improve Fairness",
    author = "Splieth{\"o}ver, Maximilian  and
      Menon, Sai Nikhil  and
      Wachsmuth, Henning",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.553",
    pages = "9294--9313",
}
```
- Each approach-specific directory contains a `data-preparation.py` script, which should be run before any training scripts.
- One general and several approach-specific data preparation scripts exist. Run the general preparation script first, then the approach-specific scripts.
- Use the `sbic-data-preparation.ipynb` notebook to prepare the TwitterAAE and SBIC corpora.
- The AAE dialect identification is the last part of the preprocessing, as later approaches use the annotations it produces.
The dataset published with the paper "Investigating African-American Vernacular English in Transformer-Based Text Generation" is used to train the dialect classifier. The `sae-aave-pairs/` directory from the dataset is expected to be present in `aae-classification/data/`.
The base model (DeBERTa-v3-large) is expected in `./aae-classification/model`.
The code in the `twitteraae` directory was originally published with "Demographic Dialectal Variation in Social Media: A Case Study of African-American English" by Su Lin Blodgett, Lisa Green, and Brendan O'Connor, EMNLP 2016. We use the code and approach as a baseline.
1. `./aae-classification/data-preparation.py`
2. `./aae-classification/data-splits.py`
3. `./aae-classification/train-weights.sh`
4. `./aae-classification/train-interleaving.sh`
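The intended run order above can be sketched as follows. This is a minimal illustration, not part of the repository: it assumes the `.py` scripts are run with `python` and the `.sh` scripts with `bash`, and the `runner` parameter exists only so the sequence can be exercised without the actual scripts.

```python
"""Sketch of the run order for the AAE-classifier scripts listed above."""
import subprocess

# Preparation scripts must run before the training scripts (paths from the
# repository listing; invocation commands are assumptions).
AAE_PIPELINE = [
    ["python", "./aae-classification/data-preparation.py"],
    ["python", "./aae-classification/data-splits.py"],
    ["bash", "./aae-classification/train-weights.sh"],
    ["bash", "./aae-classification/train-interleaving.sh"],
]

def run_pipeline(steps, runner=subprocess.run):
    """Execute each step in order; check=True aborts on the first failure."""
    for cmd in steps:
        runner(cmd, check=True)
```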
The approach expects the pre-processed SBIC corpus (see "Social Bias Frames: Reasoning about Social and Power Implications of Language", Sap et al., ACL 2020) to be present in `./data/sbic-data/`.
1. `./finetuning/train.sh`
2. `./finetuning/inference-seeded.sh`
The approach expects the pre-processed (and AAE-dialect-annotated) SBIC corpus (see "Social Bias Frames: Reasoning about Social and Power Implications of Language", Sap et al., ACL 2020) to be present in `./aae-classification/output/sbic-test_aae-annotated-deberta-v3-base-aee-classifier`.
1. `./joint-multitask-learning/data-preparation.py`
2. `./joint-multitask-learning/train*.sh` (depending on the approach you want to train)
3. `./joint-multitask-learning/inference*.sh` (depending on the approach trained before)
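As a small illustration of how the `train*.sh` / `inference*.sh` globs select the approach-specific scripts: the `.sh` filenames below are invented examples (the actual script names are not listed here), only the glob patterns come from the instructions above.

```python
"""Illustration of selecting approach-specific scripts via glob patterns."""
import fnmatch

scripts = [
    "data-preparation.py",
    "train-multitask.sh",      # hypothetical approach-specific script
    "train-adapter.sh",        # hypothetical
    "inference-multitask.sh",  # hypothetical
]

# The patterns from the run instructions above.
train_scripts = fnmatch.filter(scripts, "train*.sh")
inference_scripts = fnmatch.filter(scripts, "inference*.sh")
```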
The trained models evaluated in the paper can be found on huggingface.co:
The result files can be found in the model repositories specified above. Specifically:
- The TwitterAAE classification results can be found here. The `results/twitteraae-dialect-classification` directory contains the classification results of the baseline, the weighted loss model, and the data subsampling model on the TwitterAAE dataset.
- The SBIC data with AAE dialect annotations, based on our classifier, can be found here, in the `results/sbic-dialect-classification/` directory.
- The bias classification results from all models shown and evaluated in the paper can be found here, in the `results/sbic-bias-classificaiton/` directory. Each model was run with five different random seeds, as indicated by the `-seedX` postfix of each result file.
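Collecting the per-seed result files of one model can be sketched as below. The filenames are invented examples; only the `-seedX` postfix convention comes from the description above.

```python
"""Sketch of grouping per-seed result files by model name."""
import re
from collections import defaultdict

def group_by_model(filenames):
    """Map each model name to the list of its per-seed result files."""
    groups = defaultdict(list)
    for name in filenames:
        # Split "<model>-seed<X>.<ext>" into model name and seed number.
        match = re.match(r"(.+)-seed(\d+)\.\w+$", name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)

files = [  # invented names following the -seedX scheme
    "multitask-seed1.csv",
    "multitask-seed2.csv",
    "finetuned-seed1.csv",
]
```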