This repository contains the data, code and instructions to reproduce the results of the paper "No Word Embedding Model Is Perfect: Evaluating the Representation Accuracy for Social Bias in the Media", as published at the EMNLP 2022 conference. Please find the full reference below:
```bibtex
@InProceedings{spliethoever:2022,
  address   = {Abu Dhabi, United Arab Emirates},
  author    = {Maximilian Splieth{\"o}ver and Maximilian Keiff and Henning Wachsmuth},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2022},
  month     = dec,
  publisher = {Association for Computational Linguistics},
  title     = {{No Word Embedding Model Is Perfect: Evaluating the Representation Accuracy for Social Bias in the Media}},
  url       = {https://aclanthology.org/2022.findings-emnlp.152/},
  year      = 2022
}
```
The dataset files and trained models can be obtained through Zenodo.
The database file containing the news articles has two main tables, `article_urls` and `article_contents`. The tables have the following columns:

`article_urls`:

- `uuid`: An ID that uniquely identifies this URL entry. This column is used as the primary key for the table.
- `url`: The plain-text URL of the news article, as found in CommonCrawl.
- `outlet_name`: The name of the news outlet that published the article.

`article_contents`:

- `uuid`: An ID that uniquely identifies this content entry. This column is used as the primary key for the table. The key is the same as in the `article_urls` table to allow for cross-referencing.
- `date`: The automatically extracted publishing date of the article. If the date could not be extracted automatically, this field remains empty.
- `content`: The plain-text content of the article, automatically extracted from the crawled HTML document.
- `content_preprocessed`: The article's content, split into sentences.
- `language`: The language of the article, as identified by the langdetect module (as an ISO 639-1 code).
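For a quick look at the data outside of the provided helpers, the two tables can be joined on their shared `uuid` key. The sketch below assumes the dataset file is an SQLite database; the filename `articles.db` is a placeholder, the actual name is defined in `src/embedding_bias/config.py` (see below).

```python
# Minimal sketch: join article_urls and article_contents on their shared uuid key.
# Assumes the dataset file is an SQLite database; "articles.db" is a placeholder
# name -- use the filename configured in src/embedding_bias/config.py instead.
import sqlite3

import pandas as pd

with sqlite3.connect("data/raw/articles.db") as conn:
    df = pd.read_sql(
        """
        SELECT u.uuid, u.url, u.outlet_name, c.date, c.language, c.content
        FROM article_urls AS u
        JOIN article_contents AS c ON c.uuid = u.uuid
        """,
        conn,
    )

print(df.head())
```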
To load the database into Python as a `pandas.DataFrame`, you can use the `get_articles_as_df()` method from `src/embedding_bias/util.py`. An example usage can be found in any of the Jupyter notebooks in the `notebooks` directory.
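A hedged usage sketch of that helper is shown below; the exact import path and call signature are assumptions, so refer to the notebooks for the actual usage.

```python
# Hedged sketch: the import path and call signature are assumptions; see the
# notebooks in the notebooks/ directory for the actual usage.
from src.embedding_bias.util import get_articles_as_df

articles = get_articles_as_df()  # returns the articles as a pandas.DataFrame
print(articles.head())
```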
In case you want to re-create the dataset or expand the crawling using our code, you can use the following scripts (in that order; see the sketch after this list):

- `src/get_article_urls.py`: uses `bin/gau` to retrieve a list of news article URLs from CommonCrawl for the outlets specified in `data/raw/outlet-config.json`.
- `src/get_article_contents.py`: retrieves the full article content for each URL retrieved in the first step.
- `src/language_detection.py`: tries to automatically detect the language of each article (some outlets publish in multiple languages), making it easier to filter for a single language later on.
- `src/sentencize_articles.py`: splits all articles into sentences to make them easier to work with later on.
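For orientation, here is a minimal driver sketch that runs the four scripts in the order listed above. It assumes the scripts can be invoked without additional command-line arguments; check each script's argument handling before running it.

```python
# Sketch of running the crawling pipeline in order. Assumes the scripts need no
# additional CLI arguments (verify each script's argument handling first).
import subprocess

PIPELINE = [
    "src/get_article_urls.py",
    "src/get_article_contents.py",
    "src/language_detection.py",
    "src/sentencize_articles.py",
]

for script in PIPELINE:
    # check=True stops the pipeline early if a stage fails, since each stage
    # depends on the output of the previous one.
    subprocess.run(["python", script], check=True)
```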
After either generating the dataset yourself (see the instructions above) or downloading the database from the link specified above, place the database file in the `data/raw` directory. The name of the database file is defined in `src/embedding_bias/config.py`; all scripts use this variable to find the database.
All necessary Python packages are described in the `Pipfile`. If you prefer not to use pipenv, we also provide a standard `requirements.txt` file (auto-generated from pipenv).
Follow the instructions below to train the respective models on the data.
word2vec models (Static embeddings): To train the static word2vec embeddings, use the code in the `notebooks/embedding-generation.ipynb` notebook. Execute the two preparation cells at the top, as well as all cells under the "word2vec models (Static embeddings)" section. Since the tokenization code uses the Hugging Face tokenizer, this part requires a lot of memory (128-256 GB in our tests), but it is very fast for this amount of text.
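For reference, the sketch below shows what static word2vec training with gensim looks like; the notebook's actual tokenization and hyperparameters may differ.

```python
# Illustrative sketch of static word2vec training with gensim; the hyperparameters
# and the toy corpus below are placeholders, not the notebook's actual setup.
from gensim.models import Word2Vec

tokenized_sentences = [
    ["the", "senate", "passed", "the", "bill"],
    ["the", "bill", "was", "vetoed"],
]  # placeholder; use the sentencized article data of one political orientation

model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=300,  # embedding dimensionality
    window=5,         # context window size
    min_count=1,      # kept at 1 so the toy corpus trains; raise for real data
    workers=4,        # parallel training threads
)
model.save("word2vec-example.model")  # placeholder output path
```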
Frequency Agnostic models (FreqAgn embeddings): To train the frequency-agnostic word embedding model, start with the code in the `notebooks/embedding-generation.ipynb` notebook to prepare the data. The cells under the respective section heading read the article data from the database and convert it to the format that the original training scripts expect. Make sure to use a version of `tokenizers` >= 0.12.x; otherwise the tokenization won't work as expected. Then, to train the models, run the `frage-lstm-train.sh` script from within the `src/Frequency-Agnostic` directory. After the training has finished, additionally run the `frage-lstm-pointer.sh` script. The script trains a model for only a single political orientation at a time. You can change this by altering the `ORIENTATION` parameter in the script. Possible values are "left", "center", and "right". Also, make sure to either delete or rename the `dictionary_..` directory; it contains files specific to the dataset the model was trained on (i.e., the articles for each orientation), and you will otherwise get an error message during training. You can find our Weights & Biases training logs here.
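A quick way to verify the `tokenizers` version requirement mentioned above before preparing the data (a convenience check, not part of the original scripts):

```python
# Convenience check for the tokenizers >= 0.12.x requirement noted above.
from packaging import version

import tokenizers

assert version.parse(tokenizers.__version__) >= version.parse("0.12.0"), (
    f"tokenizers {tokenizers.__version__} found, but >= 0.12.x is required "
    "for the data preparation to work as expected"
)
```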
Decontextualized models (Decontext embeddings): To generate the decontextualized word embeddings, use the code in the `notebooks/embedding-generation.ipynb` notebook. The first cells load and tokenize the data. We use the Hugging Face tokenizer here, which can require a lot of memory (for us, 128-256 GB), but it is very fast. If that is not an option for you, consider replacing the tokenization code with a tokenizer of your choice. The rest of the code then searches the data for all sentences that contain tokens of interest and generates a contextualized embedding for each occurrence. After doing that for all sentences of a token, the code generates a pooled embedding. Lastly, all pooled embeddings are saved to disk as a dictionary.
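To make the described procedure concrete, here is a small sketch of the decontextualization idea: collect the contextualized vectors of a target token over many sentences and mean-pool them into a single static vector. The model name, subword matching, pooling, and output format are assumptions for illustration, not the notebook's exact implementation.

```python
# Illustrative sketch of decontextualized embeddings: average the contextual
# vectors of a target token over all sentences that contain it. Model choice,
# subword matching, and pooling are assumptions, not the notebook's exact code.
import pickle

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def decontextualized_embedding(target: str, sentences: list[str]) -> torch.Tensor:
    """Mean-pool the contextual embeddings of `target` over the given sentences."""
    target_ids = set(tokenizer(target, add_special_tokens=False)["input_ids"])
    vectors = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
        # Collect the hidden states at all positions whose subword belongs to the target.
        for pos, token_id in enumerate(inputs["input_ids"][0].tolist()):
            if token_id in target_ids:
                vectors.append(hidden[pos])
    return torch.stack(vectors).mean(dim=0)

sentences = ["The senator gave a speech.", "A senator proposed the bill."]
pooled = {"senator": decontextualized_embedding("senator", sentences)}
with open("decontext-example.pkl", "wb") as f:  # placeholder output path
    pickle.dump(pooled, f)
```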
BERT fine-tuned: Since the BERT fine-tuning uses the same data format as the frequency-agnostic model training, either re-use the data you already generated for that model or refer to the notes above to generate it. You can then use the `finetune-mlm-bert.sh` script to fine-tune BERT. The script trains a model for only a single political orientation at a time. You can change this by altering the `ORIENTATION` parameter in the script. Possible values are "left", "center", and "right". You can find our Weights & Biases training logs here.
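As a rough illustration of what masked-language-model fine-tuning with the Hugging Face Trainer looks like (the model name, hyperparameters, and toy data below are assumptions; the actual configuration lives in the script):

```python
# Illustrative MLM fine-tuning sketch; model name, hyperparameters, and the toy
# corpus are placeholders, not the contents of finetune-mlm-bert.sh.
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

texts = ["The senate passed the bill.", "The bill was vetoed."]  # placeholder articles
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-mlm-example", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```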
Temporal models: The temporal models use the same process as the decontextualized models; the only difference being the data they are trained on. You can simply follow the cells under the respective section heading to prepare the data and train the models.
All evaluation results can be generated using the code in the `notebooks/embedding-evaluation.ipynb` notebook. After training all models (or downloading the pre-trained ones and placing them in the correct directory), the notebook evaluates them using different word embedding similarity benchmarks and social bias measures.
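As a generic illustration of one of these evaluations, a word-similarity benchmark compares the cosine similarities produced by an embedding model against human similarity ratings via Spearman's rank correlation. The word pairs, ratings, and vectors below are placeholders; the notebook's benchmarks and bias measures differ in detail.

```python
# Generic word-similarity benchmark sketch: correlate model cosine similarities
# with human ratings. Word pairs, ratings, and vectors below are placeholders.
import numpy as np
from scipy.stats import spearmanr

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=300) for w in ["car", "automobile", "tree", "forest"]}
pairs = [("car", "automobile", 9.0), ("car", "tree", 1.5), ("tree", "forest", 8.0)]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
correlation, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {correlation:.3f}")
```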
For more details, please refer to the paper.
This project uses some code from other projects that was not written by the authors of this paper, or was only slightly modified by them. Please refer to the `README` files of the specific sub-directories for more information.
Furthermore, the categorization of news outlets into political orientations was provided by allsides.com.