This repository contains the data files and scripts corresponding to the paper "Neural Media Bias Detection Using Distant Supervision With BABE".
The trained models are hosted at https://zenodo.org/record/5861846#.YeUoolkxkvE because of GitHub's file size restrictions.
- "raw_labels_MBIC.xlsx": individual annotator labels of MBIC's crowdsourcers.
- "raw_labels_SG1.xlsx": individual annotator labels of SG1 (8 expert annotators).
- "raw_labels_SG2.xlsx": individual annotator labels of SG2 (5 expert annotators).
- "final_labels_MBIC.xlsx": MBIC's aggregated labels over all annotators based on majority vote (1700 sentences).
- "final_labels_SG1.xlsx": SG1's aggregated labels over all annotators based on majority vote (same 1700 sentences as in MBIC).
- "final_labels_SG2.xlsx": SG2's aggregated labels over all annotators based on majority vote (3700 sentences).
- "silver-standart-dataset.xlsx": Silver standard dataset containing 1000 additional unlabeled sentences with potential biased text instances.
Columns:
- "text": sentences extracted from news articles and labeled in terms of bias and opinion.
- "news_link": url to the news article from which the sentence is extracted.
- "outlet": news platform publishing the news article.
- "topic": news topic.
- "type": political orientation of news platform according to mediacloud.org.
- "label_bias": bias label for the sentence ("Biased" or "Non-biased").
- "label_opinion": opinion label for the sentence ("Expresses writer's opinion" or "Somewhat factual but also opinionated" or "Entirely factual".
- "biased_words": words marked as biased by the annotators.
- "bias_word_lexicon.xlsx": dictionary of biased words used to craft features
- "dt_final_SG1.xlsx": final SG1 with engineered features
- "dt_final_SG2.xlsx": final SG2 with engineered features
- "news_headlines_usa_biased.csv": data set with distant labels of class biased
- "news_headlines_usa_neutral.csv": data set with distant labels of class neutral
- "data_set_evaluation.ipynb": script containing relevant code and results for the evaluation of the data sets (agreement calculations, label distribution...).
- "features_engineering.ipynb": engineering features for the baseline classifier
- "classification_baseline_model.ipynb": training and evaluation of the baseline classifier
- "classification.ipynb": training and evaluation of neural language models
- "distant_supervision.ipynb": pre-training on the data set with distant labels
- "topics_keywords_platforms.txt": a file containing all news topics, keywords to retrieve relevant news articles, and news platforms for the data set creation.
- "annotator_demographics.csv": a file containing demographic information about the annotators (The corresponding demographic questionnaire can be found under "demographic_questionnaire.pdf") .
@InProceedings{Spinde2021f,
title = "Neural Media Bias Detection Using Distant Supervision With {BABE} - Bias Annotations By Experts",
author = "Spinde, Timo and
Plank, Manuel and
Krieger, Jan-David and
Ruas, Terry and
Gipp, Bela and
Aizawa, Akiko",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-emnlp.101",
doi = "10.18653/v1/2021.findings-emnlp.101",
pages = "1166--1177",
}
More about our work can be found here: https://media-bias-research.org/