MAGPIE: Multi-Task Analysis of Media-Bias Generalization with Pre-Trained Identification of Expressions

This repository contains all resources from the paper "MAGPIE: Multi-Task Analysis of Media-Bias Generalization with Pre-Trained Identification of Expressions". MAGPIE is the first large-scale multi-task learning (MTL) approach for detecting media bias. To train MAGPIE, we introduce the Large Bias Mixture (LBM), a comprehensive pre-training composition of 59 bias-related tasks encompassing linguistic bias, gender bias, group bias, and others.

1. Getting started

2. MTL Framework

3. Reproduce the results

4. Run your experiments

5. Citation

1. Getting started

Install Python dependencies

To use the framework or run inference, first install the Python dependencies:

pip install -r requirements.txt

wandb.ai API Access

Our training framework uses Weights & Biases to track all experiments. Please add your API key to the local.env file. You can get an API key for free at wandb.ai.
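
A minimal sketch of how the key could be read from local.env, assuming the file holds plain KEY=VALUE lines and the key is stored under WANDB_API_KEY (the environment variable the wandb client reads). The helper name load_env_file is hypothetical and not part of this repository:

```python
import os
from pathlib import Path


def load_env_file(path: str = "local.env") -> dict:
    """Parse simple KEY=VALUE lines and export them to os.environ."""
    loaded = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        # setdefault keeps a value that is already set in the shell.
        os.environ.setdefault(key, value)
    return loaded
```

Calling load_env_file() before wandb.init() would make the key available to the client without hard-coding it in a script.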

2. MTL Framework

This repository contains code for training models in a multi-task learning fashion through the MTL framework.

Datasets

We make our Large Bias Mixture (LBM) collection available in the datasets directory. All datasets are in a processed and cleaned state. Each dataset has a corresponding preprocessor class under the preprocessing directory and a corresponding preprocessing script under the scripts/preprocessors directory. Preprocessing operates on the raw data, which can be found in our Hugging Face repository PLACEHOLDER FOR ANONYMOUS SUBMISSION.
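
As a rough illustration of what a per-dataset preprocessor might do, the sketch below normalizes raw text rows and drops empty or duplicate entries. The class name, method names, and cleaning rules are hypothetical; the actual preprocessor classes in the preprocessing directory may work differently.

```python
import re


class ExamplePreprocessor:
    """Hypothetical preprocessor turning raw (text, label) rows into cleaned rows."""

    URL_PATTERN = re.compile(r"https?://\S+")
    WHITESPACE_PATTERN = re.compile(r"\s+")

    def clean_text(self, text: str) -> str:
        # Remove URLs, then collapse runs of whitespace into single spaces.
        text = self.URL_PATTERN.sub("", text)
        return self.WHITESPACE_PATTERN.sub(" ", text).strip()

    def process(self, rows):
        seen = set()
        cleaned = []
        for text, label in rows:
            text = self.clean_text(text)
            # Drop rows that are empty after cleaning or repeat an earlier text.
            if not text or text in seen:
                continue
            seen.add(text)
            cleaned.append((text, label))
        return cleaned
```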

Training

├─ training
     |
     ├─── data
     |
     ├─── model
     |    └─── optimization
     |    
     ├─── trainer
     |
     └─── tokenizer
          └─── mb-mtl-tokenizer

The training subdirectory consists of three main components:

  • The data directory contains the data structures.
  • The model directory contains the definition of the model architecture and the classes for gradient manipulation.
  • The trainer directory contains the main Trainer class (trainer.py), which orchestrates the whole multi-task training.

For further details, please refer to the training directory.
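
To make the idea of multi-task training with gradient aggregation concrete, the following sketch averages per-task gradients of shared parameters before a single update step. It is a plain-Python illustration under simplifying assumptions, not the repository's trainer; the function names and the toy gradients are made up for the example.

```python
from typing import Dict, List


def aggregate_mean(task_grads: Dict[str, List[float]]) -> List[float]:
    """Average the per-task gradients of the shared parameters element-wise."""
    grads = list(task_grads.values())
    n_tasks = len(grads)
    return [sum(component) / n_tasks for component in zip(*grads)]


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """Apply one plain gradient-descent update to the shared parameters."""
    return [p - lr * g for p, g in zip(params, grad)]


# Toy gradients of two tasks with respect to the same two shared parameters.
task_grads = {"task_a": [0.2, -0.4], "task_b": [0.6, 0.0]}
shared_params = [1.0, 1.0]

mean_grad = aggregate_mean(task_grads)
shared_params = sgd_step(shared_params, mean_grad, lr=0.5)
```

Averaging is only one possible aggregation rule; it treats all tasks equally, which is why frameworks of this kind often pair it with loss scaling or other gradient manipulation.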

3. Reproduce the results

All experiments can be reproduced by running the scripts in the scripts directory. Each subdirectory in scripts/ has a run_experiment.py file defining the whole experiment.

  • scripts/ablation_study contains an evaluation of the HSES and Resurrection optimization strategies
  • scripts/gradts_task_selection contains a four-step pipeline for selecting the auxiliary tasks based on the GradTS algorithm
  • scripts/hyperparameter_tuning contains a hyperparameter search for robust selection of hyperparameters
  • scripts/lbm_taxonomy_analysis contains a script for co-training tasks based on task families
  • scripts/evaluation_robust contains the final MAGPIE evaluation over 30 random seeds
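
Evaluating over many random seeds yields one score per run, and a common way to report them is by mean and standard deviation. The sketch below assumes a hypothetical run_once(seed) function that returns a single metric; dummy_run stands in for a full training and evaluation run and produces placeholder numbers, not results from the paper.

```python
import random
import statistics
from typing import Callable, Dict, Iterable


def evaluate_over_seeds(run_once: Callable[[int], float], seeds: Iterable[int]) -> Dict[str, float]:
    """Run one experiment per seed and summarize the resulting scores."""
    scores = [run_once(seed) for seed in seeds]
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }


def dummy_run(seed: int) -> float:
    """Placeholder for a full train/eval run; returns a seed-dependent fake score."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.02, 0.02)


summary = evaluate_over_seeds(dummy_run, seeds=range(30))
```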

4. Run your experiments

You can run your own experiments at several levels of customization. The training can be customized through the adjustments listed below:

  1. Add your own datasets/tasks.
  2. Define your own task-specific head in the model heads class. Classification, regression, and language modelling tasks are already implemented.
  3. Choose an encoder-only model and define it in enums/model_checkpoints.
  4. Adjust the fixed training parameters (e.g., MAX_NUMBER_OF_STEPS, random seed, etc.) in config.py.
  5. Write your own execution script with the desired training parameters. An example:
     import wandb
     from config import head_specific_lr, head_specific_max_epoch, head_specific_patience
     from enums.aggregation_method import AggregationMethod
     from enums.model_checkpoints import ModelCheckpoint
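     from enums.scaling import LossScaling
     from enums.splits import Split  # assumed module path for Split (used in trainer.eval below); adjust if it is defined elsewhere
     from training.data import YOUR_TASK_A, YOUR_TASK_B, YOUR_TASK_C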
     from training.model.helper_classes import EarlyStoppingMode, Logger
     from training.trainer.trainer import Trainer
     from utils import set_random_seed
    
     EXPERIMENT_NAME = "EXPERIMENT NAME"
    
    
     tasks = [YOUR_TASK_A, YOUR_TASK_B, YOUR_TASK_C]
    
     # Call process() on every subtask before training.
     for t in tasks:
         for st in t.subtasks_list:
             st.process()
    
     config = {
       "sub_batch_size": 32,
       "eval_batch_size": 128,
       "initial_lr": 4e-5,
       "dropout_prob": 0.1,
       "hidden_dimension": 768,
       "input_dimension": 768,
       "aggregation_method": AggregationMethod.MEAN,
       "early_stopping_mode": EarlyStoppingMode.HEADS,
       "loss_scaling": LossScaling.STATIC,
       "num_warmup_steps": 10,
       "pretrained_path": None,
       "resurrection": True,
       "model_name": "YOUR_MODEL_NAME",
       "head_specific_lr_dict": head_specific_lr,
       "head_specific_patience_dict": head_specific_patience,
       "head_specific_max_epoch_dict": head_specific_max_epoch,
       "logger": Logger(EXPERIMENT_NAME),
     }
    
    
     set_random_seed()  # default is 321
     wandb.init(project=EXPERIMENT_NAME, name="YOUR_MODEL_NAME")
     trainer = Trainer(task_list=tasks, LM=ModelCheckpoint.ROBERTA, **config)
     trainer.fit()
     trainer.eval(split=Split.TEST)
     trainer.save_model()
     wandb.finish()
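
The head_specific_patience_dict and EarlyStoppingMode.HEADS options in the example suggest that each task head can be stopped independently once its validation metric stops improving. The sketch below shows generic patience-based early stopping tracked per head. The class and its logic are assumptions for illustration only and do not reproduce the repository's implementation (in particular, the Resurrection strategy is not modelled).

```python
from typing import Dict, List


class PerHeadEarlyStopping:
    """Track the best validation loss per head and stop a head after `patience` bad evaluations."""

    def __init__(self, patience: Dict[str, int]):
        self.patience = patience
        self.best = {head: float("inf") for head in patience}
        self.bad_evals = {head: 0 for head in patience}

    def update(self, head: str, val_loss: float) -> bool:
        """Record a new validation loss; return True if the head should stop training."""
        if val_loss < self.best[head]:
            # Improvement: remember the new best and reset the counter.
            self.best[head] = val_loss
            self.bad_evals[head] = 0
        else:
            self.bad_evals[head] += 1
        return self.bad_evals[head] >= self.patience[head]

    def active_heads(self) -> List[str]:
        """Heads that have not yet exhausted their patience."""
        return [h for h in self.patience if self.bad_evals[h] < self.patience[h]]
```

In a training loop, update() would be called after each evaluation of a head, and training would continue only for the heads returned by active_heads().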

5. Citation

Please cite us as:

@inproceedings{Horych2024a,
title = {MAGPIE: Multi-Task Analysis of Media-Bias Generalization with Pre-Trained Identification of Expressions},
author = {Tomas Horych and Martin Wessel and Jan Philip Wahle and Terry Ruas and Jerome Wassmuth and Andre Greiner-Petter and Akiko Aizawa and Bela Gipp and Timo Spinde},
url = {https://media-bias-research.org/wp-content/uploads/2024/04/Horych2024a.pdf},
year = {2024},
date = {2024-02-01},
urldate = {2024-02-01},
booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation},
keywords = {nlp,bias},
pubstate = {published},
tppubtype = {inproceedings}
}