This project contains instructions and code to reconstruct a dataset for the development and evaluation of forensic tools for detecting machine-generated text on social media.
- We are not releasing the full Twitter data, to comply with the Twitter terms of service.
- We are also not releasing the generators or the machine-generated data, for ethical reasons.
There are 3 main steps:
- Rehydrate source data from tweet ids
- Fine-tune natural language generation (NLG) models
- Generate machine-generated text from the NLG models and construct human, machine, and mixed timelines of tweets
We source data from 3 English-language Twitter datasets on 3 different topics:
- Vaccine: mostly anti-vax discussion around vaccines from https://github.com/gmuric/avax-tweets-dataset
- COVID: general COVID discussion from around the world, from `datasets/covid_ids.csv`. This dataset was specifically constructed to have a significant number of users with more than one tweet. As per the Twitter terms of service, only tweet ids are provided.
- Climate: general climate change discussion from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5QCCUU
Convert each source file of tweet ids to a csv with a single `id` column:

```
id
1372199058945757184
1371797650098503680
1371465465692835840
1371013740573946112
1370348099491794944
1370130417471385600
```
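A minimal sketch of this conversion, assuming the raw source file is itself a csv; the input path, output path, and the name of the id column in the raw file (`tweet_id` here) are placeholders that will differ per source:

```python
import pandas as pd

def to_id_csv(src_path: str, dst_path: str, id_column: str = "tweet_id") -> None:
    # Keep only the tweet id column and rename it to `id`.
    df = pd.read_csv(src_path)
    df[[id_column]].rename(columns={id_column: "id"}).to_csv(dst_path, index=False)

to_id_csv("raw/climate_source.csv", "datasets/climate_ids.csv")
```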
To rehydrate tweet bodies from tweet ids, use the `rehydrate` function in `utils.py`, which fetches the tweets from Twitter. The output is saved as `.jsonl`; each row is a tweet. You'll need valid Twitter credentials. We keep 100k samples from each dataset.
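As an illustration of the output format and the 100k subsampling, a short sketch (the `.jsonl` path is a placeholder; see `utils.py` for the actual `rehydrate` signature and output location):

```python
import json
import random

# Each line of the rehydrated .jsonl file is one tweet as a JSON object.
with open("datasets/covid.jsonl") as f:
    tweets = [json.loads(line) for line in f]

# Keep 100k samples from the dataset.
random.seed(0)
tweets = random.sample(tweets, k=min(100_000, len(tweets)))
```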
The next step is to fine-tune several NLG models.
We preprocess each dataset by:
- Cleaning up tweet formatting
- Dropping `RT` and `@mentions` for retweets
- Dropping duplicates
- Dropping boring tweets, i.e. tweets with fewer than 5 non-trivial words, where a trivial word begins with `http`, `#`, or `@` (borrowed from huggingtweets); see the sketch below
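The actual cleaning lives in `utils.py`; the sketch below only illustrates the boring-tweet rule and the retweet/duplicate dropping on plain tweet strings, not the repo's exact implementation:

```python
import re

TRIVIAL_PREFIXES = ("http", "#", "@")

def is_boring(tweet: str, min_words: int = 5) -> bool:
    # A word is trivial if it starts with http, # or @ (links, hashtags, mentions).
    nontrivial = [w for w in tweet.split() if not w.startswith(TRIVIAL_PREFIXES)]
    return len(nontrivial) < min_words

def preprocess(tweets):
    cleaned, seen = [], set()
    for t in tweets:
        t = re.sub(r"^RT\s+@\w+:?\s*", "", t).strip()  # drop the RT @user prefix of retweets
        if t and t not in seen and not is_boring(t):
            seen.add(t)
            cleaned.append(t)
    return cleaned
```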
We split the clean data into train and validation sets (95%/5%) to monitor model training for early stopping. Refer to the `create_dataset` function in `utils.py`.
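The split can be as simple as the following sketch; `create_dataset` in `utils.py` is the authoritative version:

```python
import random

def split_train_val(tweets, val_frac=0.05, seed=0):
    # Shuffle, then hold out 5% for validation-based early stopping.
    rng = random.Random(seed)
    shuffled = list(tweets)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    return shuffled[n_val:], shuffled[:n_val]
```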
We fine-tune 4 different pre-trained NLG models for each dataset: gpt2, gpt2-medium, gpt2-large, and EleutherAI/gpt-neo-1.3B. We use pre-trained models from huggingface transformers: https://huggingface.co/models
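For reference, any of these checkpoints loads through the standard transformers API (this is only an illustration, independent of the training script):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"  # or "gpt2", "gpt2-large", "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 style tokenizers ship without a pad token; reusing EOS is a common choice.
tokenizer.pad_token = tokenizer.eos_token
```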
To fine-tune:
```
python train_deepspeed.py --help
usage: train_deepspeed.py [-h] [--lm_name LM_NAME] [--dataset DATASET] [--mode MODE] [--gpu GPU] [--batch_size BATCH_SIZE] [--model_batch_size MODEL_BATCH_SIZE]
                          [--num_samples NUM_SAMPLES] [--strategy STRATEGY] [--max_epochs MAX_EPOCHS]

optional arguments:
  -h, --help            show this help message and exit
  --lm_name LM_NAME     huggingface model name (default: EleutherAI/gpt-neo-2.7B)
  --dataset DATASET     dataset name (default: avax)
  --mode MODE           mode = train,generate (default: train)
  --gpu GPU             gpus to use (default: 0)
  --batch_size BATCH_SIZE
                        desired total batch size (default: 32)
  --model_batch_size MODEL_BATCH_SIZE
                        batch that fits on gpu (default: 2)
  --num_samples NUM_SAMPLES
                        number of samples to generate (default: 1000)
  --strategy STRATEGY   model parallelization strategy, use deepspeed_2 or 3 for large models to shard (default: None)
  --max_epochs MAX_EPOCHS
                        max epochs (default: 5)
```
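For example, to fine-tune gpt2-medium on the avax dataset with the settings described below:

```
python train_deepspeed.py --lm_name gpt2-medium --dataset avax --batch_size 32 --model_batch_size 2 --max_epochs 5
```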
We concatenate all tweet text with `EOS_TOKEN` and split it into chunks of max_length 72 with an overlap of 4 tokens. We use an effective batch size of 32 and train for 5 epochs with a learning rate of 5e-5, optimizing the causal language modeling objective, i.e. next-word prediction. We keep the model with the best loss on the validation set. For large models, you may need a large GPU or a model parallelism strategy such as `deepspeed_2` or `deepspeed_3` (the `--strategy` flag).
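A sketch of the concatenate-and-chunk step; the 72/4 window follows the description above, while the tokenizer choice and the sample tweets are placeholders:

```python
from transformers import AutoTokenizer

def chunk_tokens(token_ids, max_length=72, overlap=4):
    # Slide a window of max_length tokens, stepping by max_length - overlap,
    # so consecutive chunks share `overlap` tokens.
    step = max_length - overlap
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), step)]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tweets = ["first cleaned tweet", "second cleaned tweet"]  # placeholder training split
text = tokenizer.eos_token.join(tweets) + tokenizer.eos_token
chunks = chunk_tokens(tokenizer.encode(text))
```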
We construct two types of data: pure timelines and mixed timelines. A timeline is a sequence of tweets.
In a pure timeline, all tweets are either machine-generated or human-generated. To generate a timeline of N tweets:
- Machine-generated: generate N tweets using an NLG `NLG_NAME` trained on `DATASET`:

  ```
  python train_deepspeed.py --mode generate --num_samples N --lm_name NLG_NAME --dataset DATASET
  ```
- Human-generated: sample N tweets from the dataset (see the sketch after this list). For the COVID dataset, we have enough tweets per user, so we first sample a user `USER` and then sample `N` tweets from `USER`'s tweets.
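A sketch of the COVID-style human timeline sampling, assuming the rehydrated tweets sit in a pandas DataFrame with `user` and `text` columns (the column names are assumptions):

```python
import pandas as pd

def sample_human_timeline(df: pd.DataFrame, n: int, seed: int = 0) -> list:
    # Pick a user with at least n tweets, then sample n of that user's tweets.
    counts = df.groupby("user").size()
    eligible = counts[counts >= n].index.to_series()
    user = eligible.sample(1, random_state=seed).iloc[0]
    return df.loc[df["user"] == user, "text"].sample(n, random_state=seed).tolist()
```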
Mixed timelines contain a mix of human and machine-generated tweets. To generate a timeline of length `N` with `K` machine-generated tweets (see the sketch after this list):
- We first sample `N-K` human-generated tweets from `DATASET` (conditioned on the same user if that info is available)
- Then concatenate them with a sample of `K` machine-generated tweets from an NLG `NLG_NAME` trained on `DATASET`. Note that human-generated tweets always come before machine-generated tweets.
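A minimal sketch of the mixed-timeline assembly, assuming the human and machine tweets have already been sampled as plain strings per the rules above:

```python
import random

def mixed_timeline(human_tweets, machine_tweets, n, k, seed=0):
    # N-K human tweets followed by K machine-generated tweets;
    # human tweets always come before machine tweets.
    rng = random.Random(seed)
    return rng.sample(human_tweets, n - k) + rng.sample(machine_tweets, k)
```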
For full dataset details see: dataset.md