DialogRPT predicts human feedback (upvotes👍 or replies💬) of dialogue responses.
It is a set of dialog response ranking models proposed by Microsoft Research NLP Group trained on 100 + millions of human feedback data, accepted to appear at EMNLP'20. It can be used to improve existing dialog generation model (e.g., DialoGPT) by re-ranking the generated response candidates. This repo provides a PyTorch implementation and pretrained models.
Quick links:
- Paper
- Intro Talk and Slides
- Demo: original or HuggingFace
- Dataset
We considered the following tasks and provided corresponding pretrained models. (Click 💾 to download original pytorch checkpoint for this repo, or click 🤗 to use HuggingFace model card)
Task | Description | Pretrained model |
---|---|---|
Human feedback | ||
updown |
How likely the response gets the most upvotes? | 💾 / 🤗 |
width |
How likely the response gets the most direct replies? | 💾 / 🤗 |
depth |
How likely the response gets the longest follow-up thread? | 💾 / 🤗 |
Human-like (human vs fake) | ||
human_vs_rand |
How relevant the response is for the given context? | 💾 / 🤗 |
human_vs_machine |
How likely the response is human-written rather than machine-generated? | 💾 / 🤗 |
- Quick Start
- Install, try this demo, or use Hugging Face model card with this demo
- Use rankers only: use DialogRPT as evalution metric
- Use generator + ranker: improve generator by reranking hypotheses with DialogRPT
- Data
- Training
- Evaluation
- Citation
Option 1: run locally
git clone https://github.com/golsun/DialogRPT
cd DialogRPT
conda create -n dialogrpt python=3.6
conda activate dialogrpt
pip install -r requirements.txt
Option 2: run on Colab Notebook. You can either use Demo (original) or Demo (HuggingFace)
In the following example, the model predicts that, given the same context "I love NLP!", response "Here’s a free textbook (URL) in case anyone needs it." is gets more upvotes than response "Me too!".
python src/score.py play -p=restore/updown.pth
#
# Context: I love NLP!
# Response: Here’s a free textbook (URL) in case anyone needs it.
# score = 0.613
# Context: I love NLP!
# Response: Me too!
# score = 0.111
You can also play the ensemble model, which involves multiple models defined in its config file (see this file for details).
python src/main.py play -p=restore/ensemble.yml
To score a list of (context, response) pairs, please provide a input file (--data
), which is tab-separated in format context \t response0 \t response1 ...
. See example input file
- Using a single ranker (see expected output)
python src/score.py test --data=doc/toy.tsv -p=restore/updown.pth
# downloading pretrained model to restore/updown.pth
# 100% [....................] 1520029114 / 1520029114
# loading from restore/updown.pth
# ranking doc/toy.tsv
# totally processed 2 line, avg_hyp_score 0.264, top_hyp_score 0.409
# results saved to doc/toy.tsv.ranked.jsonl
- Using an ensemble model (see expected output)
python src/score.py test --data=doc/toy.tsv -p=restore/ensemble.yml
Statistics of the scoring results can be shown with the following command, e.g. for doc/toy.tsv.ensemble.jsonl
python src/score.py stats --data=doc/toy.tsv.ensemble.jsonl
# |best |avg
# ----------------------------------------
# _score |0.339 |0.206
# human_vs_rand |0.928 |0.861
# human_vs_machine |0.575 |0.525
# updown |0.409 |0.264
# depth |0.304 |0.153
# width |0.225 |0.114
# final |0.339 |0.206
# ----------------------------------------
# n_cxt: 2
# avg n_hyp per cxt: 2.50
Dialog generation models can be improved by integrating with the response ranking models.
For example, given the context "Can we restart 2020?", DialoGPT may return the following responses by sampling decoding (or you can try beam search without --sampling
). Some of them, e.g., "Yes, we can." has a high generation probability (gen 0.496
), but less interesting (ranker 0.302
). So the rankers will put in position lower than ones more likely to be upvoted, e.g. "I think we should go back to the beginning, and start from the beginning." which is relatively less likely to be generated (gen 0.383
) but seems more interesting (ranker 0.431
)
python src/generation.py play -pg=restore/medium_ft.pkl -pr=restore/updown.pth --sampling
#
# Context: Can we restart 2020?
# 0.431 gen 0.383 ranker 0.431 I think we should go back to the beginning, and start from the beginning.
# 0.429 gen 0.227 ranker 0.429 I think I'll just sit here and wait for 2020
# 0.377 gen 0.249 ranker 0.377 Yeah, let's just start from the beginning
# 0.323 gen 0.195 ranker 0.323 I think we should just give up and let the year just pass.
# 0.304 gen 0.395 ranker 0.304 Yes. We can.
# 0.302 gen 0.496 ranker 0.302 Yes, we can.
# 0.283 gen 0.351 ranker 0.283 It's been a while since we've seen a good reboot.
# 0.174 gen 0.306 ranker 0.174 I'm up for it
# 0.168 gen 0.463 ranker 0.168 I'm down
# 0.153 gen 0.328 ranker 0.153 I think so, yes.
# ...
Similarly, you can use the ensemble model.
python src/generation.py -pg=restore/medium_ft.pkl -pr=restore/ensemble.yml
To generate from a list of contexts stored in a line-separated file, provide it with --path_test
and use the command below:
python src/generation.py test --path_test=path/to/list/of/contexts -pg=restore/medium_ft.pkl -pr=restore/ensemble.yml
As the Pushshift Reddit dataset was deleted from this server, the data extraction pipeline of this release no longer works. As an alternative, you may want to use the Pushshift API.
We use DialoGPT to initialize the model. Please download with
wget https://convaisharables.blob.core.windows.net/lsp/multiref/medium_ft.pkl -P restore
For the human feedback prediction tasks, we specify min_score_gap
and min_rank_gap
to only validate on less-noisy samples (not applied to training).
python src/main.py train --data=data/out/updown -p=restore/medium_ft.pkl --min_score_gap=20 --min_rank_gap=0.5
python src/main.py train --data=data/out/depth -p=restore/medium_ft.pkl --min_score_gap=4 --min_rank_gap=0.5
python src/main.py train --data=data/out/width -p=restore/medium_ft.pkl --min_score_gap=4 --min_rank_gap=0.5
For human_vs_rand
task, use the --mismatch
flag to feed rand human response as negative examples. We can reuse previous dataset (e.g. data/out/updown
).
python src/main.py train --data=data/out/updown -p=restore/medium_ft.pkl --mismatch
For human_vs_machine
task, we build dataset by pair human response with a response generated by DialoGPT with topk decoding
python src/main.py train --data=data/out/human_vs_machine -p=restore/medium_ft.pkl
We trained all models on a Nvidia V100 4-core GPU (each core with 32G memory) with the following hyperparameters. Checkpoint with the best validation accuracy is used as final model.
Argument | Value | Description |
---|---|---|
batch |
256 | total batch size for all GPUs. |
vali_size |
1024 | number of samples used for validation (i.e. dev set size). |
lr |
3e-05 | learning rate |
max_seq_len |
50 | max allowed sequence length. if longer, leading tokens will be truncated |
max_hr_gap |
1 | max allowed hour difference between positive and negative samples. If longer, this pair will be discarded for train/vali |
The performance on updown
, depth
, and width
can be measured with the following commands, respectively.
The --min_score_gap
and --min_rank_gap
arguments are consistent with the values used to measure validation loss during training.
python src/score.py eval_human_feedback -p=restore/updown.pth --data=test/human_feedback/updown.tsv --min_score_gap=20 --min_rank_gap=0.5
python src/score.py eval_human_feedback -p=restore/depth.pth --data=test/human_feedback/depth.tsv --min_score_gap=4 --min_rank_gap=0.5
python src/score.py eval_human_feedback -p=restore/width.pth --data=test/human_feedback/width.tsv --min_score_gap=4 --min_rank_gap=0.5
The expected pairwise accuracy on 5000 test samples is listed in the table below (from Table 5 of the paper). Note even by random guess one can get accuracy of 0.500.
human feedback | updown |
depth |
width |
---|---|---|---|
Dialog ppl. | 0.488 | 0.508 | 0.513 |
Reverse dialog ppl. | 0.560 | 0.557 | 0.571 |
DialogRPT (ours) | 0.683 | 0.695 | 0.752 |
human_vs_rand
task: Although the model is trained onreddit
corpus only, we measured its zero-shot performance on several unseen corpora (twitter
,dailydialog
andpersonachat
)
python src/score.py eval_human_vs_rand -p=restore/human_vs_rand.pth --data=test/human_vs_fake/reddit
python src/score.py eval_human_vs_rand -p=restore/human_vs_rand.pth --data=test/human_vs_fake/dailydialog
python src/score.py eval_human_vs_rand -p=restore/human_vs_rand.pth --data=test/human_vs_fake/twitter
python src/score.py eval_human_vs_rand -p=restore/human_vs_rand.pth --data=test/human_vs_fake/personachat
The expected hits@k
metric on 5000 test samples is listed in the table below (from Table 7 of the paper).
hits@k
measures, for the same context, given k
positive responses and n
negative responses, how many positive responses are in top-k
of the ranked responses.
human_vs_rand |
reddit |
dailydialog |
twitter |
personachat |
---|---|---|---|---|
BM25 | 0.309 | 0.182 | 0.178 | 0.117 |
Dialog ppl. | 0.560 | 0.176 | 0.107 | 0.108 |
Reverse dialog ppl. | 0.775 | 0.457 | 0.440 | 0.449 |
ConveRT | 0.760 | 0.380 | 0.439 | 0.197 |
DialogRPT (ours) | 0.886 | 0.621 | 0.548 | 0.479 |
human_vs_machine
task: its performance is only evaluated forreddit
corpus.
python src/score.py --task=eval_human_vs_machine -p=restore/human_vs_machine.pth --data=test/human_vs_fake/reddit
# expecting accuracy ~0.98
If you use our dataset or model, please cite our paper
@inproceedings{gao2020dialogrpt,
title={Dialogue Response RankingTraining with Large-Scale Human Feedback Data},
author={Xiang Gao and Yizhe Zhang and Michel Galley and Chris Brockett and Bill Dolan},
year={2020},
booktitle={EMNLP}
}