A conversational benchmark for factual consistency
This is the repository for the paper: What Was Your Name Again? Interrogating Generative Conversational Models For Factual Consistency Evaluation
With this code you can evaluate the factual consistency of a generative model using the dataset and the hybrid pipeline described in the paper:
- Finetune the model on the PersonaChat train set (not required, but recommended for a fair comparison).
- Run:

  `python evaluate.py --model_checkpoint={path to your model} --model_sep_tokens={your model's special tokens}`
  The second parameter is the list of special tokens (if any) used during finetuning to separate agent and user utterances; it defaults to `["<user>", "<agent>"]`.
- The code first curates the Consistest dataset and then runs inference and evaluation. The results are saved in the `results` folder. Note that the code assumes your model and tokenizer can be loaded with the `.from_pretrained()` method of the `transformers` library (see the sketch after this list).
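As a minimal sketch of what that assumption means (the checkpoint path is a placeholder, and a causal LM is assumed here; your model class may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path; any local directory or Hub name readable by
# .from_pretrained() should work.
checkpoint = "path/to/finetuned-model"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The tokens passed via --model_sep_tokens are assumed to already be in the
# tokenizer's vocabulary from finetuning on PersonaChat.
print(tokenizer.convert_tokens_to_ids(["<user>", "<agent>"]))
```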
You can also run the evaluation on already generated responses (as a `.csv` file) by running `python evaluate.py --eval_only=True --responses_to_eval={path to response file}`.
Note that the `.csv` file should follow the expected format, i.e. with `Question`, `Response`, `Reference`, `NLI_Ref`, `Source`, `Type`, `Distance` as columns.
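As a rough illustration (the row values below are made up; only the column names come from this README, and their exact semantics follow the paper), such a file could be assembled with pandas:

```python
import pandas as pd

# Hypothetical example row; replace with your model's generated responses.
rows = [
    {
        "Question": "What was your name again?",
        "Response": "I'm Alex, nice to meet you.",
        "Reference": "My name is Alex.",
        "NLI_Ref": "My name is Alex.",
        "Source": "persona",
        "Type": "wh",
        "Distance": 3,
    },
]

pd.DataFrame(rows).to_csv("my_responses.csv", index=False)
```

The resulting file can then be scored with `python evaluate.py --eval_only=True --responses_to_eval=my_responses.csv`.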
In both cases, since the code uses the `Trainer` class from `transformers` for NLI inference, all available GPUs will be used. This can be changed by prepending `CUDA_VISIBLE_DEVICES={list of GPU ids}` to the `python evaluate.py` command.
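For example, to restrict inference to the first two GPUs (the GPU ids here are only an example):

`CUDA_VISIBLE_DEVICES=0,1 python evaluate.py --model_checkpoint={path to your model}`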