This repository contains the source code for the user model evaluator proposed in *GCDF1: A Goal- and Context-Driven F-Score for Evaluating User Models*, along with guidance on how to reproduce the results presented in our paper.
Currently, the evaluator only supports the MultiWOZ 2.1 dataset, and there are no plans to extend it to other MultiWOZ versions or SGD unless there is community interest in doing so - see the Invited Contributions section for details about this project.
This assumes you have `anaconda3` or `miniconda3` installed on your system. To set up the environment:

- run the command below to create the virtual environment:

   ```bash
   conda env create -f environment.lock.yml
   ```

- activate the new environment with:

   ```bash
   conda activate gcdf1
   ```
NOTE: The conda environment will have `gcdf1` installed in editable mode. Some changes, e.g. in `setup.cfg`, might require you to run `pip install -e .` again.
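As a quick sanity check of the editable install, the following minimal sketch can be used (it assumes the package exposes a `__version__` attribute, as PyScaffold-generated projects usually do):

```python
# Quick sanity check that the editable install works.
import gcdf1

# PyScaffold-generated packages usually expose a __version__ attribute; if that
# assumption does not hold here, a successful import is already a good sign.
print(getattr(gcdf1, "__version__", "no __version__ attribute found"))
```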
For those interested in contributing to the project, the following steps are needed only once, after `git clone`:
- update the environment to contain packages needed during development:

   ```bash
   conda env update -f dev_environment.yml --prune
   ```
- install several pre-commit git hooks with:

   ```bash
   pre-commit install
   # You might also want to run `pre-commit autoupdate`
   ```

   and check out the configuration under `.pre-commit-config.yaml`.
   The `-n, --no-verify` flag of `git commit` can be used to deactivate pre-commit hooks temporarily.
- install nbstripout git hooks to remove the output cells of committed notebooks with:

   ```bash
   nbstripout --install --attributes getting_started/.gitattributes
   ```

   This is useful to avoid large diffs due to plots in your notebooks. A simple `nbstripout --uninstall` will revert these changes.
We provide conversations generated according to the setup described in Section 4.1 of our paper for the MultiWOZ 2.1 test set in the `models/convlab_baselines/test` directory.
The evaluator can be run with the `evaluate` command after the environment has been activated, so type

```bash
evaluate --helpfull
```

and check out the `gcdf1.evaluate_user` section to understand the CLI for our evaluator. The command to generate the `baseline.json` file in the `models/convlab_baselines` folder is

```bash
evaluate --prediction_dir models/convlab_baselines/test -o baseline.json -c configs/multiwoz_user_evaluator.yaml -cmap resources/multiwoz21_canonical_map.json
```

and should be run from the repository root.
To familiarise yourself with the conversation format, have a close look at the conversations in `resources/sample_conversation.json`. These conversations largely follow the Schema-Guided Dialogue (SGD) dataset format. We obtained these conversations in two steps:
- Obtain conversations between a `Convlab2` system and user model in the same format as MultiWOZ 2.1
- Convert the conversations to SGD format using a modified version of the `create_data_from_multiwoz.py` script provided in the Schema-Guided Dialogue codebase
Our script added the following information to each dialogue:

- `final_goal`: dict storing the final goal after `Convlab2` processing. This is NOT used by the evaluator.
- `goal`: dict storing the dialogue goal, in the same format as the `Convlab2` goal model for MultiWOZ 2.1.
- `metrics`: metrics computed by `Convlab2` for the conversation; these are NOT used by the evaluator.
We also added the following field to each turn:

- `nlu`: this field contains a single key, `frames`, with the same structure as the turn (i.e., an `actions` key which stores a list of actions recognised by the agent and a `service` key which indicates the domain). The evaluator will still work if this is not in the schema - NLU validation features should be disabled in the evaluator config in this case.
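As a quick way to explore this structure, a small sketch along these lines can be used (it assumes the sample file holds SGD-style dialogue dicts, possibly as a list; adjust if the layout differs):

```python
# Minimal sketch for inspecting the conversation format described above.
import json

with open("resources/sample_conversation.json") as f:
    data = json.load(f)

dialogue = data[0] if isinstance(data, list) else data

# Dialogue-level fields added by our conversion script.
print("goal present:", "goal" in dialogue)
print("final_goal present:", "final_goal" in dialogue)
print("Convlab2 metrics present:", "metrics" in dialogue)

# Turn-level structure: each turn has "frames" with "actions" and "service",
# and the added "nlu" field mirrors that structure under its "frames" key.
for turn in dialogue.get("turns", [])[:2]:
    for frame in turn.get("frames", []):
        print(turn.get("speaker"), frame.get("service"), frame.get("actions"))
    if "nlu" in turn:
        print("nlu frames:", turn["nlu"].get("frames"))
```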
The evaluator outputs a complex hierarchical structure containing the behaviours detected for each metric. See the documentation of the `gcdf1.utils.multiwoz_output.py` module for an overview of the behaviours.
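One way to get a feel for this structure is to walk the output file and print its nested keys; the path below assumes the `baseline.json` generated by the command above, and the key names are simply whatever the evaluator emitted:

```python
# Illustrative sketch: walk the hierarchical evaluator output and print its
# nested keys. The concrete behaviour names depend on the evaluator config,
# so nothing below assumes particular key names.
import json

with open("models/convlab_baselines/baseline.json") as f:
    results = json.load(f)

def show(node, depth=0, max_depth=2):
    """Print nested dictionary keys up to max_depth to get a feel for the hierarchy."""
    if not isinstance(node, dict) or depth > max_depth:
        return
    for key, value in node.items():
        print("  " * depth + str(key))
        show(value, depth + 1, max_depth)

show(results)
```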
Make sure you have activated the `gcdf1` environment and installed `jupyter` with the command

```bash
pip install jupyter
```

Start a Jupyter notebook server from the repository root (e.g. with `jupyter notebook`). Then navigate to the `getting_started` folder and run the `IGCDF1.ipynb` and `RGCDF1.ipynb` notebooks.
When developing the library, we recognised the need to keep the implementation general to ensure compatibility with all MultiWOZ versions and other datasets. However, for pragmatic reasons, not all of the implementation is agnostic to MultiWOZ, and minor refactoring is required so that some of the functions work correctly irrespective of the dataset. If you are interested in helping extend the framework, review the code and get in touch - the code author is happy to guide and support your endeavour and thus provide the community with a tool to reliably measure the performance of their user models.
The canonical map for MultiWOZ 2.1 contains over 10,000 value paraphrases extracted from the corpus. It is thus straightforward to implement a more robust slot-error-rate (SER) metric. The author is happy to provide basic code and implementation guidance for this. The starting point could be the `value_generated` function implemented in the `gcdf1.utils.evaluator` module. Integration of the NLG metrics proposed in this paper is also desired.
Any SER-like metric can be integrated into the current framework to compute F1 scores based on natural language as opposed to generated actions - get in touch to discuss this!
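A rough sketch of what a paraphrase-aware value check could look like is shown below; the map layout and the `slot_value_matches` helper are illustrative assumptions, not the actual schema or API of the package:

```python
# Hypothetical sketch of a paraphrase-aware slot value check built on the
# canonical map. The {slot: {canonical_value: [paraphrases, ...]}} layout and
# the slot_value_matches helper are assumptions for illustration only; inspect
# resources/multiwoz21_canonical_map.json for the real schema.
import json

with open("resources/multiwoz21_canonical_map.json") as f:
    canonical_map = json.load(f)

def slot_value_matches(slot: str, generated: str, reference: str) -> bool:
    """Count a generated value as correct if it equals the reference value or
    one of its known paraphrases."""
    generated, reference = generated.lower().strip(), reference.lower().strip()
    if generated == reference:
        return True
    paraphrases = canonical_map.get(slot, {}).get(reference, [])
    return generated in {p.lower() for p in paraphrases}
```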
Some basic test examples are included in the `tests/` folder. The library contains various TODOs marking functions that are not unit-tested. If the environment has been updated with the development dependencies as detailed in the installation section, then you can run the tests from the root by simply invoking the `tox` or `py.test` commands.
```
├── AUTHORS.md <- List of developers and maintainers.
├── CHANGELOG.md <- Changelog to keep track of new features and fixes.
├── LICENSE.txt <- License as chosen on the command-line.
├── README.md <- The top-level README for developers.
├── configs <- Directory containing the evaluator config files.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
├── docs <- Directory for Sphinx documentation in rst or md.
├── environment.yml <- The conda environment file for reproducibility.
├── dev_environment.yml <- The conda environment file containing developer dependencies
├── models <- Trained and serialized models, model predictions,
│ or model summaries. Each model has its own folder (e.g., convlab_baselines)
│ and the test/dev/train subdirectories are expected in each model directory.
├── getting_started <- Jupyter notebooks that can be run to reproduce paper results.
├── pyproject.toml <- Build system configuration. Do not change!
├── references <- Data dictionaries, manuals, and all other materials.
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated plots and figures for reports.
├── resources <- Contains the MultiWOZ2.1 canonical map file (multiwoz21_canonical_map.json) and the
│ json file that it was generated from (multiwoz21_raw_canonical_map.json) using
│ the script scripts/preprocess_multiwoz_canonical_map.py
├── scripts <- This contains the get_multiwoz21_raw_canonical_map.py script
│ that can be used to generate a new "raw" canonical map and the
│ preprocess_multiwoz_canonical_map.py which can be used to convert the raw canonical map
│ to a format compatible with the evaluator.
├── setup.cfg <- Declarative configuration of the project.
├── setup.py <- Use `pip install -e .` to install for development or
│ create a distribution with `tox -e build`.
├── src
│ └── gcdf1 <- Actual Python package where the evaluator is implemented
├── tests <- Unit tests which can be run with `py.test`.
├── .coveragerc <- Configuration for coverage reports of unit tests.
├── .isort.cfg <- Configuration for git hook that sorts imports.
└── .pre-commit-config.yaml <- Configuration of pre-commit git hooks.
```
If you use any of the code in this repository or any of our methods, please cite our paper:
```bibtex
@inproceedings{coca-etal-2021-gcdf1,
title = "{GCDF}1: A Goal- and Context- Driven {F}-Score for Evaluating User Models",
author = "Coca, Alexandru and
Tseng, Bo-Hsiang and
Byrne, Bill",
booktitle = "The First Workshop on Evaluations and Assessments of Neural Conversation Systems",
month = nov,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.eancs-1.2",
pages = "7--14",
abstract = "The evaluation of dialogue systems in interaction with simulated users has been proposed to improve turn-level, corpus-based metrics which can only evaluate test cases encountered in a corpus and cannot measure system{'}s ability to sustain multi-turn interactions. Recently, little emphasis was put on automatically assessing the quality of the user model itself, so unless correlations with human studies are measured, the reliability of user model based evaluation is unknown. We propose GCDF1, a simple but effective measure of the quality of semantic-level conversations between a goal-driven user agent and a system agent. In contrast with previous approaches we measure the F-score at dialogue level and consider user and system behaviours to improve recall and precision estimation. We facilitate scores interpretation by providing a rich hierarchical structure with information about conversational patterns present in the test data and tools to efficiently query the conversations generated. We apply our framework to assess the performance and weaknesses of a Convlab2 user model.",
}
```
This project has been set up using PyScaffold 4.0.1 and the dsproject extension 0.6.1.