Perplexity Metric Card #3905

Merged
merged 8 commits into from
Mar 16, 2022
Merged
Changes from 3 commits
Commits
metrics/perplexity/README.md (61 additions, 0 deletions)
# Metric Card for Perplexity

## Metric Description
Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence.
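In its standard formulation (a general definition, not quoted from this metric's implementation), the perplexity of a tokenized sequence x_1, ..., x_t under a causal language model with parameters θ is the exponentiated average negative log-likelihood of the tokens:

```latex
\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t} \log p_\theta(x_i \mid x_{<i})\right)
```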

## Intended Uses
Any language generation task.

## How to Use

```python
from datasets import load_metric
perplexity = load_metric("perplexity")
results = perplexity.compute(model_id='gpt2', input_texts=input_texts)
```

### Inputs
- **model_id** (str): model used for calculating Perplexity. NOTE: Perplexity can only be calculated for causal language models.
- This includes models such as gpt2, causal variations of bert, causal versions of t5, and more (the full list can be found in the AutoModelForCausalLM documentation here: https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForCausalLM )
- **input_texts** (list of str): input text, with each separate text snippet as one list entry. The returned perplexity is the average of the perplexities of the individual list entries.
- **stride** (int): stride size of the sliding window used when scoring the input tokens; defaults to 512 (see the sketch after this list)
- **device** (str): device to run on, defaults to 'cuda' when available
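
Below is a minimal sketch of how a strided (sliding-window) perplexity computation over a causal language model can look, assuming the `transformers` and `torch` libraries; the model id, input text, and variable names are illustrative, and this is not the metric's actual implementation:

```python
# A rough sketch of a strided perplexity computation for a causal LM.
# Assumes the `transformers` and `torch` libraries; the model id, text, and
# variable names below are illustrative, not the metric's actual code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
stride = 512

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

encodings = tokenizer("some long input text ...", return_tensors="pt")
max_length = model.config.n_positions          # gpt2 context window (1024 tokens)
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end = 0
for begin_loc in range(0, seq_len, stride):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end               # tokens not yet scored in this window
    input_ids = encodings.input_ids[:, begin_loc:end_loc]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100            # mask tokens already scored earlier
    with torch.no_grad():
        loss = model(input_ids, labels=target_ids).loss
    nlls.append(loss * trg_len)                # un-average the window loss
    prev_end = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / seq_len)
print(float(ppl))
```

A smaller stride means more overlap between windows, so each scored token sees more preceding context, at the cost of more forward passes.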

### Output Values
This metric outputs a dictionary with a single key, `perplexity`, whose value is the average perplexity score for the text inputs in the list.

```
{'perplexity': 117.9}
```

This metric's values range from 0 upward, with no upper bound. A lower score is better, indicating that the model found the input text more likely.

### Examples
Calculating perplexity on input_texts defined here:
```python
import datasets

perplexity = datasets.load_metric("perplexity")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
results = perplexity.compute(model_id='gpt2',
                             input_texts=input_texts,
                             stride=1)
round(results["perplexity"], 1)
>>> 78.2
```
Calculating perplexity on input_texts loaded from a dataset:
```python
import datasets

perplexity = datasets.load_metric("perplexity")
input_texts = datasets.load_dataset("wikitext",
                                    "wikitext-2-raw-v1",
                                    split="test")["text"][:10]

results = perplexity.compute(model_id='gpt2',
                             input_texts=input_texts,
                             stride=256)
round(results["perplexity"], 1)
>>> 117.9
```

## Limitations and Bias
Note that the output value depends heavily on the text the model was trained on. This means that perplexity scores are not comparable across models or datasets.

## Citation