Note: When reporting results from the eval harness, please include the task versions (shown in results["versions"]) for reproducibility. Versioning allows bugs in tasks to be fixed while ensuring that previously reported scores remain reproducible. See the Task Versioning section for more info.
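As a minimal sketch of reporting versions next to scores, the snippet below assumes the results dictionary has a top-level "results" mapping (task name to metrics) alongside the "versions" mapping mentioned above. The helper name `format_scores_with_versions` is hypothetical, and the numbers in the example dictionary are placeholders, not real evaluation output.

```python
def format_scores_with_versions(results):
    """Return one line per task, pairing each metric with the task version.

    Assumes `results` looks like:
        {"results": {task: {metric: value}}, "versions": {task: version}}
    """
    lines = []
    for task, metrics in sorted(results["results"].items()):
        # Fall back to "unknown" so a missing version is visible, not silent.
        version = results["versions"].get(task, "unknown")
        metric_text = ", ".join(
            f"{name}={value:.4f}" for name, value in sorted(metrics.items())
        )
        lines.append(f"{task} (version {version}): {metric_text}")
    return lines


# Placeholder values for illustration only, not real scores.
example = {
    "results": {"hellaswag": {"acc": 0.5, "acc_stderr": 0.01}},
    "versions": {"hellaswag": 0},
}
for line in format_scores_with_versions(example):
    print(line)
```

Keeping the version in the same line as the score makes it harder to copy one into a report without the other.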
To evaluate a model hosted on the HuggingFace Hub (e.g. GPT-J-6B) on hellaswag, you can use the following command:
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/gpt-j-6B \
    --tasks hellaswag \
    --device cuda:0
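When the same evaluation is launched from a script, it can help to assemble the command as an argument list instead of a hand-edited string. The following is a minimal sketch under that assumption; `build_eval_command` is a hypothetical helper, and it only uses the flags that appear in the command above.

```python
import shlex


def build_eval_command(model, model_args, tasks, device):
    """Build the argument list for an invocation of main.py."""
    return [
        "python", "main.py",
        "--model", model,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),  # multiple tasks are comma-separated
        "--device", device,
    ]


cmd = build_eval_command(
    model="hf-causal",
    model_args="pretrained=EleutherAI/gpt-j-6B",
    tasks=["hellaswag"],
    device="cuda:0",
)
# shlex.join quotes arguments where needed, so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The list can be passed directly to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues entirely.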
Additional arguments can be provided to the model constructor using the --model_args flag. Most notably, this supports the common practice of using the revisions feature on the Hub to store partially trained checkpoints, and it lets you specify the datatype for running a model:
python main.py \
    --model hf-causal \
    --model_args pretrained=EleutherAI/pythia-160m,revision=step100000,dtype="float" \
    --tasks lambada_openai,hellaswag \
    --device cuda:0
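To make the --model_args format concrete, here is a rough sketch of how a comma-separated list of key=value pairs could be turned into keyword arguments for a model constructor. It illustrates the format only and is not the harness's own parser; `parse_model_args` is a hypothetical name, and all values are left as strings.

```python
def parse_model_args(arg_string):
    """Split "key=value,key=value" into a dict of keyword arguments."""
    kwargs = {}
    for pair in arg_string.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty input or a trailing comma
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        kwargs[key] = value
    return kwargs


# The shell strips the quotes around "float" before the program sees them.
parsed = parse_model_args(
    "pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float"
)
print(parsed)
```

Because commas separate the pairs, a value that itself contains a comma cannot be expressed in this simple scheme.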
To evaluate models that are loaded via AutoModelForSeq2SeqLM in HuggingFace, use hf-seq2seq instead. To evaluate (causal) models across multiple GPUs, use --model hf-causal-experimental.
Warning: Choosing the wrong model type may produce erroneous outputs without raising an error.
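One way to reduce the risk of picking the wrong model type is to derive it from the model's architecture instead of typing it by hand. The sketch below is a heuristic, not part of the harness: `pick_model_type` is a hypothetical helper, and it assumes that the `is_encoder_decoder` flag on a transformers model config separates seq2seq models from causal ones.

```python
def pick_model_type(is_encoder_decoder):
    """Map a model's architecture family to a --model value."""
    # Encoder-decoder models are loaded via the seq2seq class;
    # decoder-only models are loaded via the causal LM class.
    return "hf-seq2seq" if is_encoder_decoder else "hf-causal"


# With transformers installed, the flag could be read from the Hub config
# (requires network access, so it is left as a comment here):
#   from transformers import AutoConfig
#   flag = AutoConfig.from_pretrained("EleutherAI/gpt-j-6B").is_encoder_decoder
print(pick_model_type(False))
print(pick_model_type(True))
```

It is still worth checking a few outputs by eye, since a heuristic like this cannot catch every mismatch.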
@software{eval-harness,
author = {Gao, Leo and
Tow, Jonathan and
Biderman, Stella and
Black, Sid and
DiPofi, Anthony and
Foster, Charles and
Golding, Laurence and
Hsu, Jeffrey and
McDonell, Kyle and
Muennighoff, Niklas and
Phang, Jason and
Reynolds, Laria and
Tang, Eric and
Thite, Anish and
Wang, Ben and
Wang, Kevin and
Zou, Andy},
title = {A framework for few-shot language model evaluation},
month = sep,
year = 2021,
publisher = {Zenodo},
version = {v0.0.1},
doi = {10.5281/zenodo.5371628},
url = {https://doi.org/10.5281/zenodo.5371628}
}