This repository contains code for running the MMLU (Massive Multitask Language Understanding) evaluation of large language models. It is re-implemented from scratch following the logic of the original repo, with the following improvements:
- Accelerated inference: multithreaded API calls.
- Enhanced stability: timeouts and retries for API calls (a minimal sketch of this pattern follows this list).
- Modularity: you can easily evaluate your custom LLM (see "Evaluate your custom model" below).
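The concurrency pattern is roughly the following. This is a minimal sketch, not the repository's actual implementation; `call_with_retries`, `predict_many`, and all parameter values are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retries(predict_function, prompt, retries=3):
    """Retry a flaky API call with exponential backoff."""
    for attempt in range(retries):
        try:
            return predict_function(prompt)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off before the next attempt


def predict_many(predict_function, prompts, max_workers=8):
    """Fan prompts out over a thread pool; results come back in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: call_with_retries(predict_function, p),
                             prompts))
```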
- Download the dataset here.
- Install the required dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Ensure you have set the necessary environment variables if you want to use OpenAI or Azure models, e.g. for Azure:
  ```bash
  export OPENAI_API_BASE=https://your-azure-endpoint.com
  export OPENAI_API_KEY=your-azure-key
  ```
- Run the evaluation code. The results are stored as `*.csv` files in the given result directory:
  ```bash
  python evaluate_azure.py --data_dir path-to-data --result_dir path-to-results --k_shot 0
  ```
To evaluate a custom LLM, simply use the following template and replace `predict_function` with your own `Callable`:
```python
from pathlib import Path

from mmlu.evaluation import predict_dataset, evaluate_results


def predict_function(prompt: str) -> str:
    # Dummy predictor that always answers 'A'; replace with your model call.
    return 'A'


if __name__ == '__main__':
    data_dir = Path('data')
    result_dir = Path('results')
    predict_dataset(data_dir=data_dir,
                    result_dir=result_dir,
                    predict_function=predict_function,
                    k_shot=0)
    evaluate_results(result_dir=result_dir)
```
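As a concrete (hypothetical) example, a `predict_function` backed by the OpenAI chat API could look like the sketch below. The model name and the answer-letter extraction are assumptions for illustration, not part of this repository:

```python
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_function(prompt: str) -> str:
    # Hypothetical predictor: ask the model, then keep only the answer letter.
    response = client.chat.completions.create(
        model='gpt-4o-mini',  # assumption: any chat model works here
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.0,
    )
    text = response.choices[0].message.content.strip()
    match = re.search(r'[ABCD]', text)
    return match.group(0) if match else 'A'  # fall back to a valid choice
```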
We will provide additional datasets (starting with German) that are translated via Azure and can be used ad hoc with the standard evaluation script: simply point it at the translated data.
A translated dataset is formatted in the same way as the original dataset but contains an additional file `subjects.json` that includes the translated prompt header and subjects:

```
data_de/
├── dev/
├── test/
└── subjects.json
```
For German, `subjects.json` looks like:

```json
{
    "header": "Im Folgenden finden Sie Multiple-Choice-Fragen (mit Antworten) zum Thema",
    "answer": "Antwort",
    "subjects": {
        "abstract_algebra": "abstrakte Algebra",
        "astronomy": "Astronomie",
        ...
    }
}
```
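For illustration, here is how the translated fields could be combined into a localized prompt header. This is a minimal sketch; the actual prompt construction in the evaluation code may differ:

```python
import json
from pathlib import Path

# Load the translated header, answer keyword, and subject names.
meta = json.loads(Path('data_de/subjects.json').read_text(encoding='utf-8'))

subject = 'abstract_algebra'
header = f"{meta['header']} {meta['subjects'][subject]}."
print(header)
# Im Folgenden finden Sie Multiple-Choice-Fragen (mit Antworten)
# zum Thema abstrakte Algebra.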
You can use the translation script that calls the Azure translation service:

```bash
export AZURE_ENDPOINT=your-azure-translation-endpoint
export AZURE_KEY=your-azure-key
export AZURE_REGION=your-azure-region

PYTHONPATH=. python mmlu/translate --data_dir data --target_dir /tmp/data_de --lang de
```
The translated data will be stored in `target_dir` in the format described above. Note that only `dev` and `test` data will be translated.
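Under the hood, such a script talks to the Azure Translator REST API. A minimal standalone call might look roughly like this; the endpoint, key, and region come from the variables above, and the batching and file handling of `mmlu/translate` are not shown:

```python
import os

import requests


def translate(texts, lang='de'):
    # Translate a batch of strings with the Azure Translator v3.0 REST API.
    url = f"{os.environ['AZURE_ENDPOINT']}/translate"
    response = requests.post(
        url,
        params={'api-version': '3.0', 'to': lang},
        headers={
            'Ocp-Apim-Subscription-Key': os.environ['AZURE_KEY'],
            'Ocp-Apim-Subscription-Region': os.environ['AZURE_REGION'],
        },
        json=[{'Text': t} for t in texts],
        timeout=30,
    )
    response.raise_for_status()
    return [item['translations'][0]['text'] for item in response.json()]


print(translate(['Answer'], lang='de'))  # ['Antwort']
```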