This repository contains code for running the MMLU (Massive Multitask Language Understanding) evaluation of large language models. It is re-implemented from scratch following the logic of the original repo, with the following improvements:
- Accelerated inference: multithreaded API calls.
- Enhanced stability: timeouts and retries for API calls (a minimal sketch of this pattern follows this list).
- Modularity: you can easily evaluate your custom LLM (see "Evaluate your custom model" below).
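The concurrency pattern is roughly the following. This is a minimal sketch, not the repository's actual implementation; `call_with_retries`, `predict_many`, and all parameter values are illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retries(predict_function, prompt, retries=3):
    """Retry a flaky API call with exponential backoff."""
    for attempt in range(retries):
        try:
            return predict_function(prompt)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off before the next attempt


def predict_many(predict_function, prompts, max_workers=8):
    """Fan prompts out over a thread pool; results come back in prompt order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: call_with_retries(predict_function, p),
                             prompts))
```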
- Download the dataset here.
- Install the required dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Ensure you have set the necessary environment variables if you want to use OpenAI or Azure models, e.g. for Azure:
  ```bash
  export OPENAI_API_BASE=https://your-azure-endpoint.com
  export OPENAI_API_KEY=your-azure-key
  ```
- Run the evaluation code. The results are stored as `*.csv` files in the given result directory:
  ```bash
  python evaluate_azure.py --data_dir path-to-data --result_dir path-to-results --k_shot 0
  ```
To evaluate a custom LLM, simply use the following template and replace `predict_function` with your own `Callable`:
```python
from pathlib import Path

from mmlu.evaluation import predict_dataset, evaluate_results


def predict_function(prompt: str) -> str:
    # Dummy predictor that always answers 'A'; replace with your model call.
    return 'A'


if __name__ == '__main__':
    data_dir = Path('data')
    result_dir = Path('results')
    predict_dataset(data_dir=data_dir,
                    result_dir=result_dir,
                    predict_function=predict_function,
                    k_shot=0)
    evaluate_results(result_dir=result_dir)
```
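As a concrete (hypothetical) example, a `predict_function` backed by the OpenAI chat API could look like the sketch below. The model name and the answer-letter extraction are assumptions for illustration, not part of this repository:

```python
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def predict_function(prompt: str) -> str:
    # Hypothetical predictor: ask the model, then keep only the answer letter.
    response = client.chat.completions.create(
        model='gpt-4o-mini',  # assumption: any chat model works here
        messages=[{'role': 'user', 'content': prompt}],
        temperature=0.0,
    )
    text = response.choices[0].message.content.strip()
    match = re.search(r'[ABCD]', text)
    return match.group(0) if match else 'A'  # fall back to a valid choice
```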
We will provide additional datasets (starting with German) that are translated via Azure and can be used ad hoc with the standard evaluation script: simply point it at the translated data.
A translated dataset is formatted in the same way as the original dataset but contains an additional file `subjects.json` that includes the translated prompt header and subjects:

```
data_de/
├── dev/
├── test/
└── subjects.json
```
For German, `subjects.json` looks like:

```json
{
    "header": "Im Folgenden finden Sie Multiple-Choice-Fragen (mit Antworten) zum Thema",
    "answer": "Antwort",
    "subjects": {
        "abstract_algebra": "abstrakte Algebra",
        "astronomy": "Astronomie",
        ...
    }
}
```
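For illustration, here is how the translated fields could be combined into a localized prompt header. This is a minimal sketch; the actual prompt construction in the evaluation code may differ:

```python
import json
from pathlib import Path

# Load the translated header, answer keyword, and subject names.
meta = json.loads(Path('data_de/subjects.json').read_text(encoding='utf-8'))

subject = 'abstract_algebra'
header = f"{meta['header']} {meta['subjects'][subject]}."
print(header)
# Im Folgenden finden Sie Multiple-Choice-Fragen (mit Antworten)
# zum Thema abstrakte Algebra.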
You can use the translation script that calls the Azure translation service:

```bash
export AZURE_ENDPOINT=your-azure-translation-endpoint
export AZURE_KEY=your-azure-key
export AZURE_REGION=your-azure-region

PYTHONPATH=. python mmlu/translate --data_dir data --target_dir /tmp/data_de --lang de
```
The translated data will be stored in `target_dir` in the format described above. Note that only `dev` and `test` data will be translated.
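Under the hood, such a script talks to the Azure Translator REST API. A minimal standalone call might look roughly like this; the endpoint, key, and region come from the variables above, and the batching and file handling of `mmlu/translate` are not shown:

```python
import os

import requests


def translate(texts, lang='de'):
    # Translate a batch of strings with the Azure Translator v3.0 REST API.
    url = f"{os.environ['AZURE_ENDPOINT']}/translate"
    response = requests.post(
        url,
        params={'api-version': '3.0', 'to': lang},
        headers={
            'Ocp-Apim-Subscription-Key': os.environ['AZURE_KEY'],
            'Ocp-Apim-Subscription-Region': os.environ['AZURE_REGION'],
        },
        json=[{'Text': t} for t in texts],
        timeout=30,
    )
    response.raise_for_status()
    return [item['translations'][0]['text'] for item in response.json()]


print(translate(['Answer'], lang='de'))  # ['Antwort']
```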