
MMLU Evaluation

This repository contains code for running the MMLU (Massive Multitask Language Understanding) evaluation of large language models. It is recoded from scratch following the logic of the original repo, with the following improvements:

  • Accelerated inference: Uses multithreaded API calls.
  • Enhanced stability: Added timeouts and retries for API calls.
  • Modularity: You can easily evaluate your custom LLM (see Evaluate your custom model).

Setup

  1. Download the dataset here

  2. Install the required dependencies:

pip install -r requirements.txt
  3. Set the necessary environment variables if you want to use OpenAI or Azure models, e.g. for Azure:

export OPENAI_API_BASE=https://your-azure-endpoint.com
export OPENAI_API_KEY=your-azure-key

Usage

Run the evaluation script. The results are stored as *.csv files in the given result directory.

python evaluate_azure.py --data_dir path-to-data --result_dir path-to-results --k_shot 0

Evaluate your custom model

To evaluate a custom LLM, simply use the following template and replace predict_function with your own callable:

from pathlib import Path
from mmlu.evaluation import predict_dataset, evaluate_results


def predict_function(prompt: str) -> str:
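    # Dummy predictor that always answers 'A'; replace with your own model call.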
    return 'A'


if __name__ == '__main__':
    data_dir = Path('data')
    result_dir = Path('results')
    predict_dataset(data_dir=data_dir,
                    result_dir=result_dir,
                    predict_function=predict_function,
                    k_shot=0)
    evaluate_results(result_dir=result_dir)
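
For example, a predict_function that queries a model behind an HTTP endpoint could look like the sketch below. This is only a sketch: the LLM_ENDPOINT variable and the JSON response schema are assumptions, so adapt them to your own model server.

import os

import requests


def predict_function(prompt: str) -> str:
    # Hypothetical model server: expects JSON {'prompt': ...} and
    # returns JSON {'answer': ...}. Adjust to your own API.
    response = requests.post(os.environ['LLM_ENDPOINT'],
                             json={'prompt': prompt},
                             timeout=30)
    response.raise_for_status()
    # The evaluation compares single answer letters such as 'A'.
    return response.json()['answer'].strip()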

Languages other than English

Evaluating on other languages

We will provide additional datasets (starting with German) that are translated via Azure and can be used ad hoc with the standard evaluation script - simply point it to the translated data.
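
For example, assuming the German dataset is stored in data_de:

python evaluate_azure.py --data_dir data_de --result_dir results_de --k_shot 0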

A translated dataset is formatted in the same way as the original dataset but contains an additional file subjects.json that includes the translated prompt header, answer keyword, and subject names:

data_de/
├── dev/
├── test/
└── subjects.json

For German, the subjects.json looks like:

{
  "header": "Im Folgenden finden Sie Multiple-Choice-Fragen (mit Antworten) zum Thema",
  "answer": "Antwort", 
  "subjects": {
    "abstract_algebra": "abstrakte Algebra", 
    "astronomy": "Astronomie",
    ...
  }
}
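
These fields slot into the standard MMLU prompt template. A rough sketch of how they might be combined is shown below; the exact assembly in mmlu.evaluation may differ, and format_question is a hypothetical name used only for illustration.

def format_question(header: str, answer_word: str, subject: str,
                    question: str, choices: list) -> str:
    # e.g. "Im Folgenden finden Sie Multiple-Choice-Fragen (mit
    # Antworten) zum Thema abstrakte Algebra."
    lines = [f'{header} {subject}.', '', question]
    for letter, choice in zip('ABCD', choices):
        lines.append(f'{letter}. {choice}')
    # e.g. "Antwort:" - the model is expected to continue with a letter.
    lines.append(f'{answer_word}:')
    return '\n'.join(lines)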

Translating the dataset to another language

You can use the translation script, which calls the Azure translation service:

export AZURE_ENDPOINT=your-azure-translation-endpoint
export AZURE_KEY=your-azure-key
export AZURE_REGION=your-azure-region
PYTHONPATH=. python mmlu/translate --data_dir data --target_dir /tmp/data_de --lang de

The translated data will be stored in target_dir in the format described above. Note that only dev and test data will be translated.
