This repo contains the code for SeaExam, a toolkit for evaluating large language models (LLMs) on Southeast Asian (SEA) languages, covering Chinese, English, Indonesian, Thai, and Vietnamese.
The evaluation data consists of the M3Exam dataset and a translated version of MMLU. For more information, refer to the Hugging Face dataset page.
Please also check out the SeaBench dataset here for more evaluation tasks on SEA languages.
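For a quick look at the evaluation data itself, it can be loaded with the datasets library. This is a minimal sketch: the dataset ID and config name below are assumptions, so verify them on the Hugging Face dataset page.

```python
# Minimal sketch for inspecting the evaluation data. The "SeaLLMs/SeaExam"
# ID and the "m3exam" config are assumptions -- check the Hugging Face
# dataset page for the exact ID, configs, and splits.
from datasets import load_dataset

dataset = load_dataset("SeaLLMs/SeaExam", "m3exam")
print(dataset)  # shows the available splits and number of examples
```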
git clone https://github.com/DAMO-NLP-SG/SeaExam.git
cd SeaExam
conda create -n SeaExam python=3.9
conda activate SeaExam
pip install -r requirements.txt
To quickly evaluate your model on SeaExam, simply run:
python scripts/main.py --model $model_name_or_path
For example:
python scripts/main.py --model SeaLLMs/SeaLLMs-v3-7B-Chat
Or run the provided script:
bash quick_run.sh
Our goal is to ensure a fair and consistent comparison across different LLMs while mitigating the risk of data contamination.
To ensure a fair comparison and reduce LLMs' dependence on any specific prompt template, we have designed several templates. If dynamic_template is set to True (the default), a template is randomly selected for each question. Users can also change the seed value to generate a different set of questions for evaluation.
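To illustrate the idea, here is a minimal sketch of seeded per-question template selection. It is not the repo's actual implementation: the template strings and the build_prompts helper are hypothetical, and the exact flag for setting the seed, if any, should be checked in scripts/main.py.

```python
# Sketch of the dynamic-template idea: with a fixed seed, each question is
# paired with a randomly chosen prompt template, so runs are reproducible
# while no single template is favored. Names and templates are illustrative
# assumptions, not the repo's actual code.
import random

TEMPLATES = [
    "Question: {question}\nOptions: {options}\nAnswer:",
    "{question}\n{options}\nThe correct answer is:",
    "Choose the best option.\n{question}\n{options}\nAnswer:",
]

def build_prompts(questions, seed=42, dynamic_template=True):
    rng = random.Random(seed)  # seeding makes the template assignment reproducible
    prompts = []
    for q in questions:
        template = rng.choice(TEMPLATES) if dynamic_template else TEMPLATES[0]
        prompts.append(template.format(question=q["question"], options=q["options"]))
    return prompts

# Example usage with a toy question:
# build_prompts([{"question": "2 + 2 = ?", "options": "A. 3  B. 4"}], seed=0)
```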