This repository contains the GameTraversalBenchmark (GTB), which is explained in the paper GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps.
Install from the repository:
```shell
git clone https://github.com/umair-nasir14/Game-Traversal-Benchmark.git GTB
cd GTB
```
If you want to replicate the experiments in the paper, or to evaluate your own LLM on the benchmark, install via `environment.yml`:
```shell
conda env create -f environment.yml
conda activate gtbench
```
If you only want to explore and fetch the data:
```shell
pip install -r requirements.txt
```
To get the data, simply:

```python
from gtb.data import get_data

benchmark_data = get_data()
for i, data in enumerate(benchmark_data[:1]):
    print(data["environment"])
```
A `save_data()` function is also provided to save your results.
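The exact signature of `save_data()` is not shown here; as a minimal sketch, results can equally be written out with the standard `json` module (the filename and result fields below are illustrative, not the benchmark's real schema):

```python
import json

# Hypothetical results collected while iterating over the benchmark rows.
results = [
    {"row_id": 0, "model_output": "RIGHT RIGHT DOWN", "solved": True},
]

# Write the results to disk; "my_results.json" is an illustrative filename.
with open("my_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Reload to confirm the round trip.
with open("my_results.json") as f:
    loaded = json.load(f)
```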
To replicate the results in the paper, create a `.env` file and add the API keys:

```
OPENAI_API_KEY="sk..."
GROQ_API_KEY= ...
```
and install the relevant client library, such as `pip install openai`.
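If you are not using a helper such as `python-dotenv`, the sketch below shows one minimal way to load such a `.env` file into the process environment (the key value here is a placeholder, not a real key):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; quotes stripped, comments skipped."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"').strip("'")

# Demo: write a throwaway .env file and load it.
with open("demo.env", "w") as f:
    f.write('OPENAI_API_KEY="sk-placeholder"\n')
load_env("demo.env")
print(os.environ["OPENAI_API_KEY"])  # sk-placeholder
```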
```python
from gtb.gtbench import run
from gtb.evaluations import compile_results

model_name = "o1-preview-2024-09-12"
experiment_name = run(model_name)
compile_results(model_name, experiment_name)
```
This will give you a model-specific result file named `{model}_{experiment_name}.json` containing results for each row of data, plus another file that combines all results into the `GTB_Score` and other aggregate scores.
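The exact schema of these output files is defined by the benchmark; as a sketch, assuming the per-row file is a JSON list of dictionaries, per-row scores could be aggregated like this (the filename and the `gtb_score` field are hypothetical):

```python
import json

# Hypothetical per-row results file; the real field names may differ.
rows = [
    {"row_id": 0, "gtb_score": 0.8},
    {"row_id": 1, "gtb_score": 0.4},
]
with open("model_experiment.json", "w") as f:
    json.dump(rows, f)

# Load the file back and compute a mean score across rows.
with open("model_experiment.json") as f:
    data = json.load(f)
mean_score = sum(r["gtb_score"] for r in data) / len(data)
print(round(mean_score, 2))  # 0.6
```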
Models tested in the paper:

- `o1-preview-2024-09-12`
- `o1-mini-2024-09-12`
- `gpt-3.5-turbo-0125`
- `gpt-4-0613`
- `gpt-4-turbo-2024-04-09`
- `gpt-4o-2024-05-13`
- `claude-3-opus-20240229`
- `claude-3-sonnet-20240229`
- `claude-3-haiku-20240307`
- `llama3-8b-8192`
- `llama3-70b-8192`
- `mixtral-8x7b-32768`
- `gemma-7b-it`
The numeric results for each model are reported in the paper:

| Model | GTBS (↑) | MGE (↓) | MPL (↓) | MAT (↓) | Top-0 Acc. (↑) | Top-1 Acc. (↑) | Top-5 Acc. (↑) |
|---|---|---|---|---|---|---|---|
| O1 | | | | | | | |
| O1-mini | | | | | | | |
| GPT-4-Turbo | | | | | | | |
| GPT-4o | | | | | | | |
| Claude-3-Opus | | | | | | | |
| Claude-3-Sonnet | | | | | | | |
| Random-FP | N/A | N/A | N/A | | | | |
| Gemma-7B | | | | | | | |
| GPT-3.5-Turbo | | | | | | | |
| LLaMa-3-8B | | | | | | | |
| LLaMa-3-70B | | | | | | | |
| Claude-3-Haiku | | | | | | | |
| Mixtral-8x7B | | | | | | | |
| Random-RP | N/A | | | | | | |
If you use GTB, please cite:

```bibtex
@article{nasir2024gametraversalbenchmark,
  title={GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps},
  author={Nasir, Muhammad Umair and James, Steven and Togelius, Julian},
  journal={arXiv preprint arXiv:2410.07765},
  year={2024}
}
```