- [2024.01.15] 📣 AgentBoard is released.
- [2024.03.11] 🥳 AgentBoard is accepted by LLMAgents @ ICLR 2024
AgentBoard emphasizes analytical evaluation for Large Language Models (LLMs) as generalist agents to perceive and act within various environments. It outlines four principles for constructing a benchmark to evaluate LLMs as generalist agents:
- Task Diversity: AgentBoard incorporates 9 distinct tasks to comprehensively assess the generalist ability of LLM agents, which builds on LLMs' extensive knowledge base and strong scenario comprehension.
- Multi-round Interaction: AgentBoard provides multi-round interaction between agents and the environment, which is necessary to reflect the evolutionary nature of human intelligence: it continuously receives information and adapts to the environment.
- Partially-Observable Environments: In AgentBoard, the complete state of the environment is not available to the agent, which assesses agent world modeling ability as additional knowledge needs to be acquired through online exploration.
- Analytical Evaluation: AgentBoard is a systematic evaluation platform: it includes a user-friendly script to construct goal-oriented reflex agents for a range of models, and features a panel for visualizing and interpreting results across multiple dimensions of agent proficiency, including fine-grained progress rates, grounding accuracy, performance breakdown for hard and easy examples, long-range interactions, detailed performance across various sub-skills, and trajectories with friendly visualization.
Here we provide a quick start guide to evaluate LLM agents on AgentBoard within 30 minutes.
We provide both local setup (recommended) and docker as follows:
Local setup procedures (~15 minutes):
Setup with a setup.sh:
Step 1. Create a conda environment
conda create -n ${YOUR_ENV_NAME} python=3.8.13 # python version should be 3.8.13
conda activate ${YOUR_ENV_NAME}
Step 2. Git clone this repo
git clone https://github.com/hkust-nlp/AgentBoard.git
Step 3. Download the data from huggingface
# Download the data and move it to the project root dir
cd AgentBoard
mkdir data
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
tar -zxvf data.tar.gz
Step 4. Set up the environment for tasks except WebArena
INSTALL_WEBARENA=false bash ./setup.sh
# After running the above command, the env will support all tasks other than WebArena
Step 5. Set up the environment for WebArena
# Please check whether dbus and Xvfb are installed before building it
# For Ubuntu or Debian
dpkg -l | grep dbus # will return the info
systemctl status dbus # will return the status(active (running))
dpkg -l | grep xvfb # will return the info
#-----------------------------------------------------------------------#
# For CentOS
yum list installed | grep Xvfb # will return the Xvfb info
systemctl status dbus # will return the status(active (running))
dnf list installed | grep dbus # will return the dbus info
If so, you may install the webarena environment directly.
INSTALL_WEBARENA=true bash ./setup.sh
If not, please jump to Step 6 or Installation by Docker
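The checks in Step 5 can also be approximated from Python. The sketch below is illustrative only and is not part of AgentBoard: the helper name `locate_tools` is hypothetical, and it assumes the executables are named `dbus-daemon` and `Xvfb`. It only looks for the executables on `PATH`; it does not tell you whether the dbus service is running.

```python
import shutil

# Executable names assumed for this sketch: `Xvfb` for the virtual display
# server and `dbus-daemon` for the dbus service.
REQUIRED_TOOLS = ["dbus-daemon", "Xvfb"]


def locate_tools(names, which=shutil.which):
    """Map each executable name to its path on PATH, or None if not found."""
    return {name: which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools(REQUIRED_TOOLS).items():
        print(f"{name}: {path or 'not found'}")
```

If either entry is reported as not found, continue with Step 6 or the Docker installation.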
(Additional) Step 6. Install dbus and Xvfb
# The following commands require sudo permission:
# For Ubuntu or Debian
# Install and start the dbus service
apt-get install dbus
/etc/init.d/dbus start
# Install and start Xvfb
sudo apt-get update
sudo apt-get install xvfb
INSTALL_WEBARENA=true bash ./setup.sh
#--------------------------------------------------------#
# For CentOS
# Install and start the dbus service
yum install -y dbus-x11
/etc/init.d/dbus start
# Install and start Xvfb
yum update
yum install -y Xvfb
INSTALL_WEBARENA=true bash ./setup.sh
Docker setup procedures (~12G, 5 minutes):
Docker info: CentOS
Step 1. Pull the docker image and run docker locally
docker pull zzh202121/agentboard:0117
docker run -itd \
--gpus all \
--network host \
--name agent_space \
--shm-size 64gb \
-v /MODEL_PATH:/model_download \
-v /DATA_PATH:/data \
zzh202121/agentboard:0117 \
/bin/bash
docker attach agent_space # YOUR_CONTAINER_NAME
Step 2. Activate the env
conda activate agentboard
Step 3. Download the code and data
git clone https://github.com/hkust-nlp/AgentBoard.git # clone repo
# Download the data and move it to the project root dir
cd AgentBoard
mkdir data
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
tar -zxvf data.tar.gz
Step 4. Build the search engine index (for WebShop)
cd ./agentboard/environment/WebShop/search_engine
mkdir -p resources resources_100 resources_1k resources_100k
python convert_product_file_format.py # convert items.json => required doc format
mkdir -p indexes
./run_indexing.sh
cd ../../../
Step 5. Start the web service (for WebArena)
/etc/init.d/dbus start # start dbus
Xvfb :99 -screen 0 1280x720x24 & # start xvfb display
export DISPLAY=:99
python -m playwright install
Environment Variables needed for AgentBoard include:
PROJECT_PATH = {path to project}/AgentBoard
ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...
TODO_KEY=...
MOVIE_KEY=...
SHEET_EMAIL=...
WANDB_API_KEY=...
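Before launching an evaluation, it can help to confirm that the variables above are actually set. The sketch below is illustrative and not part of AgentBoard: the helper name `missing_env_vars` is hypothetical, and treating every listed variable as required is an assumption, since the keys you need depend on the tasks and models you run.

```python
import os

# Variable names taken from the list above; which ones you need depends on
# the tasks and models being evaluated (an assumption of this sketch).
AGENTBOARD_VARS = [
    "PROJECT_PATH",
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "TODO_KEY",
    "MOVIE_KEY",
    "SHEET_EMAIL",
    "WANDB_API_KEY",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(AGENTBOARD_VARS)
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All listed environment variables are set.")
```

Passing a shorter list, for example only `OPENAI_API_KEY` and `PROJECT_PATH`, restricts the check to the variables a given run needs.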
Variables 1: API keys for Tool tasks
Since API keys for Tool tasks are private, we do not provide them in this repo.
Please follow this detailed guide to get API keys for Tool tasks.
Variables 2: Weights & Biases key for AgentBoard online visualization
Please paste the WANDB_API_KEY from this guide into the .env file to log in to Weights & Biases for AgentBoard visualization.
Variables 3: API keys for Proprietary models
If you use OpenAI models, please put your API keys in the .env file.
OPENAI_API_TYPE="open_ai"
OPENAI_API_KEY=${YOUR_OPENAI_API_KEY}
If you use Anthropic models, please put your API keys in the .env file.
ANTHROPIC_API_KEY=${YOUR_ANTHROPIC_API_KEY}
Example script for GPT-3.5-Turbo:
python agentboard/eval_main.py \
--cfg-path eval_configs/main_results_all_tasks.yaml \
--tasks alfworld \
--model gpt-3.5-turbo-0613 \
--wandb \
--log_path ./results/gpt-3.5-turbo-0613 \
--project_name evaluate-gpt-35-turbo-0613 \
--baseline_dir ./data/baseline_results
We now offer configurations for 12 SOTA LLMs (gpt-4, gpt-3.5-turbo-0613, text-davinci-003, claude2, deepseek-67b, lemur-70b, mistral-7b, codellama-13b(34b), llama2-13b(70b), vicuna-13b-16k) and a simple reflex agent based on act-only prompting. You can also customize your own agents and LLMs. Models supported by vLLM should generally be supported in AgentBoard, although different models may require specific prompt templates.
AgentBoard integrates illustrative Weights & Biases visualization to help researchers systematically analyze LLM agents. Simply turn on the --wandb switch in the arguments and customize the project_name and baseline_dir of your wandb project, as in the evaluation command above.
Before running, you need to set up the wandb login or environment variable as instructed in the quick start. The visualization results are also stored offline at \wandb. Normally, after executing the evaluation command, you can view the live AgentBoard panel online at https://wandb.ai/{your_wandb_id}/{project_name}. We provide example WandB logging pages for GPT-4, GPT-3.5-Turbo, and DeepSeek-67b.
Note that if your run is not logged online (e.g., on a cluster without internet access), you can later sync local runs to wandb with wandb sync [OPTIONS] [PATH].. as detailed in the wandb docs. For more information about the features of the AgentBoard panel, please check this blog.
In addition to online results viewing, local logs are automatically stored in {log_path}. In WebArena, we additionally support more detailed trajectory files, including web page screenshots and network traffic records.
Log file organization:
{log_path}
├── logs # detailed example-wise logs for each task
│ ├── webarena_tracks # WebArena provided rendered HTML files of the execution trace and a './trace' folder which is automatically generated with Playwright
│ │ ├── traces
│ │ │ ├── 102.zip
│ │ ├── render_102.html
│ │ ├── ...
│ ├── alfworld.jsonl # each line is a json dictionary logging the statistics, trajectory, and prompt for each example
│ ├── babyai.jsonl
│ ├── ...
├── all_results.txt # overall metrics for each task
├── dimension.txt # agent capability dimensional scores for current LLM agent
├── alfworld.txt # a general log of example-wise statistics for each task
├── babyai.txt
└── ...
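Because each line of a task's `.jsonl` log is a standalone JSON dictionary, the logs can be inspected with a few lines of Python. This is a minimal sketch rather than an AgentBoard utility: the helper names `load_jsonl` and `summarize_keys` are hypothetical, and no particular field names are assumed, since the exact keys are not listed here.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a .jsonl file into a list of dictionaries, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize_keys(records):
    """Return the sorted set of top-level keys seen across all records."""
    keys = set()
    for record in records:
        keys.update(record)
    return sorted(keys)
```

For example, calling `load_jsonl` on `{log_path}/logs/alfworld.jsonl` and passing the result to `summarize_keys` shows which fields were logged for each example.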
AgentBoard is composed of 9 diverse tasks which can be divided into 4 types, including Embodied AI, Game, Web, and Tool:
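Embodied AI | Game | Web | Tool |
---|---|---|---|
AlfWorld, ScienceWorld, BabyAI | Jericho, PDDL | WebShop, WebArena | Tool-Query, Tool-Operation |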
To help researchers quickly understand the evaluation data of each task, we provide a Dataset Viewer on Hugging Face: 🤗 AgentBoard.
Note: Please download the dataset from the link provided below, because the data in the Dataset Viewer is not complete.
You can download the whole evaluation data by running the following command:
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
Please uncompress the file and move the data to AgentBoard/data.
cd AgentBoard
mkdir data
tar -zxvf data.tar.gz
The file structure of evaluation data is as follows:
data
├── baseline_results
├── alfworld
│ ├── alfred.pddl # additional data for alfworld
│ ├── alfred.twl2 # additional data for alfworld
│ ├── json_2.1.1 # additional data for alfworld
│ └── test.jsonl
├── babyai
│ └── test.jsonl
├── jericho
│ ├── test.jsonl
│ └── z-machine-games-master # additional data for jericho
├── pddl
│ └── test.jsonl
├── scienceworld
│ └── test.jsonl
├── tool-operation
│ └── test.jsonl
├── tool-query
│ ├── academia # additional data for academia tool
│ └── test.jsonl
├── webarena
│ └── test.jsonl
└── webshop
└── test.jsonl
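After extracting the archive, a quick check that every task folder contains its `test.jsonl` can catch an incomplete download early. The sketch below is illustrative: the helper name `missing_test_files` is hypothetical, and the task folder names are copied from the file structure above.

```python
from pathlib import Path

# Task folders taken from the file structure above.
TASK_DIRS = [
    "alfworld", "babyai", "jericho", "pddl", "scienceworld",
    "tool-operation", "tool-query", "webarena", "webshop",
]


def missing_test_files(data_root):
    """Return the task names whose <data_root>/<task>/test.jsonl is absent."""
    root = Path(data_root)
    return [
        task for task in TASK_DIRS
        if not (root / task / "test.jsonl").is_file()
    ]
```

Calling `missing_test_files("data")` from the project root returns an empty list when all nine test files are present.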
We also provide baseline run logs in data/baseline_results, which can be used for visualization in our panel.
For regions with Internet restrictions, to evaluate the Tool-Query, Tool-Operation and WebArena tasks, please make sure that the machine can access the Internet.
You can check whether you have network issues by observing the output during the execution process.
We provide two ways to install the environment of AgentBoard, as specified in QuickStart.
In this section, we provide a script to evaluate the closed-source models on each task.
Please do not forget to set the environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY) before running the following commands.
We provide a quick start script to evaluate the gpt-3.5-turbo-0613 model on the alfworld task.
python agentboard/eval_main.py \
--cfg-path eval_configs/main_results_all_tasks.yaml \
--tasks alfworld \
--model gpt-3.5-turbo-0613 \
--wandb \
--log_path ./results/gpt-3.5-turbo-0613 \
--project_name evaluate-gpt-35-turbo-0613 \
--baseline_dir ./data/baseline_results
Parameters:
- --cfg-path: The path of the config file; please refer to eval_configs/main_results_all_tasks.yaml for more details.
- --tasks: The tasks to be evaluated, e.g. tool-query, tool-operation, webarena, alfworld, babyai, jericho, pddl, scienceworld.
- --model: The LLM to be evaluated. We provide some LLM models, including: gpt-3.5-turbo, gpt-3.5-turbo-16k, gpt-4, text-davinci-003, claude2.
- --wandb: Online visualization will be launched given this parameter. Remove this parameter from the script if you don't need visualization, e.g. during debugging.
- --log_path: Path to save logs, organized as shown in the log file organization above.
- --project_name: Project name for Weights & Biases. This parameter is not necessary when the --wandb parameter is not used.
- --baseline_dir: Directory containing the result files of the baseline models you want to compare with during the run.
First, please start the WebShop server by running the following commands:
cd ./agentboard/environment/WebShop
bash ./run_dev.sh
cd ../../..
Then, run the following command to evaluate the gpt-3.5-turbo-0613 model on the webshop task.
python agentboard/eval_main.py \
--cfg-path eval_configs/main_results_all_tasks.yaml \
--tasks webshop \
--model gpt-3.5-turbo-0613 \
--wandb \
--log_path ./results/gpt-3.5-turbo-0613 \
--project_name evaluate-gpt-35-turbo-0613 \
--baseline_dir ./data/baseline_results
In AgentBoard, we support the following 8 open-source models out of the box. By default, we use vLLM to speed up inference.
llama2-13b
llama2-70b
codellama-13b
codellama-34b
vicuna-13b-16k
lemur-70b
deepseek-67b
mistral-7b
Please refer to eval_configs/main_results_all_tasks.yaml for more details about these models.
To evaluate these models, you can run the following command:
python agentboard/eval_main.py \
--cfg-path eval_configs/main_results_all_tasks.yaml \
--tasks ${TASK_NAME} \
--model ${OPEN_SOURCE_MODEL_NAME}
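To run the same model over several tasks, the command above can be generated once per task. The sketch below is illustrative and not part of AgentBoard: the helper name `build_eval_command` is hypothetical, it uses only the flags shown above, and it issues one run per task because passing several tasks in a single `--tasks` argument is not documented here.

```python
import shlex


def build_eval_command(task, model,
                       cfg_path="eval_configs/main_results_all_tasks.yaml"):
    """Build the argv list for one evaluation run, using only flags shown above."""
    return [
        "python", "agentboard/eval_main.py",
        "--cfg-path", cfg_path,
        "--tasks", task,
        "--model", model,
    ]


if __name__ == "__main__":
    for task in ["alfworld", "babyai"]:
        # Print each command; pass the list to subprocess.run(cmd, check=True)
        # from the project root to actually execute it.
        print(shlex.join(build_eval_command(task, "mistral-7b")))
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.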
We also support LLM customization; please refer to LLM Customization for more details.
Please refer to llm_customization.md for more details about LLM customization.
Please refer to agent_customization.md for more details about agent customization.
The evaluation runtime for a language model depends on the device/API, model, and inference architecture used. In the case of open-source LLMs, the vllm inference speed is approximately 10 times faster than the huggingface pipeline.
To estimate the total time needed for evaluation, you can run a few steps to measure the inference speed and multiply it by the total number of LLM inferences, which is at most 15,000 rounds.
As a rule of thumb, the total time is roughly 4h * speed, where speed is measured in seconds per round (15,000 rounds at 1s/round is about 4.2h). Here are some examples of our runtime:
Model | Device/API | Inference Architecture | Inference Speed | Total-time |
---|---|---|---|---|
GPT-4 | Azure API | - | 1.5s/round | 5.5h |
GPT-3.5-Turbo | Azure API | - | 1s/round | 3h |
DeepSeek-67b | 8*V100 | vllm | 5s/round | 18.5h |
Llama2-70b | 8*V100 | vllm | 8s/round | 28h |
Llama2-70b | 4*A100 | vllm | 4s/round | 13.5h |
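The rule of thumb above can be written as a small helper. This is a sketch under the stated assumption that a full evaluation needs at most 15,000 LLM inferences, so it gives an upper bound rather than a prediction; the function name `estimate_hours` is hypothetical.

```python
MAX_ROUNDS = 15_000  # upper bound on LLM inferences stated above


def estimate_hours(seconds_per_round, rounds=MAX_ROUNDS):
    """Upper-bound estimate of total evaluation time in hours."""
    return seconds_per_round * rounds / 3600


if __name__ == "__main__":
    for speed in (1.0, 1.5, 5.0):
        print(f"{speed}s/round -> at most {estimate_hours(speed):.1f}h")
```

For instance, `estimate_hours(1.5)` returns 6.25, slightly above the 5.5h measured for GPT-4 in the table, which is expected because actual runs use fewer than 15,000 rounds.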
If you find this repository useful, please consider giving it a star and citing our paper:
@misc{ma2024agentboard,
title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents},
author={Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He},
year={2024},
eprint={2401.13178},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
The AgentBoard codebase is licensed under the Apache-2.0 License.
The AgentBoard dataset is licensed under a GNU General Public License, version 2.