Verify that you have installed Docker and can run docker commands without sudo.
docker --version
docker image list
docker ps
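If the commands above only work with sudo, a standard fix on Linux is to add your user to the docker group and start a new login session:
sudo usermod -aG docker $USER
newgrp docker  # or log out and back in for the group change to take effect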
Simply pull the Docker images:
# For Held-in tasks
docker pull learningrate/agentbench-alfworld
docker pull learningrate/agentbench-webshop
docker pull learningrate/agentbench-mind2web
# For Held-out task
docker pull learningrate/agentbench-card_game
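To confirm the pulls succeeded, list the local images (the grep filter is just a convenience):
docker image list | grep agentbench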
First, install the global requirements:
pip install -r requirements.txt
For the OS Interaction task
Install the requirements and build the local images (about 5–10 minutes):
pip install -r src/tasks/os_interaction/requirements.txt
python src/tasks/os_interaction/images.py build -c configs/tasks/os_interaction/std.yaml -r .
Run the following command to test the OS task:
python evaluate.py \
--task configs/tasks/os_interaction/std.yaml \
--agent configs/agents/do_nothing.yaml \
--workers 30
For the DB task
Install Docker, prepare the MySQL image, and make sure you have already installed the global requirements:
pip install -r src/tasks/dbbench/requirements.txt
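For the MySQL image, pulling the official image from Docker Hub should be sufficient (whether the task pins a specific tag is an assumption; check configs/tasks/dbbench/std.yaml if in doubt):
docker pull mysql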
Run the following command to test the DB task (to avoid Docker issues, we do not recommend running with too many workers):
python evaluate.py \
--task configs/tasks/dbbench/std.yaml \
--agent configs/agents/do_nothing.yaml \
--workers 5
For the KG task
Follow the Freebase Setup guide to start your own Virtuoso server, then replace sparql_url in the config files with the URL of your own server. (Caveat: you may try the default sparql_url without changing anything, but our Virtuoso server is not guaranteed to always be active.)
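As a sketch, assuming your Virtuoso server is reachable at the default endpoint http://localhost:8890/sparql and that sparql_url appears as a plain key in the task config, you could update it in place:
sed -i 's|sparql_url:.*|sparql_url: "http://localhost:8890/sparql"|' configs/tasks/knowledgegraph/std.yaml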
Install the necessary Python packages:
pip install -r src/tasks/knowledgegraph/requirements.txt
Run the following command to test the KG task:
python evaluate.py \
--task configs/tasks/knowledgegraph/std.yaml \
--agent configs/agents/do_nothing.yaml \
--workers 30
When setting up TGI (Text Generation Inference), you can add more ports in configs/agents/tgi_clients/AgentLM-{7b,13b,70b}.yaml for faster evaluation.
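For reference, a single TGI instance serving one of those ports can be started with the official container; the port mapping and model id below are assumptions to adapt to your setup:
docker run --gpus all -p 8001:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id THUDM/agentlm-7b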
Run the bash scripts to evaluate AgentLM-{7b,13b,70b}:
bash eval/AgentLM-7b-eval-all.sh
bash eval/AgentLM-13b-eval-all.sh
bash eval/AgentLM-70b-eval-all.sh
After evaluation, the results of each task will be stored in outputs/AgentLM-{7b,13b,70b}/{timestamp}/{task}/results.json.
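To quickly inspect every result file from a run, something like the following works (the model directory is an example; python -m json.tool just pretty-prints the JSON):
find outputs/AgentLM-7b -name results.json -exec python -m json.tool {} \;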