Setup | Assistant | Launch Experiments | Analyse Results | Leaderboard | Build Your Agent | Reproducibility
AgentLab is a framework for developing and evaluating agents on a variety of benchmarks supported by BrowserGym.
AgentLab Features:
- Easy large-scale parallel agent experiments using Ray
- Building blocks for making agents over BrowserGym
- Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted using TGI.
- Preferred way for running benchmarks like WebArena
- Various reproducibility features
- Unified leaderboard
Benchmark | Setup Link | # Task Template | Seed Diversity | Max Step | Multi-tab | Hosted Method | BrowserGym Leaderboard |
---|---|---|---|---|---|---|---|
WebArena | setup | 812 | None | 30 | yes | self hosted (docker) | soon |
WorkArena L1 | setup | 33 | High | 30 | no | demo instance | soon |
WorkArena L2 | setup | 341 | High | 50 | no | demo instance | soon |
WorkArena L3 | setup | 341 | High | 50 | no | demo instance | soon |
WebLinx | - | 31586 | None | 1 | no | self hosted (dataset) | soon |
VisualWebArena | setup | 910 | None | 30 | yes | self hosted (docker) | soon |
AssistantBench | setup | 214 | None | 30 | yes | live web | soon |
GAIA (soon) | - | - | None | - | - | live web | soon |
Mind2Web-live (soon) | - | - | None | - | - | live web | soon |
MiniWoB | setup | 125 | Medium | 10 | no | self hosted (static files) | soon |
pip install agentlab
If not done already, install Playwright:
playwright install
Make sure to prepare the required benchmark according to the instructions provided in the setup column.
export AGENTLAB_EXP_ROOT=<root directory of experiment results> # defaults to $HOME/agentlab_results
export OPENAI_API_KEY=<your openai api key> # if openai models are used
Setup OpenRouter API
export OPENROUTER_API_KEY=<your openrouter api key> # if openrouter models are used
Setup Azure API
export AZURE_OPENAI_API_KEY=<your azure api key> # if using azure models
export AZURE_OPENAI_ENDPOINT=<your endpoint> # if using azure models
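Before launching a study, it can help to confirm that the variables above are actually visible to Python. The following is a minimal sketch that uses only the variable names listed in this section; the `missing_env_vars` helper and the provider grouping are hypothetical and not part of AgentLab.

```python
import os

# Variable names taken from the setup instructions above, grouped by provider.
# The grouping itself is an assumption made for this sketch.
REQUIRED_BY_PROVIDER = {
    "openai": ["OPENAI_API_KEY"],
    "openrouter": ["OPENROUTER_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
}


def missing_env_vars(provider, environ=None):
    """Return the variables required for `provider` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_BY_PROVIDER[provider] if not environ.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_BY_PROVIDER:
        missing = missing_env_vars(provider)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{provider}: {status}")
```

Only the providers you actually use need to report "ok".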
Use an assistant to work for you (at your own cost and risk).
agentlab-assistant --start_url https://www.google.com
Try your own agent:
agentlab-assistant --agent_config="module.path.to.your.AgentArgs"
# Import your agent configuration extending bgym.AgentArgs class
# Make sure this object is imported from a module accessible in PYTHONPATH to properly unpickle
from agentlab.agents.generic_agent import AGENT_4o_MINI
from agentlab.experiments.study import make_study
study = make_study(
    benchmark="miniwob",  # or "webarena", "workarena_l1" ...
    agent_args=[AGENT_4o_MINI],
    comment="My first study",
)
study.run(n_jobs=5)
Relaunching incomplete or errored tasks
from agentlab.experiments.study import Study
study = Study.load("/path/to/your/study/dir")
study.find_incomplete(include_errors=True)
study.run()
See main.py to launch experiments with a variety of options. This is like a lazy CLI that is actually more convenient. Just comment and uncomment the lines you need or modify at will (but don't push to the repo).
The complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang. This disables workers until the study is terminated and relaunched. If you are running jobs sequentially or with a small number of workers, this could halt your entire study until you manually kill and relaunch it. In the Ray parallel backend, we've implemented a system to automatically terminate jobs exceeding a specified timeout. This feature is particularly useful when task hanging limits your experiments.
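To illustrate the idea of terminating a job that exceeds a timeout, here is a minimal standard-library sketch. It is not AgentLab's Ray-based mechanism; the `run_with_timeout` helper is hypothetical and simply runs a piece of Python code in a child process that is killed when the deadline passes.

```python
import subprocess
import sys


def run_with_timeout(python_code, timeout_s):
    """Run `python_code` in a child process and kill it if it exceeds `timeout_s` seconds."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", python_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising,
        # so the calling worker is free to pick up the next job.
        return "timeout"
    return "ok" if completed.returncode == 0 else "error"


if __name__ == "__main__":
    print(run_with_timeout("import time; time.sleep(30)", timeout_s=1))
    print(run_with_timeout("print('finished')", timeout_s=30))
```

Running each job in its own process is what makes a hard kill possible; a hung thread inside the same process cannot be terminated in the same way.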
For debugging, run experiments with n_jobs=1 and use VSCode's debug mode. This allows you to pause execution at breakpoints.
Running one agent on one task corresponds to a single job. Conducting ablation studies or random searches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. Efficient parallel execution is therefore critical. Agents typically wait for responses from the LLM server or updates from the web server. As a result, you can run 10 to 50 jobs in parallel on a single computer, depending on available RAM.
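The job count grows multiplicatively, which a short calculation makes concrete. In this sketch the 125 task templates come from the MiniWoB row of the benchmark table, while the number of agent configurations and seeds are illustrative assumptions; both helper functions are hypothetical.

```python
def count_jobs(n_agents, n_tasks, n_seeds):
    """One job is one agent configuration on one task with one seed."""
    return n_agents * n_tasks * n_seeds


def count_waves(total_jobs, n_parallel):
    """Number of sequential batches needed when `n_parallel` jobs run at once (ceiling division)."""
    return -(-total_jobs // n_parallel)


if __name__ == "__main__":
    # Assumed ablation: 5 agent configurations, 125 MiniWoB task templates, 20 seeds each.
    total = count_jobs(n_agents=5, n_tasks=125, n_seeds=20)
    print(total)                    # 12500
    print(count_waves(total, 10))   # 1250
    print(count_waves(total, 50))   # 250
```

Moving from 10 to 50 parallel jobs cuts the number of sequential batches by a factor of five in this example, which is why parallel execution matters at this scale.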
The make_study function returns a SequentialStudies object to ensure proper sequential evaluation of each agent. AgentLab currently does not support evaluations across multiple instances, but you could either create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel experience, consider using benchmarks like WorkArena instead.
The class ExpResult provides a lazy loader for all the information of a specific experiment. You can use yield_all_exp_results to recursively find all results in a directory. Finally, load_result_df gathers all the summary information in a single dataframe. See inspect_results.ipynb for example usage.
import bgym
from agentlab.analyze import inspect_results
# load the summary of all experiments of the study in a dataframe
result_df = inspect_results.load_result_df("path/to/your/study")
# load the detailed results of the 1st experiment
exp_result = bgym.ExpResult(result_df["exp_dir"][0])
step_0_screenshot = exp_result.screenshots[0]
step_0_action = exp_result.steps_info[0].action
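The lazy-loading and recursive-discovery pattern described above can be sketched without AgentLab. Everything in this example is a hypothetical stand-in: the `LazyResult` class, the `yield_results` function, and the `summary.json` file name are assumptions for illustration and do not reflect the actual layout of an experiment directory.

```python
import json
import tempfile
from functools import cached_property
from pathlib import Path


class LazyResult:
    """Hypothetical stand-in for a result object that reads from disk only on first access."""

    def __init__(self, exp_dir):
        self.exp_dir = Path(exp_dir)
        self.n_reads = 0  # counts disk reads, to show that caching works

    @cached_property
    def summary(self):
        self.n_reads += 1
        return json.loads((self.exp_dir / "summary.json").read_text())


def yield_results(root):
    """Recursively yield a LazyResult for every directory that contains summary.json."""
    for path in sorted(Path(root).rglob("summary.json")):
        yield LazyResult(path.parent)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for i in range(3):
            exp_dir = Path(root) / f"exp_{i}"
            exp_dir.mkdir()
            (exp_dir / "summary.json").write_text(json.dumps({"reward": i}))
        rewards = [result.summary["reward"] for result in yield_results(root)]
        print(rewards)
```

Because nothing is read until an attribute is accessed, iterating over thousands of experiment directories stays cheap when only a few of them are inspected in detail.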
In a terminal, execute:
agentlab-xray
You can load previous or ongoing experiments in the directory AGENTLAB_EXP_ROOT and visualize the results in a Gradio interface.
In the following order, select:
- The experiment you want to visualize
- The agent if there is more than one
- The task
- And the seed
Once this is selected, you can see the trace of your agent on the given task. Click on the profiling image to select a step and observe the action taken by the agent.
Official unified leaderboard across all benchmarks.
Experiments are on their way for more reference points using GenericAgent. We are also working on code to automatically push a study to the leaderboard.
Get inspiration from the MostBasicAgent in agentlab/agents/most_basic_agent/most_basic_agent.py. For a better integration with the tools, make sure to implement most functions in the AgentArgs API and the extended bgym.AbstractAgentArgs.
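The general shape of an agent and its configuration object can be sketched without any dependencies. The classes below are hypothetical stand-ins, and the method names `make_agent` and `get_action`, the observation key `goal`, and the placeholder action string are assumptions for illustration; check most_basic_agent.py and the bgym interfaces for the actual signatures before relying on them.

```python
from dataclasses import dataclass


class EchoAgent:
    """Hypothetical agent that maps an observation dict to an (action, info) pair."""

    def __init__(self, default_action):
        self.default_action = default_action

    def get_action(self, obs):
        # A real agent would build a prompt from `obs` and query an LLM here.
        goal = obs.get("goal", "")
        info = {"think": f"No policy yet for goal: {goal!r}"}
        return self.default_action, info


@dataclass
class EchoAgentArgs:
    """Hypothetical configuration object: plain data that knows how to build its agent."""

    agent_name: str = "EchoAgent"
    default_action: str = "noop()"  # placeholder action string, an assumption

    def make_agent(self):
        return EchoAgent(default_action=self.default_action)


if __name__ == "__main__":
    agent = EchoAgentArgs().make_agent()
    action, info = agent.get_action({"goal": "Click the submit button"})
    print(action, info["think"])
```

Keeping the configuration as plain data, separate from the agent instance, is what allows it to be stored with a study and rebuilt later.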
If you think your agent should be included directly in AgentLab, let us know and it can be added in agentlab/agents/ with the name of your agent.
Several factors can influence reproducibility of results in the context of evaluating agents on dynamic benchmarks.
- Software version: Different versions of Playwright or any package in the software stack could influence the behavior of the benchmark or the agent.
- API-based LLMs silently changing: Even for a fixed version, an LLM may be updated e.g. to incorporate the latest web knowledge.
- Live websites:
  - WorkArena: The demo instance is mostly fixed in time to a specific version, but ServiceNow sometimes pushes minor modifications.
  - AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may change depending on the country or region; for example, some websites might be in a different language by default.
- Stochastic agents: Setting the temperature of the LLM to 0 can remove most of the stochasticity.
- Non-deterministic tasks: For a fixed seed, the changes should be minimal.
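The software-version factor in the list above can be tracked by recording package versions alongside each run. This is a minimal sketch using only the standard library; the `snapshot_versions` helper is hypothetical and is not how the Study class collects its own reproducibility information.

```python
import platform
from importlib import metadata


def snapshot_versions(packages):
    """Return the Python version, platform string, and installed version of each package."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed in this environment
    return info


if __name__ == "__main__":
    # The package names are examples; list whatever your stack depends on.
    for key, value in snapshot_versions(["agentlab", "playwright"]).items():
        print(f"{key}: {value}")
```

Saving this dictionary next to the experiment results makes it possible to compare two runs and spot a dependency that changed between them.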
- Study contains a dict of information about reproducibility, including benchmark version, package version and commit hash.
- The Study class allows automatic upload of your results to reproducibility_journal.csv. This makes it easier to populate a large number of reference points. For this feature, you need to git clone the repository and install it in editable mode (pip install -e .).
- Reproduced results in the leaderboard. For agents that are reproducible, we encourage users to try to reproduce the results and upload them to the leaderboard. There is a special column containing information about all reproduced results of an agent on a benchmark.
- ReproducibilityAgent: You can run this agent on an existing study and it will try to re-run the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the AgentInfo HTML tab of AgentXray. On some tasks, you will be able to inspect what kind of changes occurred between the two executions. Note: this is a beta feature and will need some adaptation for your own agent.
If you want to download HF models more quickly:
pip install hf-transfer
pip install torch
export HF_HUB_ENABLE_HF_TRANSFER=1