Adding Benchmarks back into agbench and updates to agbench #3711

Merged
merged 30 commits into from
Oct 11, 2024
bb7f6c5
added all benchmarks to agbench
husseinmozannar Oct 4, 2024
d1b0044
minor fix
husseinmozannar Oct 4, 2024
5bb063a
Merge branch 'staging' of https://github.com/microsoft/autogen into b…
husseinmozannar Oct 7, 2024
933d79b
revert readme magentic one
husseinmozannar Oct 7, 2024
18899af
revert autogen core changes
husseinmozannar Oct 7, 2024
02c1471
Merge branch 'staging' into benchmark_add_agbench
husseinmozannar Oct 7, 2024
7b78309
make agbench display average scores and average exact match, add scip…
husseinmozannar Oct 7, 2024
358b575
Merge branch 'benchmark_add_agbench' of https://github.com/husseinmoz…
husseinmozannar Oct 7, 2024
72fba1a
minor fixes for assistantbench
husseinmozannar Oct 7, 2024
1c49bd2
fix exact match score
husseinmozannar Oct 8, 2024
9326eef
make run_cmd be able to run things in parallel
husseinmozannar Oct 8, 2024
949097a
add utility to remove missing, dangerous
husseinmozannar Oct 8, 2024
beaba75
add remove missing command to agbench
husseinmozannar Oct 9, 2024
a91ef8b
simple dashboard utility
husseinmozannar Oct 9, 2024
b7d1dfb
update readme, small fixes to remove_missing
husseinmozannar Oct 9, 2024
3f8fc85
Merge branch 'staging' into benchmark_add_agbench
husseinmozannar Oct 9, 2024
a7810d4
uv lock
husseinmozannar Oct 9, 2024
92464ea
fixed mypy issues
husseinmozannar Oct 9, 2024
b7928b9
Merge branch 'main' of https://github.com/microsoft/autogen into benc…
husseinmozannar Oct 9, 2024
7b0fd70
format and restore tomllib
husseinmozannar Oct 9, 2024
694f8c5
Merge branch 'main' into benchmark_add_agbench
husseinmozannar Oct 9, 2024
d75f70e
Merge branch 'main' into benchmark_add_agbench
husseinmozannar Oct 10, 2024
af5de05
Merge branch 'main' into benchmark_add_agbench
rysweet Oct 10, 2024
2782e99
removed flask app
husseinmozannar Oct 10, 2024
a7b6fd0
Merge branch 'main' into benchmark_add_agbench
rysweet Oct 10, 2024
23cab1c
Merge branch 'main' into benchmark_add_agbench
jackgerrits Oct 11, 2024
34ae8ff
Merge branch 'main' into benchmark_add_agbench
husseinmozannar Oct 11, 2024
4ccce02
Merge branch 'main' into benchmark_add_agbench
husseinmozannar Oct 11, 2024
628cb99
remove benchmarks while we figure out licenses
husseinmozannar Oct 11, 2024
af2f42b
Merge branch 'main' into benchmark_add_agbench
husseinmozannar Oct 11, 2024
2 changes: 1 addition & 1 deletion python/packages/agbench/.gitignore
@@ -1,3 +1,3 @@
scenarios/*/Downloads
scenarios/*/Tasks
*/Results
*/Results
45 changes: 9 additions & 36 deletions python/packages/agbench/CONTRIBUTING.md
Expand Up @@ -6,12 +6,11 @@ As part of the broader AutoGen project, AutoGenBench welcomes community contribu
We ask that all contributions to AutoGenBench adhere to the following:

- Follow AutoGen's broader [contribution guidelines](https://microsoft.github.io/autogen/docs/Contribute)
- All AutoGenBench benchmarks should live in a subfolder of `/samples/tools/autogenbench/scenarios` alongside `HumanEval`, `GAIA`, etc.
- All AutoGenBench benchmarks should live in a subfolder of `/benchmarks` alongside `HumanEval`, `GAIA`, etc.
- Benchmark scenarios should include a detailed README.md, in the root of their folder, describing the benchmark and providing citations where warranted.
- Benchmark data (tasks, ground truth, etc.) should be downloaded from their original sources rather than hosted in the AutoGen repository (unless the benchmark is original, and the repository *is* the original source)
- You can use the `Scripts/init_tasks.py` file to automate this download.
- Basic scoring should be compatible with the `autogenbench tabulate` command (e.g., by outputting logs compatible with the default tabulation mechanism, or by providing a `Scripts/custom_tabulate.py` file)
- If you wish your benchmark to be compatible with the `autogenbench clone` command, include a `MANIFEST.json` file in the root of your folder.
- Basic scoring should be compatible with the `agbench tabulate` command (e.g., by outputting logs compatible with the default tabulation mechanism, or by providing a `Scripts/custom_tabulate.py` file)

These requirements are further detailed below, but if you simply copy the `HumanEval` folder, you will already be off to a great start.

Expand Down Expand Up @@ -62,16 +61,16 @@ For example:

In this example, the string `__MODEL__` will be replaced in the file `scenarios.py`, while the string `__PROMPT__` will be replaced in the `prompt.txt` file.

The `template` field can also take on a list value, but this usage is considered advanced and is not described here. See the `autogenbench/run_cmd.py` code, or the `GAIA` benchmark tasks files for additional information about this option.
The `template` field can also take on a list value, but this usage is considered advanced and is not described here. See the `agbench/run_cmd.py` code, or the `GAIA` benchmark tasks files for additional information about this option.


## Task Instance Expansion Algorithm

Once the tasks have been defined, as per above, they must be "instantiated" before they can be run. This instantiation happens automatically when the user issues the `autogenbench run` command and involves creating a local folder to share with Docker. Each instance and repetition gets its own folder along the path: `./results/[scenario]/[task_id]/[instance_id]`. For the sake of brevity we will refer to this folder as the `DEST_FOLDER`.
Once the tasks have been defined, as per above, they must be "instantiated" before they can be run. This instantiation happens automatically when the user issues the `agbench run` command and involves creating a local folder to share with Docker. Each instance and repetition gets its own folder along the path: `./results/[scenario]/[task_id]/[instance_id]`. For the sake of brevity we will refer to this folder as the `DEST_FOLDER`.

The algorithm for populating the `DEST_FOLDER` is as follows:

1. Pre-populate DEST_FOLDER with all the basic starter files for running a scenario (found in `autogenbench/template`).
1. Pre-populate DEST_FOLDER with all the basic starter files for running a scenario (found in `agbench/template`).
2. Recursively copy the template folder specified in the JSONL line to DEST_FOLDER (if the JSON `template` attribute points to a folder). If the JSON's `template` attribute instead points to a file, copy the file, but rename it to `scenario.py`.
3. Apply any string replacements, as outlined in the prior section.
4. Write a run.sh file to DEST_FOLDER that will be executed by Docker when it is loaded. The `run.sh` is described below.
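Steps 1 to 3 of the algorithm above can be sketched in Python as follows. This is a minimal illustration, not the code in `agbench/run_cmd.py`: the function name `expand_instance`, its parameters, and the shape of the `substitutions` mapping (file name to find/replace pairs) are assumptions made for this example. Step 4 (writing `run.sh`) is omitted.

```python
import shutil
from pathlib import Path
from typing import Dict


def expand_instance(
    starter_dir: Path,
    template: Path,
    substitutions: Dict[str, Dict[str, str]],
    dest_folder: Path,
) -> None:
    """Populate dest_folder following steps 1-3 (hypothetical helper)."""
    # Step 1: copy the basic starter files into the destination.
    shutil.copytree(starter_dir, dest_folder, dirs_exist_ok=True)

    # Step 2: copy the template folder, or a single file renamed to scenario.py.
    if template.is_dir():
        shutil.copytree(template, dest_folder, dirs_exist_ok=True)
    else:
        shutil.copyfile(template, dest_folder / "scenario.py")

    # Step 3: apply the per-file string replacements.
    for filename, replacements in substitutions.items():
        target = dest_folder / filename
        text = target.read_text(encoding="utf-8")
        for find, replace in replacements.items():
            text = text.replace(find, replace)
        target.write_text(text, encoding="utf-8")
```

Copying the template after the starter files means a template can override any starter file of the same name, which matches the ordering of the steps above.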
Expand Down Expand Up @@ -139,9 +138,8 @@ echo RUN.SH COMPLETE !#!#
Be warned that this listing is provided here for illustration purposes, and may vary over time. The sources of truth are the `run.sh` files found in the ``./results/[taskset]/[task_id]/[instance_id]`` folders.


## Integrating with the `tabulate` and `clone` commands.

The above details are sufficient for defining and running tasks, but if you wish to support the `autogenbench tabulate` and `autogenbench clone` commands, a few additional steps are required.
## Integrating with the `tabulate` command

The above details are sufficient for defining and running tasks, but if you wish to support the `agbench tabulate` command, a few additional steps are required.

### Tabulations

Expand All @@ -154,35 +152,10 @@ Should you provide a custom tabulation script, please implement `--help` and `-h
The `scenarios/GAIA/Scripts/custom_tabulate.py` is a great example of custom tabulation. It also shows how you can reuse some components of the default tabulator to speed up development.


### Cloning

If you wish your benchmark to be available via the `autogenbench clone` command, you will need to take three additional steps:

#### Manifest
First, provide a `MANIFEST.json` file in the root of your benchmark. An example is provided below, from which you can see the schema:

```json
{
"files": {
"Templates/TwoAgents/prompt.txt": "Templates/TwoAgents/prompt.txt",
"Templates/TwoAgents/coding/my_tests.py": "Templates/TwoAgents/coding/my_tests.py",
"Templates/TwoAgents/scenario.py": "Templates/TwoAgents/scenario.py",
"README.md": "README.md",
"Scripts/init_tasks.py": "Scripts/init_tasks.py",
"Scripts/custom_tabulate.py": "Scripts/custom_tabulate.py"
}
}
```

The keys of the `files` dictionary are local paths, relative to your benchmark's root directory. The values are relative paths in the AutoGen GitHub repository (relative to the folder where the MANIFEST.json file is located). In most cases, the keys and values will be identical.

#### SCENARIOS dictionary
Second, you must add an entry to the `scenarios` dictionary in `autogen/samples/tools/autogenbench/scenarios/MANIFEST.json`.

#### Scripts/init_tasks.py
Finally, you should provide an `Scripts/init_tasks.py` file, in your benchmark folder, and include a `main()` method therein. This method will be loaded and called automatically by `autogenbench clone` after all manifest files have been downloaded.
## Scripts/init_tasks.py
Finally, you should provide a `Scripts/init_tasks.py` file, in your benchmark folder, and include a `main()` method therein.

This `init_tasks.py` script is a great place to download benchmarks from their original sources and convert them to the JSONL format required by AutoGenBench:
- See `HumanEval/Scripts/init_tasks.py` for an example of how to expand a benchmark from an original GitHub repository.
- See `GAIA/Scripts/init_tasks.py` for an example of how to expand a benchmark from `Hugging Face Hub`.
- See `MATH/Scripts/init_tasks.py` for an example of how to expand a benchmark from an author-hosted website.
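As a rough sketch of what such a script might look like, the example below converts a list of raw problems into JSONL task lines. The record fields (`id`, `template`, `substitutions`), the helper `build_task_lines`, the output file name, and the placeholder problem are assumptions for illustration; consult the existing `init_tasks.py` files listed above for the actual conventions.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def build_task_lines(problems: List[Dict[str, str]], template: str) -> List[str]:
    """Convert raw benchmark records into one JSON string per task (hypothetical helper)."""
    lines = []
    for problem in problems:
        record: Dict[str, Any] = {
            "id": problem["id"],
            "template": template,
            # Assumed shape: file name -> {placeholder: replacement text}.
            "substitutions": {
                "prompt.txt": {"__PROMPT__": problem["prompt"]},
            },
        }
        lines.append(json.dumps(record))
    return lines


def main() -> None:
    # Placeholder data; a real script would download the benchmark
    # from its original source at this point.
    problems = [{"id": "example_0", "prompt": "Write a function that adds two numbers."}]
    tasks_dir = Path("Tasks")
    tasks_dir.mkdir(exist_ok=True)
    out_file = tasks_dir / "example_two_agents.jsonl"
    lines = build_task_lines(problems, "Templates/TwoAgents")
    out_file.write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    main()
```

Keeping the conversion in a pure function separate from `main()` makes it easy to check the generated records without touching the network or the file system.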
122 changes: 79 additions & 43 deletions python/packages/agbench/README.md
@@ -1,29 +1,35 @@
# AutoGenBench

AutoGenBench is a tool for repeatedly running a set of pre-defined AutoGen tasks in a setting with tightly-controlled initial conditions. With each run, AutoGenBench will start from a blank slate. The agents being evaluated will need to work out what code needs to be written, and what libraries or dependencies to install, to solve tasks. The results of each run are logged, and can be ingested by analysis or metrics scripts (such as `autogenbench tabulate`). By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety.
AutoGenBench (agbench) is a tool for repeatedly running a set of pre-defined AutoGen tasks in a setting with tightly-controlled initial conditions. With each run, AutoGenBench will start from a blank slate. The agents being evaluated will need to work out what code needs to be written, and what libraries or dependencies to install, to solve tasks. The results of each run are logged, and can be ingested by analysis or metrics scripts (such as `agbench tabulate`). By default, all runs are conducted in freshly-initialized docker containers, providing the recommended level of consistency and safety.

AutoGenBench works with all AutoGen 0.1.*, and 0.2.* versions.

## Technical Specifications

If you are already an AutoGenBench pro, and want the full technical specifications, please review the [contributor's guide](CONTRIBUTING.md).

If you are already an AutoGenBench pro, and want the full technical specifications, please review the [contributor's guide](CONTRIBUTING.md).

## Docker Requirement

AutoGenBench also requires Docker (Desktop or Engine). **It will not run in GitHub codespaces**, unless you opt for native execution (which is strongly discouraged). To install Docker Desktop see [https://www.docker.com/products/docker-desktop/](https://www.docker.com/products/docker-desktop/).

If you are working in WSL, you can follow the instructions below to set up your environment:

1. Install Docker Desktop. After installation, a restart is needed. Then open Docker Desktop and, under Settings > Resources > WSL Integration, enable integration with additional distros (Ubuntu).
2. Clone autogen and export `AUTOGEN_REPO_BASE`. This environment variable enables the Docker containers to use the correct version of the agents.
```bash
git clone git@github.com:microsoft/autogen.git
export AUTOGEN_REPO_BASE=<path_to_autogen>
```

## Installation and Setup

**To get the most out of AutoGenBench, the `autogenbench` package should be installed**. At present, the easiest way to do this is to install it via `pip`:
[Deprecated currently] **To get the most out of AutoGenBench, the `agbench` package should be installed**. At present, the easiest way to do this is to install it via `pip`.

```
pip install autogenbench
```

If you would prefer working from source code (e.g., for development, or to utilize an alternate branch), simply clone the [AutoGen](https://github.com/microsoft/autogen) repository, then install `autogenbench` via:
If you would prefer working from source code (e.g., for development, or to utilize an alternate branch), simply clone the [AutoGen](https://github.com/microsoft/autogen) repository, then install `agbench` via:

```
pip install -e autogen/samples/tools/autogenbench
pip install -e autogen/python/packages/agbench
```

After installation, you must configure your API keys. As with other AutoGen applications, AutoGenBench will look for the OpenAI keys in the OAI_CONFIG_LIST file in the current working directory, or the OAI_CONFIG_LIST environment variable. This behavior can be overridden using a command-line parameter described later.
Expand All @@ -36,7 +42,6 @@ export OAI_CONFIG_LIST=$(cat ./OAI_CONFIG_LIST)

If an OAI_CONFIG_LIST is *not* provided (by means of file or environment variable), AutoGenBench will use the OPENAI_API_KEY environment variable instead.
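The lookup described above can be pictured with the following sketch. It is not AutoGenBench's implementation: the function name `resolve_openai_config` is hypothetical, and the precedence of the file over the environment variable is an assumption, since the text only states that both locations are consulted before falling back to `OPENAI_API_KEY`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_openai_config(
    cwd: Path, environ: Mapping[str, str]
) -> Optional[Tuple[str, str]]:
    """Return (source, value) for the first credential source found (hypothetical helper)."""
    # Assumed first choice: an OAI_CONFIG_LIST file in the working directory.
    config_file = cwd / "OAI_CONFIG_LIST"
    if config_file.is_file():
        return ("file", config_file.read_text(encoding="utf-8"))
    # Next: the OAI_CONFIG_LIST environment variable.
    if environ.get("OAI_CONFIG_LIST"):
        return ("env", environ["OAI_CONFIG_LIST"])
    # Fallback: a bare OPENAI_API_KEY.
    if environ.get("OPENAI_API_KEY"):
        return ("api_key", environ["OPENAI_API_KEY"])
    return None


# Example call against the real working directory and environment.
found = resolve_openai_config(Path.cwd(), os.environ)
```

Passing the directory and environment as arguments keeps the function easy to exercise with temporary folders and plain dictionaries.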


For some benchmark scenarios, additional keys may be required (e.g., keys for the Bing Search API). These can be added to an `ENV.json` file in the current working folder. An example `ENV.json` file is provided below:

```
Expand All @@ -46,75 +51,106 @@ For some benchmark scenarios, additional keys may be required (e.g., keys for th
```

## A Typical Session

Once AutoGenBench and necessary keys are installed, a typical session will look as follows:

```
autogenbench clone HumanEval
cd HumanEval
autogenbench run Tasks/r_human_eval_two_agents.jsonl
autogenbench tabulate results/r_human_eval_two_agents
```

Where:
- `autogenbench clone HumanEval` downloads and expands the HumanEval benchmark scenario.
- `autogenbench run Tasks/r_human_eval_two_agents.jsonl` runs the tasks defined in `Tasks/r_human_eval_two_agents.jsonl`
- `autogenbench tablue results/r_human_eval_two_agents` tabulates the results of the run

Each of these commands has extensive in-line help via:
Navigate to HumanEval

- `autogenbench --help`
- `autogenbench clone --help`
- `autogenbench run --help`
- `autogenbench tabulate --help`
```bash
cd autogen/python/packages/agbench/benchmarks/HumanEval
```
**Note:** The following instructions are specific to the HumanEval benchmark. For other benchmarks, please refer to the README in the respective benchmark folder, e.g., [AssistantBench](benchmarks/AssistantBench/README.md).

**NOTE:** If you are running `autogenbench` from within the repository, you don’t need to run `autogenbench clone`. Instead, navigate to the appropriate scenario folder (e.g., `scenarios/HumanEval`) and run the `Scripts/init_tasks.py` file.

More details of each command are provided in the sections that follow.
If you are using MagenticOne, create a file called `ENV.json` with the following (required) contents. When using Azure:

## Cloning Benchmarks
To clone an existing benchmark, simply run:
```
autogenbench clone [BENCHMARK]
```json
{
"CHAT_COMPLETION_KWARGS_JSON": "{}",
"CHAT_COMPLETION_PROVIDER": "azure"
}
```

For example,
You can also use the OpenAI client by replacing the two entries in the `ENV.json` file with:

- `CHAT_COMPLETION_PROVIDER='openai'`
- `CHAT_COMPLETION_KWARGS_JSON` with the following JSON structure:

```json
{
"api_key": "REPLACE_WITH_YOUR_API",
"model": "REPLACE_WITH_YOUR_MODEL"
}
```
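Because the Azure example stores `CHAT_COMPLETION_KWARGS_JSON` as a string (`"{}"`), the OpenAI settings presumably need to be JSON-encoded into a string as well. Under that assumption, the sketch below builds the `ENV.json` contents programmatically, which avoids escaping the nested quotes by hand. The helper name `build_env` is hypothetical.

```python
import json
from typing import Dict


def build_env(provider: str, completion_kwargs: Dict[str, str]) -> Dict[str, str]:
    """Return ENV.json contents, storing the kwargs as a JSON-encoded string (assumed format)."""
    return {
        "CHAT_COMPLETION_KWARGS_JSON": json.dumps(completion_kwargs),
        "CHAT_COMPLETION_PROVIDER": provider,
    }


env = build_env(
    "openai",
    {"api_key": "REPLACE_WITH_YOUR_API", "model": "REPLACE_WITH_YOUR_MODEL"},
)
# Print the text to save as ENV.json in the benchmark folder.
print(json.dumps(env, indent=4))
```

Calling `build_env("azure", {})` reproduces the Azure example shown earlier, with an empty `"{}"` string for the keyword arguments.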
autogenbench clone HumanEval

Now initialize the tasks.

```bash
python Scripts/init_tasks.py
```

To see which existing benchmarks are available to clone, run:
Note: This will attempt to download HumanEval


Once the script completes, you should now see a folder in your current directory called `Tasks` that contains one JSONL file per template in `Templates`.

Now to run a specific subset of HumanEval use:

```bash
agbench run Tasks/human_eval_MagenticOne.jsonl
```
autogenbench clone --list

You should see the command line print the raw logs that show the agents in action. To see a summary of the results (e.g., task completion rates), run the following in a new terminal:

```bash
agbench tabulate Results/human_eval_MagenticOne
```

> Note: You might need to log in to HuggingFace to access certain datasets like GAIA. To do this, run `huggingface-cli login` in your terminal and follow the prompts.
Where:

- `agbench run Tasks/human_eval_MagenticOne.jsonl` runs the tasks defined in `Tasks/human_eval_MagenticOne.jsonl`
- `agbench tabulate Results/human_eval_MagenticOne` tabulates the results of the run

Each of these commands has extensive in-line help via:

- `agbench --help`
- `agbench run --help`
- `agbench tabulate --help`
- `agbench remove_missing --help`

**NOTE:** If you are running `agbench` from within the repository, you need to navigate to the appropriate scenario folder (e.g., `scenarios/HumanEval`) and run the `Scripts/init_tasks.py` file.

More details of each command are provided in the sections that follow.


## Running AutoGenBench

To run a benchmark (which executes the tasks, but does not compute metrics), simply execute:

```
cd [BENCHMARK]
autogenbench run Tasks
agbench run Tasks/*.jsonl
```

For example,

```
cd HumanEval
autogenbench run Tasks
agbench run Tasks/human_eval_MagenticOne.jsonl
```

The default is to run each task once. To run each scenario 10 times, use:

```
autogenbench run --repeat 10 Tasks
agbench run --repeat 10 Tasks/human_eval_MagenticOne.jsonl
```

The `autogenbench` command-line tool allows a number of command-line arguments to control various parameters of execution. Type ``autogenbench -h`` to explore these options:
The `agbench` command-line tool allows a number of command-line arguments to control various parameters of execution. Type ``agbench -h`` to explore these options:

```
'autogenbench run' will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.
'agbench run' will run the specified autogen scenarios for a given number of repetitions and record all logs and trace information. When running in a Docker environment (default), each run will begin from a common, tightly controlled, environment. The resultant logs can then be further processed by other scripts to produce metrics.

positional arguments:
scenario The JSONL scenario file to run. If a directory is specified,
Expand All @@ -140,7 +176,7 @@ options:
The requirements file to pip install before running the scenario.
-d DOCKER_IMAGE, --docker-image DOCKER_IMAGE
The Docker image to use when running scenarios. Can not be used together with --native. (default:
'autogenbench:default', which will be created if not present)
'agbench:default', which will be created if not present)
--native Run the scenarios natively rather than in docker. NOTE: This is not advisable, and should be done
with great caution.
```
Expand Down Expand Up @@ -171,4 +207,4 @@ Within each folder, you will find the following files:

## Contributing or Defining New Tasks or Benchmarks

If you would like to develop -- or even contribute -- your own tasks or benchmarks, please review the [contributor's guide](CONTRIBUTING.md) for complete technical details.
If you would like to develop -- or even contribute -- your own tasks or benchmarks, please review the [contributor&#39;s guide](CONTRIBUTING.md) for complete technical details.
3 changes: 2 additions & 1 deletion python/packages/agbench/pyproject.toml
Expand Up @@ -24,7 +24,8 @@ dependencies = [
"huggingface_hub",
"tabulate",
"azure-identity",
"pandas"
"pandas",
"scipy"
]

[tool.uv]
Expand Down
14 changes: 7 additions & 7 deletions python/packages/agbench/src/agbench/cli.py
Expand Up @@ -3,6 +3,7 @@

from typing_extensions import TypedDict

from .remove_missing_cmd import remove_missing_cli
from .run_cmd import run_cli
from .tabulate_cmd import tabulate_cli
from .version import __version__
Expand Down Expand Up @@ -32,6 +33,11 @@ def main(args: Optional[List[str]] = None) -> None:
"description": "tabulate the results of a previous run",
"function": tabulate_cli,
},
{
"command": "remove_missing",
"description": "remove folders with missing results",
"function": remove_missing_cli,
},
{
"command": "--version",
"description": f"print the version of {invocation_cmd}",
Expand Down Expand Up @@ -68,20 +74,14 @@ def main(args: Optional[List[str]] = None) -> None:

{invocation_cmd} is a tool for running and managing AutoGen benchmark scenarios. A typical session might resemble:

{invocation_cmd} clone HumanEval
cd HumanEval
{invocation_cmd} run Tasks/human_eval_two_agents_gpt4.jsonl

which will download the HumanEval benchmark, expand it, and then run the benchmark once with the `human_eval_two_agents_gpt4` configuration.

\
Available COMMANDs include:

{commands_details}

Additionally, you can use the --help option with any command for further command-specific instructions. E.g.,

{invocation_cmd} run --help
{invocation_cmd} clone --help

""".strip()
