Adds the GAIA benchmark to the Testbed. This PR depends on #792 and #810.

Merged
22 commits merged on Dec 6, 2023

Commits
84897b4
Re-added completion logging when using older versions of autogen.
afourney Nov 16, 2023
76ae8f5
Merge remote-tracking branch 'origin/testbed_add_014_logging' into te…
afourney Nov 18, 2023
014063e
Extended scenario definitions and templating to include folders.
afourney Nov 18, 2023
81c62c9
Prepare collate_human_eval.py for working with group chat scenarios.
afourney Nov 18, 2023
6f0a45f
Converted HumanEval to the folder-based approach, and added GroupChat…
afourney Nov 20, 2023
5e9fe60
Fixed the default termination message.
afourney Nov 20, 2023
f96245a
Fixed another termination condition.
afourney Nov 20, 2023
bf815aa
Merge branch 'main' into testbed_folders
afourney Nov 28, 2023
a95c748
Updated compatible autogen versions.
afourney Nov 28, 2023
b962226
Merge branch 'main' into testbed_folders
afourney Nov 29, 2023
46b0abc
Added initial support for GAIA benchmark.
afourney Nov 29, 2023
53cd5b4
Fixed a bug in executing the finalize scripts.
afourney Nov 29, 2023
2d97bb8
Generalized the template further to support multiple folder copy oper…
afourney Nov 29, 2023
f7b30a9
Merge branch 'testbed_folders' into testbed_gaia
afourney Nov 29, 2023
a22bea7
Refined GAIA support, and broke scenarios down by difficulty.
afourney Nov 29, 2023
b26e26d
Added some experimental scripts for computing metrics over GAIA. This…
afourney Nov 29, 2023
4ee155b
Added instructions for cloning GAIA
afourney Nov 29, 2023
b1118b2
Updated README to describe how to run GAIA. Included the GAIA test sc…
afourney Nov 30, 2023
66cd936
Updated README to fix some typos.
afourney Nov 30, 2023
23fa2ba
Added a script to format GAIA results for the leaderboard.
afourney Dec 1, 2023
90101a4
Merge branch 'main' into testbed_gaia
qingyun-wu Dec 5, 2023
08eec2b
Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/…
qingyun-wu Dec 6, 2023
25 changes: 24 additions & 1 deletion samples/tools/testbed/README.md
@@ -19,7 +19,8 @@ The Testbed also requires Docker (Desktop or Engine) AND the __python docker__ library
To run the Testbed, simply execute
``python run_scenarios.py scenarios/Examples``

- The default is to run each scenario once time. To run each scenario 10 times, use:
+ The default is to run each scenario once. To run each scenario 10 times, use:

``python run_scenarios.py --repeat 10 scenarios/Examples``

The run_scenarios.py script also allows a number of command-line arguments to control various parameters of execution. Type ``python run_scenarios.py -h`` to explore these options:
@@ -193,3 +194,25 @@ python ./run_scenarios.py scenarios/HumanEval/human_eval_two_agents_gpt35.jsonl
python utils/collate_human_eval.py ./results/human_eval_two_agents_gpt35 | python utils/metrics_human_eval.py > human_eval_results_gpt35.csv
cat human_eval_results_gpt35.csv
```

## (Example) Running GAIA

The Testbed can also be used to run the recently released [GAIA benchmark](https://huggingface.co/gaia-benchmark). This integration is presently experimental and needs further validation. In this scenario, agents are presented with a series of questions that may include file references or multi-modal input. Agents must then provide a `FINAL ANSWER`, which is considered correct only if it (nearly) exactly matches an unambiguously accepted answer.
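
Under the hood, answers are scored by comparing them after a light normalization (lowercasing, collapsing runs of whitespace, and stripping trailing punctuation). A minimal sketch of this comparison, mirroring the `normalize_answer` helper in `utils/collate_gaia_csv.py`:

```python
import re


def normalize_answer(a):
    # Lowercase, collapse whitespace runs, and strip trailing punctuation
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip().lower()))


# An attempt is marked correct when the normalized strings match exactly
print(normalize_answer(" Right Answer! ") == normalize_answer("right answer"))  # True
```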

Running this scenario type requires downloading and converting the GAIA dataset, running the Testbed, collating the results, and computing the metrics. The following commands accomplish all of this, running each test instance once with GPT-4:

```
# Clone the GAIA dataset repo (assuming a 'repos' folder in your home directory)
cd ~/repos
git clone https://huggingface.co/datasets/gaia-benchmark/GAIA

# Expand GAIA
cd ~/repos/autogen/samples/tools/testbed
python ./utils/expand_gaia.py ~/repos/GAIA

# Run GAIA
python ./run_scenarios.py ./scenarios/GAIA/gaia_validation_level_1__two_agents_gpt4.jsonl

# Compute Metrics
python utils/collate_gaia_csv.py ./results/gaia_validation_level_1__two_agents_gpt4 | python utils/metrics_gaia.py
```
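
The collated CSV contains one row per task and one column per trial. Per the collation script's help text, a `1` marks a passing trial, a `-1` a failing trial, and an empty cell a missing trial (e.g., due to a runtime error). An illustrative (hypothetical) fragment of the output:

```
TestId,Trial0
task-uuid-1,1
task-uuid-2,-1
```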
1 change: 1 addition & 0 deletions samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/expected_answer.txt
@@ -0,0 +1 @@
__EXPECTED_ANSWER__
66 changes: 66 additions & 0 deletions samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py
@@ -0,0 +1,66 @@
import autogen
from datetime import datetime
import testbed_utils

testbed_utils.init()
##############################


GAIA_SYSTEM_MESSAGE = (
    "You are a helpful AI assistant, and today's date is "
    + datetime.now().date().isoformat()
    + """.
I will ask you a question. Answer this question using your coding and language skills.
In the following cases, suggest python code (presented in a coding block beginning ```python) or shell script (presented in a coding block beginning ```sh) for the user to execute:
1. When you need to collect info, use the code to output the info you need, for example, browse or search the web, download/read a file, print the content of a webpage or a file, check the operating system. After sufficient info is printed and the task is ready to be solved based on your language skill, you can solve the task by yourself.
2. When you need to perform some task with code, use the code to perform the task and output the result. Finish the task smartly.
Answer the question step by step if you need to. If a plan is not provided, explain your plan first. Be clear about which steps use code and which use your language skill.
The user cannot provide any other feedback or perform any other action beyond executing the code appearing in the code block. The user can't modify your code, so do not suggest incomplete code that requires users to modify it. Don't use a code block if it's not intended to be executed by the user. Don't include multiple code blocks in one response. Do not ask users to copy and paste code or results. Instead, use the 'print' function for output when relevant. Check the execution result reported by the user.
If the result indicates there is an error, fix the error and output the code again. Suggest the full code instead of partial code or code changes. If the error can't be fixed, or if the task is not solved even after the code executes successfully, analyze the problem, revisit your assumptions, collect any additional info you need, and try a different approach.
When you find an answer, report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER].
YOUR FINAL ANSWER should be a number, OR as few words as possible, OR a comma-separated list of numbers and/or strings.
If you are asked for a number, don't use commas to write your number, and don't use units such as $ or percent signs unless specified otherwise.
If you are asked for a string, don't use articles or abbreviations (e.g., for cities), and write digits in plain text unless specified otherwise.
If you are asked for a comma-separated list, apply the above rules to each element depending on whether it is a number or a string.
""".strip()
)
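
# Note: "__MODEL__", "__FILE_NAME__", and "__PROMPT__" below are template
# placeholders, expected to be substituted with concrete values (e.g., by
# run_scenarios.py) when each GAIA task is expanded from its scenario definition.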


config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["__MODEL__"]},
)

assistant = autogen.AssistantAgent(
    "assistant",
    system_message=GAIA_SYSTEM_MESSAGE,
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0,
    llm_config=testbed_utils.default_llm_config(config_list, timeout=180),
)
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    is_termination_msg=lambda x: x.get("content", "").rstrip().find("FINAL ANSWER") >= 0,
    code_execution_config={
        "work_dir": "coding",
        "use_docker": False,
    },
    max_consecutive_auto_reply=10,
    default_auto_reply="",
)

filename = "__FILE_NAME__".strip()
question = """
__PROMPT__
""".strip()

if len(filename) > 0:
    question = f"Consider the file '{filename}', which can be read from the current working directory. {question}"

user_proxy.initiate_chat(assistant, message=question)


##############################
testbed_utils.finalize(agents=[assistant, user_proxy])
128 changes: 128 additions & 0 deletions samples/tools/testbed/utils/collate_gaia_csv.py
@@ -0,0 +1,128 @@
import os
import re
import argparse


def normalize_answer(a):
    # Lower case
    # Trim (left and right)
    # Replace multiple spaces with one space
    # Remove trailing punctuation
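    # Illustrative example: normalize_answer("  The  Answer!! ") -> "the answer"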
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip().lower()))


def collate(results_dir):
    """
    Collate the results of running GAIA.

    Args:
        results_dir (path): The folder where results were saved.
    """

    all_results = list()
    max_instances = 0

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        # Collect the results vector
        results = [test_id]

        instance = 0
        instance_dir = os.path.join(test_path, str(instance))
        while os.path.isdir(instance_dir):
            expected_answer_file = os.path.join(instance_dir, "expected_answer.txt")
            if not os.path.isfile(expected_answer_file):
                # The expected answer is missing
                results.append("")

                instance += 1
                instance_dir = os.path.join(test_path, str(instance))
                continue

            expected_answer = "!!!NULL ANSWER!!!"
            with open(expected_answer_file, "rt") as fh:
                expected_answer = fh.read().strip()

            console_log_file = os.path.join(instance_dir, "console_log.txt")
            if not os.path.isfile(console_log_file):
                # The console log file is missing
                results.append("")

                instance += 1
                instance_dir = os.path.join(test_path, str(instance))
                continue

            with open(console_log_file, "rt") as fh:
                console_log = fh.read()

            final_answer = ""
            m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
            if m:
                final_answer = m.group(1).strip()

            # print(f"Expected Answer: {expected_answer}\nAutogen Answer: {final_answer}\n")

            if normalize_answer(expected_answer) == normalize_answer(final_answer):
                results.append("1")
            else:
                results.append("-1")

            instance += 1
            instance_dir = os.path.join(test_path, str(instance))

        max_instances = max(max_instances, instance)

        # Buffer the results
        all_results.append(results)

    # Create a header
    header = "TestId"
    for i in range(0, max_instances):
        header += ",Trial" + str(i)
    print(header)

    # Print a fully-populated table of results
    for r in all_results:
        while len(r) < max_instances + 1:
            r.append("")
        print(",".join(r))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the GAIA scenarios and output them to a CSV. The CSV format is as follows:

TestId, Trial0, Trial1, ..., TrialN
uuid_1, x_10, x_11, ..., x_1N
uuid_2, x_20, x_21, ..., x_2N
...
uuid_M, x_M0, x_M1, ..., x_MN

where uuid_i is the identifier of the i-th test question, and x_ij is 1 or -1 depending on whether the test passed or failed, respectively. If data for a trial is missing (e.g., due to a runtime error), the value will be the empty string "".
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)
76 changes: 76 additions & 0 deletions samples/tools/testbed/utils/collate_gaia_jsonl.py
@@ -0,0 +1,76 @@
import os
import json
import re
import argparse


def normalize_answer(a):
    # Trim (left and right)
    # Replace multiple spaces with one space
    # Remove trailing punctuation
    # Trim again
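    # Illustrative example: normalize_answer("  The  Answer!! ") -> "The Answer" (case is preserved)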
    return re.sub(r"[\.\!\?]+$", "", re.sub(r"\s+", " ", a.strip())).strip()


def collate(results_dir, instance=0):
    """
    Collate the results of running GAIA. Print the results in the format accepted by the leaderboard.

    Args:
        results_dir (path): The folder where results were saved.
        instance (int): The trial number to collate for each test (default: 0).
    """

    for test_id in os.listdir(results_dir):
        test_path = os.path.join(results_dir, test_id)

        instance_dir = os.path.join(test_path, str(instance))
        console_log_file = os.path.join(instance_dir, "console_log.txt")

        final_answer = ""
        console_log = ""  # Default, so a missing log file still produces a row
        if os.path.isfile(console_log_file):
            with open(console_log_file, "rt") as fh:
                console_log = fh.read()

            m = re.search(r"FINAL ANSWER:(.*?)\n", console_log, re.DOTALL)
            if m:
                final_answer = normalize_answer(m.group(1))

            # Clean up the GAIA logs so they don't include the Docker setup preamble
            m = re.search(r"^.*?\r?\n(user_proxy \(to assistant\).*$)", console_log, re.DOTALL)
            if m:
                console_log = m.group(1)

        print(json.dumps({"task_id": test_id, "model_answer": final_answer, "reasoning_trace": console_log}))


###############################################################################
if __name__ == "__main__":
    script_path = os.path.realpath(__file__)
    script_name = os.path.basename(script_path)
    script_dir = os.path.dirname(script_path)

    # Path to the default results directory
    # (relative to this script, up one directory, then into the results folder)
    default_results_dir = os.path.realpath(
        os.path.join(script_dir, os.path.pardir, "results", "gaia_validation_level_1__two_agents_gpt4")
    )

    parser = argparse.ArgumentParser(
        description=f"""
{script_name} will collate the results of the GAIA scenarios into the jsonl format that can be submitted to the GAIA leaderboard.

NOTE: You will likely need to concatenate the results for level 1, level 2, and level 3 to form a complete submission.
""".strip(),
        formatter_class=argparse.RawTextHelpFormatter,
    )

    parser.add_argument(
        "scenario",
        nargs="?",
        help="Path to the scenario results. (default: " + default_results_dir + ")",
        default=default_results_dir,
    )
    args = parser.parse_args()
    collate(args.scenario)