Adds the GAIA benchmark to the Testbed. This PR depends on #792 #810
… is a first version, and will likely need refinement.
Nice PR! I am reviewing 792 rn, and will move on to this PR later today or tmr.
…enarios in expand_gaia.py
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #810      +/-   ##
==========================================
- Coverage   26.63%   26.44%   -0.19%
==========================================
  Files          28       28
  Lines        3725     3725
  Branches      847      847
==========================================
- Hits          992      985       -7
- Misses       2660     2666       +6
- Partials       73       74       +1
LGTM
samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py
@qingyun-wu I'll copy the utils to the utils folder. Will do it in a separate PR.
…scenario.py Co-authored-by: LeoLjl <[email protected]>
… (microsoft#810)
* Re-added completion logging when using older versions of autogen.
* Extended scenario definitions and templating to include folders.
* Prepare collate_human_eval.py for working with group chat scenarios.
* Converted HumanEval to the folder-based approach, and added GroupChat scenarios.
* Fixed the default termination message.
* Fixed another termination condition.
* Updated compatible autogen versions.
* Added initial support for GAIA benchmark.
* Fixed a bug in executing the finalize scripts.
* Generalized the template further to support multiple folder copy operations.
* Refined GAIA support, and broke scenarios down by difficulty.
* Added some experimental scripts for computing metrics over GAIA. This is a first version, and will likely need refinement.
* Added instructions for cloning GAIA
* Updated README to fix some typos.
* Added a script to format GAIA results for the leaderboard.
* Update samples/tools/testbed/scenarios/GAIA/Templates/BasicTwoAgents/scenario.py Co-authored-by: LeoLjl <[email protected]>
---------
Co-authored-by: Qingyun Wu <[email protected]>
Co-authored-by: LeoLjl <[email protected]>
Why are these changes needed?
This PR adds initial support for the GAIA benchmark to the Testbed.
Note: This PR depends on #792. Merge that one first.
Related issue number
N/A
Checks