Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223
Conversation
…arks while keeping other directories directly under evaluation/
Just noting that I have confirmed the code and it looks good to me, but I'd like a second review.
I'm trying to run 1 instance of swe-bench on this PR, and I get this error:
The shell scripts (every run_infer.sh in the benchmarks directories) need to be updated to run the right .py file. For example, ./evaluation/benchmarks/swe-bench/scripts/run_infer.sh contains the line
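For illustration only, a hedged sketch of the kind of check being described; the swe_bench directory name and the old module path are assumptions, not quotes from the actual script:

```bash
# Hypothetical check: show any line in the moved script that still points at the
# old top-level module path instead of the new evaluation/benchmarks/ location.
grep -n 'evaluation/swe_bench/run_infer.py' \
  evaluation/benchmarks/swe_bench/scripts/run_infer.sh
```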
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
@openhands-agent The previous comment was fixed for swe-bench.
Read the script. We will find a line like:
This will fail because the new location is in the evaluation/benchmarks/webarena/ directory. You are a smart LLM, you understand patterns. The same pattern is repeated for these benchmarks, give or take that some have more files than others. Verify each benchmark's shell scripts and modify accordingly.
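A minimal sketch of how that repeated fix could be applied across all of the moved benchmarks, assuming GNU sed and assuming each benchmark's directory name matches its old path segment:

```bash
# For every benchmark now under evaluation/benchmarks/, rewrite stale references
# to the old evaluation/<name>/ location inside its shell scripts.
for dir in evaluation/benchmarks/*/; do
  name=$(basename "$dir")
  find "$dir" -name '*.sh' -exec \
    sed -i "s|evaluation/${name}/|evaluation/benchmarks/${name}/|g" {} +
done
```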
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
The previous comments were fixed for shell scripts. After the refactoring of benchmarks from the ./evaluation directory to ./evaluation/benchmarks/, it is important that human users are still able to run these benchmarks easily. In every benchmark directory, there should be a README.md file. For example, in ./evaluation/benchmarks/swe-bench there is a README.md with instructions on how to set up and run the swe-bench benchmark. You can read it and see that it has, for example, a line like this:
If the human user copies and pastes that line with their own data, it will fail to run the script, because of course the swe-bench run_infer.sh script has moved to ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh. You're a smart LLM and you know patterns, remember. All these benchmarks are very similar and follow the same documentation patterns for human users. You can verify every .md file (not only README, check for more, maybe) in each benchmark and update it for this particular move. Keep it minimal; only solve this particular issue.
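A hedged sketch of the same kind of bulk update for the markdown docs; GNU sed is assumed, and the path pattern is an assumption about how the commands are written in those files:

```bash
# Update documentation commands in every benchmark's markdown files so that
# run_infer.sh (and sibling scripts) are referenced at their new location.
find evaluation/benchmarks -name '*.md' -exec \
  sed -i 's|\./evaluation/\([a-z_]*\)/scripts/|./evaluation/benchmarks/\1/scripts/|g' {} +
```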
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
Not all the README.md (or other .md) files discussed in the previous comment were updated for human users after the move of the benchmarks. Some were, some were not, so we need to fix the rest. But we have a problem. You're good, but you were not allowed to use the "str_replace_editor" tool when "old_str" is not unique in the file, so many of your attempted replacements were not performed. You had to go back and include more context. Then they were performed, but you ran out of time. You need to understand this very well, so that we do better this time.
Remember, we are refactoring the ./evaluation directory to house every benchmark under ./evaluation/benchmarks. Remember the previous comment about documentation for human users: this is what we fix now. Usually, there was more than one occurrence of the pattern in each file (such as "/evaluation/swe_bench" to be updated to the new location). It is possible there were two occurrences, or more, where one is the syntax of the command to run the benchmark and another is a particular example of running the command. First, think about how to do this better this time. You have two options:
Make a decision, then perform it. Do not ask me about it.
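To make the non-uniqueness problem concrete, a hypothetical count (the file path is an assumption) of how many README lines still contain an old path string; anything above 1 is exactly what made single string replacements ambiguous:

```bash
# Count the lines in one benchmark's README that still mention the old path; the
# command-syntax line and the worked example are often separate lines, so expect 2+.
grep -c 'evaluation/swe_bench' evaluation/benchmarks/swe_bench/README.md
```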
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
You're good! Your choice to run bash scripts was brilliant! You fixed the rest of the documentation for human users in only 4 steps this time. Now, you did very well, and I think all we have left is to double-check that there are no leftover old paths. If there are, we need to fix them. Leftovers could be in:
Remember to first look at all benchmarks, as we moved them from ./evaluation to ./evaluation/benchmarks/, so that you know what you are working with. Check in order and update as needed. FINALLY:
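A hedged sketch of that final sweep; the benchmark names and file extensions are assumptions, so adjust them to whatever the repository actually contains:

```bash
# List any remaining references to the old top-level benchmark locations in
# scripts, docs, workflows, and Python sources across the repository.
grep -rnE 'evaluation/(swe_bench|webarena)/' \
  --include='*.sh' --include='*.md' --include='*.yml' --include='*.py' .
```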
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
…ation directory" This reverts commit 4136c53.
@openhands-agent-exp Your bash skills may be impressive, but you got lazy last time and forgot what constraints you were supposed to be working under! I had to revert your last attempt. Be careful. We have only a couple of things left to fix here:
Remember to list all benchmarks first; it helps you know what to work with.
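Listing the moved benchmarks is a one-liner; this sketch just assumes the layout described earlier in the thread:

```bash
# Show every benchmark directory that now lives under evaluation/benchmarks/.
ls -d evaluation/benchmarks/*/
```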
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
fee148e to b28439d
This got a little too fun! 😅
- I think there may still be some small misses. They should be easily fixable when encountered later, though; up to you if you feel we need more rounds here. However:
- IMHO this is where the eval workflow with no-matter-what-llm could come in very handy. Maybe for 1) one instance and 2) 2+ workers (because there are some pieces of code that only get executed either if it's a single worker or multi).
- I ran 1 locally and I'm trying to run 2, `run_infer`. I didn't run `eval_infer`. I don't know why it seems slow/blocking today. I don't succeed with 2.
- The credits.md update failed spectacularly, twice [1]. I reverted it and I don't think it has to be part of this PR. I was going to do a licensing review later anyway, and it will be manual, clearly; I just got a bit too optimistic when I saw how well it kept track of all the benchmarks we have. 😅
- Sonnet with the `bash` tool is amazing!

Edited to add:
[1] Possibly one of the highest ratios of hallucination tokens to total tokens I've seen 😹
Hey @enyst, thanks so much for cleaning this up!
LGTM
This PR fixes #5222 by reorganizing the evaluation directory structure to improve clarity and maintainability.
Changes
- New `evaluation/benchmarks/` directory to house all ML literature benchmarks
- Other directories (`utils`, `integration_tests`, `regression`, `static`) kept directly under `evaluation/`
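For orientation, a sketch of the resulting layout implied by the list above (only directories named in this PR; the benchmark names are examples from the discussion):

```bash
# Expected layout after the move (illustrative, not exhaustive):
# evaluation/
# ├── benchmarks/
# │   ├── swe_bench/
# │   └── webarena/
# ├── integration_tests/
# ├── regression/
# ├── static/
# └── utils/
ls evaluation/ evaluation/benchmarks/
```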
Testing
Review Notes
Key files to review:
- `.github/workflows/eval-runner.yml` - Updated paths for integration tests and benchmarks
- `evaluation/README.md` - Added missing benchmarks and updated paths

To run this PR locally, use the following command: