
Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223

Merged
merged 11 commits into main from openhands-fix-issue-5222
Nov 25, 2024

Conversation

openhands-agent
Contributor

@openhands-agent openhands-agent commented Nov 23, 2024

This PR fixes #5222 by reorganizing the evaluation directory structure to improve clarity and maintainability.

Changes

  • Created evaluation/benchmarks/ directory to house all ML literature benchmarks
  • Kept utility directories (utils, integration_tests, regression, static) directly under evaluation/
  • Updated paths in documentation and GitHub workflows to reflect the new structure
  • Added missing benchmarks to evaluation/README.md:
    • Commit0 and DiscoveryBench under Software Engineering
    • Browsing Delegation under Web Browsing
    • ScienceAgentBench under Misc. Assistance

Testing

  • All pre-commit hooks pass (ruff, mypy, etc.)
  • All unit tests pass (377 tests)

Review Notes

Key files to review:

  • .github/workflows/eval-runner.yml - Updated paths for integration tests and benchmarks
  • evaluation/README.md - Added missing benchmarks and updated paths
  • Documentation files - Updated references to benchmark paths

To run this PR locally, use the following command:

docker run -it --rm \
  -p 3000:3000 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  --add-host host.docker.internal:host-gateway \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:b28439d-nikolaik \
  --name openhands-app-b28439d \
  docker.all-hands.dev/all-hands-ai/openhands:b28439d

@neubig neubig marked this pull request as ready for review November 23, 2024 13:50
@neubig
Contributor

neubig commented Nov 23, 2024

Just noting that I have confirmed the code and it looks good to me, but I'd like a second review.

@neubig neubig requested a review from enyst November 23, 2024 18:26
@enyst
Collaborator

enyst commented Nov 23, 2024

I'm trying to run 1 instance of swe-bench on this PR, and I get this error:

....venv/bin/python: can't open file '/Users/enyst/repos/odie/evaluation/swe_bench/run_infer.py': [Errno 2] No such file or directory

The shell scripts (every run_infer.sh in the benchmarks directories) need to be updated to run the right .py file.

For example, ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh contains the line

COMMAND="poetry run python evaluation/swe_bench/run_infer.py \

@enyst enyst added the fix-me Attempt to fix this issue with OpenHands label Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent
Contributor Author

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst
Collaborator

enyst commented Nov 23, 2024

@openhands-agent The previous comment was fixed for swe-bench.
For every other benchmark that was moved from the ./evaluation directory to ./evaluation/benchmarks/, we need to verify whether there are scripts that hard-code the old structure when calling some other script. For example, in ./evaluation/benchmarks/webarena, we can search for .sh scripts, and we will find:

./evaluation/benchmarks/webarena/scripts/run_infer.sh

Read the script. We will find a line like:

COMMAND="poetry run python evaluation/webarena/run_infer.py \

This will fail because the new location is in evaluation/benchmarks/webarena/ directory.

You are a smart LLM; you understand patterns. The same pattern is repeated for these benchmarks, give or take that some have more files than others. Verify each benchmark's shell scripts and modify them accordingly.
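
One way to list the offending lines across all benchmarks at once is something along these lines (a rough sketch; the patterns may need tuning):

# Shell scripts that still invoke a run script via the old evaluation/<benchmark> path
grep -rn --include='*.sh' 'python evaluation/' evaluation/benchmarks \
  | grep -v 'python evaluation/benchmarks/'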


OpenHands started fixing the pr! You can monitor the progress here.

@enyst enyst added fix-me Attempt to fix this issue with OpenHands and removed fix-me Attempt to fix this issue with OpenHands labels Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent
Contributor Author

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst
Collaborator

enyst commented Nov 23, 2024

The previous comments were fixed for shell scripts.

After the refactoring of benchmarks from the ./evaluation directory to ./evaluation/benchmarks/, it is important that human users are still able to run these benchmarks easily. In every benchmark directory, there should be a README.md file. For example, in ./evaluation/benchmarks/swe_bench, there is a README.md with instructions on how to set up and run the swe-bench benchmark.

You can read it, and see it has, for example, a line like this:

./evaluation/swe_bench/scripts/run_infer.sh [model_config] [git-version] [agent] [eval_limit] [max_iter] [num_workers] [dataset] [dataset_split]

If the human user copies and pastes that line with their own data, it will fail to run the script, because of course the swe-bench run_infer.sh script has moved to ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh.

You're a smart LLM and you know patterns, remember. All these benchmarks are very similar and follow the same documentation patterns for human users. Verify every .md file (not only the README; check whether there are more) in each benchmark and update it for this particular move. Keep it minimal; only solve this particular issue.
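
Something like this should show which docs still need the update (a rough sketch; it only catches the script-path pattern, so a manual skim is still worthwhile):

# Markdown files under the benchmarks tree that still reference old-style script paths
grep -rnE --include='*.md' 'evaluation/[a-z_]+/scripts' evaluation/benchmarks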

@enyst enyst added fix-me Attempt to fix this issue with OpenHands and removed fix-me Attempt to fix this issue with OpenHands labels Nov 23, 2024
@openhands-agent
Contributor Author

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst
Collaborator

enyst commented Nov 23, 2024

Not all the README.md (or other .md) files discussed in the previous comment were updated for human users after the move of the benchmarks.

Some were, some were not. So we need to fix the rest.

But we have a problem. You're good, but you were not allowed to use the "str_replace_editor" tool when "old_str" is not unique in the file, so many of your attempted replacements were not performed. You had to go back and include more context. Then they were performed, but you ran out of time.

You need to understand this very well, so that we do better this time. Remember, we are refactoring the ./evaluation directory to house every benchmark under ./evaluation/benchmarks. Remember the previous comment about documentation for human users: this is what we fix now.

Usually, there was more than one occurrence of the pattern in each file (such as "/evaluation/swe_bench", to be updated to the new location). It is possible there were two occurrences or more, where one is the syntax of the command to run the benchmark and another is a particular example of running the command.

First, think how to do this better this time. You have two options:

  • you can take more context every time you want to use the str_replace_editor tool. Not a lot more, just more than the patterned string itself.
  • use the bash tool instead, with a bash command to perform the replacements in each .md file all at once. In this case, you need to find them first, to know what you need to do.

Make a decision, then perform it. Do not ask me about it.
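
For reference, the bash-tool route could look roughly like this (a hypothetical sketch using GNU sed and xargs -r; review the resulting diff, since a blind replacement can still miss edge cases):

# For every benchmark, rewrite old evaluation/<benchmark> paths in its Markdown docs in place
for dir in evaluation/benchmarks/*/; do
  name=$(basename "$dir")
  grep -rl --include='*.md' "evaluation/$name" "$dir" 2>/dev/null \
    | xargs -r sed -i "s|evaluation/$name|evaluation/benchmarks/$name|g"
done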

@enyst enyst added fix-me Attempt to fix this issue with OpenHands and removed fix-me Attempt to fix this issue with OpenHands labels Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent
Contributor Author

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst
Collaborator

enyst commented Nov 23, 2024

You're good! Your choice to run bash scripts was brilliant!

You fixed the rest of the documentation for human users in only 4 steps this time.

Now. You did very well, and I think all we have left is to double check that there are no leftover old paths. If there are, we need to fix them.

Leftovers could be in:

  • .py files. This is by far the most important! The module names can break everything. You did so well with the bash tool previously that I suggest you try it again to verify all imports in .py files, or any other occurrences of the patterns, in the benchmark files.
  • .sh scripts. Verify them and fix if necessary.
  • You can skip .md files this time, you did a very good verification last time.
  • Other files.

Remember to first look at all benchmarks, as we moved them from ./evaluation to ./evaluation/benchmarks/, so that you know what you're working with.

Check in order and update as needed; a possible first pass is sketched below.
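
Something in this direction should surface most leftovers (a rough sketch; hits still need a human look, and it only catches references preceded by from, import, or python):

# Python imports and shell invocations that still point at the old evaluation.<benchmark> / evaluation/<benchmark> layout;
# references to evaluation/benchmarks and the utility directories that stayed put are filtered out as still valid
grep -rnE --include='*.py' --include='*.sh' \
  '(from|import|python3?) +evaluation[./][a-z_]+' evaluation/benchmarks \
  | grep -vE '(from|import|python3?) +evaluation[./](benchmarks|utils|integration_tests|regression|static)'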

FINALLY:
When you finish with all of them, I want you to make an extra check:

  • in the root directory this time, there is a file called CREDITS.md. Read it. It probably doesn't have paths to update, that's okay, but it does have a list of benchmarks and some other details about them. Update the file according to what you have learned about our current benchmarks.

@enyst enyst added fix-me Attempt to fix this issue with OpenHands and removed fix-me Attempt to fix this issue with OpenHands labels Nov 23, 2024

OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent
Contributor Author

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.

@enyst
Collaborator

enyst commented Nov 23, 2024

@openhands-agent-exp Your bash skills may be impressive, but you got lazy last time and you forgot under what constraints you were supposed to be working! I had to revert your last attempt.

Be careful. We have only a couple things left to fix here:

  1. In our new ./evaluation/benchmarks directory, find and read carefully:
  • from Gaia, run_infer.sh
  • from mint, README.md
  • from swe-bench, eval_infer.sh
    You know how to find the exact files; note that they may be in a subdirectory. Read those files and, if they need fixes, fix them. I prefer you just take it easy with the str_replace_editor tool this time. Just be precise! They're probably half-updated, so make sure you don't overwrite what is already up to date. (A quick spot-check is sketched at the end of this comment.)
  2. In the root directory, there is a file called CREDITS.md. Read it. It probably doesn't have paths to update, and that's okay, but it does have a list of benchmarks and some other details about them. Update the file according to what you have learned about our current benchmarks. Do not duplicate references. Make sure you add missing ones.

Remember to list all benchmarks first, it helps you know what to work with.
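
A quick spot-check over those three could look like this (assuming the directories are named gaia, mint and swe_bench and use the same scripts/ layout as swe_bench; adjust the paths if the files live elsewhere):

# Old-style evaluation/<benchmark> references in the three half-updated files;
# the second grep drops lines that already use the new evaluation/benchmarks/ path
grep -nE 'evaluation/[a-z_]+/' \
  evaluation/benchmarks/gaia/scripts/run_infer.sh \
  evaluation/benchmarks/mint/README.md \
  evaluation/benchmarks/swe_bench/scripts/eval_infer.sh \
  | grep -vE ':[0-9]+:.*evaluation/benchmarks/'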


OpenHands started fixing the pr! You can monitor the progress here.

@openhands-agent
Contributor Author

New OpenHands update


The workflow to fix this issue encountered an error. Please check the workflow logs for more information.


@enyst enyst left a comment


This got a little too fun! 😅

  • I think there may still be some small misses. They should be easily fixable when encountered later though, up to you if you feel we need more rounds here. However:
  • IMHO this is where the eval workflow with no-matter-what-llm could come in very handy. Maybe for 1) one instance and 2) 2+ workers (because there are some pieces of code that only get executed if it's either a single worker or multiple workers).
  • I ran 1) locally and I'm trying to run 2) with run_infer. I didn't run eval_infer. I don't know why it seems slow/blocking today; I'm not succeeding with 2).
  • the CREDITS.md update failed spectacularly, twice [1]. I reverted it, and I don't think it has to be part of this PR. I was going to do a licensing review later anyway, and it will clearly be manual; I just got a bit too optimistic when I saw how well it kept track of all the benchmarks we have. 😅
  • Sonnet with bash tool is amazing!

Edited to add:
[1] Possibly one of the highest rates of hallucination tokens per total tokens I've seen 😹

@neubig
Contributor

neubig commented Nov 25, 2024

Hey @enyst , thanks so much for cleaning this up!


@xingyaoww xingyaoww left a comment


LGTM

@neubig neubig merged commit 678436d into main Nov 25, 2024
17 checks passed
@neubig neubig deleted the openhands-fix-issue-5222 branch November 25, 2024 13:35