Fix issue #5222: [Refactor]: Refactor the evaluation directory #5223
Conversation
…arks while keeping other directories directly under evaluation/
Just noting that I have confirmed the code and it looks good to me, but I'd like a second review.
I'm trying to run 1 instance of swe-bench on this PR, and I get this error:
The shell scripts (every run_infer.sh in the benchmarks directories) need to be updated to run the right .py file. For example, ./evaluation/benchmarks/swe-bench/scripts/run_infer.sh contains the line
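For illustration only, a hedged sketch of the kind of check being described; the swe_bench directory name and the old module path are assumptions, not quotes from the actual script:

```bash
# Hypothetical check: show any line in the moved script that still points at the
# old top-level module path instead of the new evaluation/benchmarks/ location.
grep -n 'evaluation/swe_bench/run_infer.py' \
  evaluation/benchmarks/swe_bench/scripts/run_infer.sh
```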
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
@openhands-agent The previous comment was fixed for swe-bench.
Read the script. We will find a line like:
This will fail because the new location is in the evaluation/benchmarks/webarena/ directory. You are a smart LLM, you understand patterns. The same pattern is repeated for these benchmarks, give or take that some have more files than others. Verify each benchmark's shell scripts and modify accordingly.
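A minimal sketch of how that repeated fix could be applied across all of the moved benchmarks, assuming GNU sed and assuming each benchmark's directory name matches its old path segment:

```bash
# For every benchmark now under evaluation/benchmarks/, rewrite stale references
# to the old evaluation/<name>/ location inside its shell scripts.
for dir in evaluation/benchmarks/*/; do
  name=$(basename "$dir")
  find "$dir" -name '*.sh' -exec \
    sed -i "s|evaluation/${name}/|evaluation/benchmarks/${name}/|g" {} +
done
```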
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
The previous comments were fixed for shell scripts. After the refactoring of benchmarks from the ./evaluation directory to ./evaluation/benchmarks/, it is important that human users are still able to run these benchmarks easily. In every benchmark directory, there should be a README.md file. For example, in ./evaluation/benchmarks/swe-bench there is a README.md with instructions on how to set up and run the swe-bench benchmark. You can read it and see that it has, for example, a line like this:
If the human user copies and pastes that line with their own data, it will fail to run the script, because of course the swe-bench run_infer.sh script has moved to ./evaluation/benchmarks/swe_bench/scripts/run_infer.sh. You're a smart LLM and you know patterns, remember. All these benchmarks are very similar and follow the same documentation patterns for human users. You can verify every .md file (not only README, check for more, maybe) in each benchmark and update it for this particular move. Keep it minimal; only solve this particular issue.
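A hedged sketch of the same kind of bulk update for the markdown docs; GNU sed is assumed, and the path pattern is an assumption about how the commands are written in those files:

```bash
# Update documentation commands in every benchmark's markdown files so that
# run_infer.sh (and sibling scripts) are referenced at their new location.
find evaluation/benchmarks -name '*.md' -exec \
  sed -i 's|\./evaluation/\([a-z_]*\)/scripts/|./evaluation/benchmarks/\1/scripts/|g' {} +
```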
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
Not all the README.md (or other .md) files discussed in the previous comment were updated for human users after the move of the benchmarks. Some were, some were not, so we need to fix the rest. But we have a problem. You're good, but you were not allowed to use the "str_replace_editor" tool when "old_str" is not unique in the file, so many of your attempted replacements were not performed. You had to go back and include more context. Then they were performed, but you ran out of time. You need to understand this very well, so that we do better this time.
Remember, we are refactoring the ./evaluation directory to house every benchmark under ./evaluation/benchmarks. Remember the previous comment about documentation for human users: this is what we fix now. Usually, there was more than one occurrence of the pattern in each file (such as "/evaluation/swe_bench" to be updated to the new location). It is possible there were two occurrences, or more, where one is the syntax of the command to run the benchmark and another is a particular example of running the command. First, think about how to do this better this time. You have two options:
Make a decision, then perform it. Do not ask me about it.
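To make the non-uniqueness problem concrete, a hypothetical count (the file path is an assumption) of how many README lines still contain an old path string; anything above 1 is exactly what made single string replacements ambiguous:

```bash
# Count the lines in one benchmark's README that still mention the old path; the
# command-syntax line and the worked example are often separate lines, so expect 2+.
grep -c 'evaluation/swe_bench' evaluation/benchmarks/swe_bench/README.md
```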
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
You're good! Your choice to run bash scripts was brilliant! You fixed the rest of the documentation for human users in only 4 steps this time. Now, you did very well, and I think all we have left is to double-check that there are no leftover old paths. If there are, we need to fix them. Leftovers could be in:
Remember to first look at all benchmarks, as we moved them from ./evaluation to ./evaluation/benchmarks/, so that you know what you are working with. Check in order and update as needed. FINALLY:
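A hedged sketch of that final sweep; the benchmark names and file extensions are assumptions, so adjust them to whatever the repository actually contains:

```bash
# List any remaining references to the old top-level benchmark locations in
# scripts, docs, workflows, and Python sources across the repository.
grep -rnE 'evaluation/(swe_bench|webarena)/' \
  --include='*.sh' --include='*.md' --include='*.yml' --include='*.py' .
```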
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
…ation directory" This reverts commit 4136c53.
@openhands-agent-exp Your bash skills may be impressive, but you got lazy last time and forgot what constraints you were supposed to be working under! I had to revert your last attempt. Be careful. We have only a couple of things left to fix here:
Remember to list all benchmarks first; it helps you know what to work with.
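Listing the moved benchmarks is a one-liner; this sketch just assumes the layout described earlier in the thread:

```bash
# Show every benchmark directory that now lives under evaluation/benchmarks/.
ls -d evaluation/benchmarks/*/
```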
New OpenHands update
The workflow to fix this issue encountered an error. Please check the workflow logs for more information.
fee148e to b28439d
This got a little too fun! 😅
- I think there may still be some small misses. They should be easily fixable when encountered later, though; up to you if you feel we need more rounds here. However:
- IMHO this is where the eval workflow with no-matter-what-llm could come in very handy. Maybe for 1) one instance and 2) 2+ workers (because there are some pieces of code that only get executed either if it's a single worker or multi).
- I ran 1 locally and I'm trying to run 2, `run_infer`. I didn't run `eval_infer`. I don't know why it seems slow/blocking today. I don't succeed with 2.
- The credits.md update failed spectacularly, twice [1]. I reverted it and I don't think it has to be part of this PR. I was going to do a licensing review later anyway, and it will be manual, clearly; I just got a bit too optimistic when I saw how well it kept track of all the benchmarks we have. 😅
- Sonnet with the `bash` tool is amazing!

Edited to add:
[1] Possibly one of the highest ratios of hallucination tokens to total tokens I've seen 😹
Hey @enyst, thanks so much for cleaning this up!
LGTM
This PR fixes #5222 by reorganizing the evaluation directory structure to improve clarity and maintainability.
Changes
- New `evaluation/benchmarks/` directory to house all ML literature benchmarks
- Other directories (`utils`, `integration_tests`, `regression`, `static`) kept directly under `evaluation/`
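For orientation, a sketch of the resulting layout implied by the list above (only directories named in this PR; the benchmark names are examples from the discussion):

```bash
# Expected layout after the move (illustrative, not exhaustive):
# evaluation/
# ├── benchmarks/
# │   ├── swe_bench/
# │   └── webarena/
# ├── integration_tests/
# ├── regression/
# ├── static/
# └── utils/
ls evaluation/ evaluation/benchmarks/
```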
Testing
Review Notes
Key files to review:
- `.github/workflows/eval-runner.yml` - Updated paths for integration tests and benchmarks
- `evaluation/README.md` - Added missing benchmarks and updated paths

To run this PR locally, use the following command: